
DATA ANALYSIS AND
INFORMATION PROCESSING

Edited by:
Jovan Pehcevski

ARCLER
P r e s s

www.arclerpress.com
Data Analysis and Information Processing
Jovan Pehcevski

Arcler Press
224 Shoreacres Road
Burlington, ON L7L 2H2
Canada
www.arclerpress.com
Email: orders@arclereducation.com

e-book Edition 2023

ISBN: 978-1-77469-579-1 (e-book)

This book contains information obtained from highly regarded resources. Reprinted
material sources are indicated. Copyright for individual articles remains with the
authors as indicated and published under a Creative Commons License. A wide variety of
references are listed. Reasonable efforts have been made to publish reliable data; the
views articulated in the chapters are those of the individual contributors and not
necessarily those of the editors or publishers. The editors and publishers are not responsible for
the accuracy of the information in the published chapters or the consequences of their use.
The publisher assumes no responsibility for any damage or grievance to persons or
property arising out of the use of any materials, instructions, methods or thoughts in the
book. The editors and the publisher have attempted to trace the copyright holders of all
material reproduced in this publication and apologize to copyright holders if permission
has not been obtained. If any copyright holder has not been acknowledged, please write
to us so we may rectify the omission.

Notice: Registered trademarks of products or corporate names are used only for explanation
and identification, without intent to infringe.

© 2023 Arcler Press

ISBN: 978-1-77469-526-5 (Hardcover)

Arcler Press publishes a wide variety of books and eBooks. For more information about
Arcler Press and its products, visit our website at www.arclerpress.com
DECLARATION

Some content or chapters in this book are open access, copyright-free published
research works, which are published under a Creative Commons License and are
indicated with a citation. We are thankful to the publishers and authors of this
content and these chapters, without whom this book would not have been possible.
ABOUT THE EDITOR

Jovan currently works as a presales Technology Consultant at Dell Technologies. He
is a results-oriented technology leader with demonstrated subject matter expertise in
planning, architecting and managing ICT solutions to reflect business objectives and
achieve operational excellence. Jovan has broad and deep technical knowledge in the fields
of data center and big data technologies, combined with a consultative selling approach
and exceptional client-facing presentation skills. Before joining Dell Technologies
in 2017, Jovan spent nearly a decade as a researcher, university professor and IT
business consultant. In these capacities, he served as a trusted advisor to a multitude
of customers in the financial services, health care, retail, and academic sectors. He holds
a PhD in Computer Science from RMIT University in Australia and worked as a
postdoctoral visiting scientist at the renowned INRIA research institute in France. He is a
proud father of two, an aspiring tennis player, and an avid Science Fiction/Fantasy book
reader.
TABLE OF CONTENTS

List of Contributors........................................................................................xv
List of Abbreviations..................................................................................... xxi
Preface.......................................................................................................xxiii

Section 1: Data Analytics Methods

Chapter 1 Data Analytics in Mental Healthcare.................................... 3


Abstract...................................................................................................... 3
Introduction................................................................................................ 4
Literature Review........................................................................................ 5
Mental Illness and its Type.......................................................................... 8
Effects of Mental Health on User Behavior................................................ 12
How Data Science Helps to Predict Mental Illness?.................................. 14
Conclusions.............................................................................................. 20
Acknowledgments.................................................................................... 20
References................................................................................................ 21

Chapter 2 Case Study on Data Analytics and Machine Learning Accuracy............... 25


Abstract.................................................................................................... 25
Introduction.............................................................................................. 26
Research Methodology............................................................................. 27
Cyber-Threat Dataset Selection................................................................. 29
ML Algorithms Selection............................................................................ 35
Accuracy of Machine Learning................................................................. 49
Conclusion............................................................................................... 51
Acknowledgements.................................................................................. 52
References................................................................................................ 53
Chapter 3 Data Modeling and Data Analytics: A Survey from a
Big Data Perspective................................................................................ 55
Abstract.................................................................................................... 55
Introduction.............................................................................................. 56
Data Modeling.......................................................................................... 58
Data Analytics.......................................................................................... 67
Discussion................................................................................................ 74
Related Work............................................................................................ 77
Conclusions.............................................................................................. 78
Acknowledgements.................................................................................. 80
References................................................................................................ 81

Chapter 4 Big Data Analytics for Business Intelligence in Accounting and Audit..... 87
Abstract.................................................................................................... 87
Introduction.............................................................................................. 88
Machine Learning..................................................................................... 90
Data Analytics.......................................................................................... 93
Data Visualization..................................................................................... 97
Conclusion............................................................................................... 98
Acknowledgements.................................................................................. 99
References.............................................................................................. 100

Chapter 5 Big Data Analytics in Immunology: A Knowledge-Based Approach....... 101


Abstract.................................................................................................. 101
Introduction............................................................................................ 102
Materials and Methods........................................................................... 106
Results and Discussion........................................................................... 110
Conclusions............................................................................................ 116
References.............................................................................................. 118

Section 2: Big Data Methods

Chapter 6 Integrated Real-Time Big Data Stream Sentiment Analysis Service........ 125
Abstract.................................................................................................. 125
Introduction............................................................................................ 126
Related Works........................................................................................ 130
Architecture of Big Data Stream Analytics Framework............................. 131

Sentiment Model.................................................................................... 133
Experiments............................................................................................ 142
Conclusions............................................................................................ 146
Acknowledgements................................................................................ 147
References.............................................................................................. 148

Chapter 7 The Influence of Big Data Analytics in the Industry............................... 153


Abstract.................................................................................................. 153
Introduction............................................................................................ 154
Status Quo Overview.............................................................................. 155
Big-Data Analysis................................................................................... 157
Conclusions............................................................................................ 165
References.............................................................................................. 166

Chapter 8 Big Data Usage in the Marketing Information System............................ 169


Abstract.................................................................................................. 169
Introduction............................................................................................ 170
The Use of Information on the Decision-Making Process in Marketing.... 171
Big Data................................................................................................. 171
Use of Big Data in the Marketing Information System............................. 173
Limitations.............................................................................................. 178
Final Considerations............................................................................... 180
References.............................................................................................. 183

Chapter 9 Big Data for Organizations: A Review.................................................... 187


Abstract.................................................................................................. 187
Introduction............................................................................................ 188
Big Data for Organizations..................................................................... 188
Big Data in Organizations and Information Systems................................ 192
Conclusion............................................................................................. 196
Acknowledgements................................................................................ 197
References.............................................................................................. 198

Chapter 10 Application Research of Big Data Technology in Audit Field................. 201


Abstract.................................................................................................. 201
Introduction............................................................................................ 202

Overview of Big Data Technology........................................................... 202
Requirements on Auditing in the Era of Big Data..................................... 203
Application of Big Data Technology in Audit Field.................................. 205
Risk Analysis of Big Data Audit............................................................... 210
Conclusion............................................................................................. 211
References.............................................................................................. 212

Section 3: Data Mining Methods

Chapter 11 A Short Review of Classification Algorithms Accuracy for Data
Prediction in Data Mining Applications................................................. 215
Abstract.................................................................................................. 215
Introduction............................................................................................ 216
Methods in Literature.............................................................................. 218
Results and Discussion........................................................................... 224
Conclusions and Future Work................................................................. 227
References.............................................................................................. 228

Chapter 12 Different Data Mining Approaches Based Medical Text Data................ 231
Abstract.................................................................................................. 231
Introduction............................................................................................ 232
Medical Text Data................................................................................... 232
Medical Text Data Mining....................................................................... 233
Discussion.............................................................................................. 246
Acknowledgments.................................................................................. 247
References.............................................................................................. 248

Chapter 13 Data Mining in Electronic Commerce: Benefits and Challenges............. 257


Abstract.................................................................................................. 257
Introduction............................................................................................ 258
Data Mining........................................................................................... 260
Some Common Data Mining Tools.......................................................... 261
Data Mining in E-Commerce.................................................................. 262
Benefits of Data Mining in E-Commerce................................................. 264
Challenges of Data Mining in E-Commerce............................................ 267
Summary and Conclusion....................................................................... 270
References.............................................................................................. 271

Chapter 14 Research on Realization of Petrophysical Data Mining
Based on Big Data Technology............................................................... 273
Abstract.................................................................................................. 273
Introduction............................................................................................ 274
Analysis of Big Data Mining of Petrophysical Data.................................. 274
Mining Based on K-Means Clustering Analysis........................................ 278
Conclusions............................................................................................ 283
Acknowledgements................................................................................ 284
References.............................................................................................. 285

Section 4: Information Processing Methods

Chapter 15 Application of Spatial Digital Information Fusion Technology in
Information Processing of National Traditional Sports........................... 289
Abstract.................................................................................................. 289
Introduction............................................................................................ 290
Related Work.......................................................................................... 291
Space Digital Fusion Technology............................................................ 293
Information Processing of National Traditional Sports Based
on Spatial Digital Information Fusion............................................ 305
Conclusion............................................................................................. 309
References.............................................................................................. 310

Chapter 16 Effects of Quality and Quantity of Information Processing
on Design Coordination Performance.................................................... 313
Abstract.................................................................................................. 313
Introduction............................................................................................ 314
Methods................................................................................................. 317
Data Analysis.......................................................................................... 320
Discussion.............................................................................................. 321
Conclusion............................................................................................. 322
References.............................................................................................. 324

Chapter 17 Neural Network Optimization Method and its Application
in Information Processing...................................................................... 327
Abstract.................................................................................................. 327
Introduction............................................................................................ 328

Neural Network Optimization Method and its Research in
Information Processing.................................................................. 330
Neural Network Optimization Method and its Experimental
Research In Information Processing............................................... 337
Neural Network Optimization Method and its Experimental
Research Analysis in Information Processing................................. 338
Conclusions............................................................................................ 347
Acknowledgments.................................................................................. 348
References.............................................................................................. 349

Chapter 18 Information Processing Features Can Detect Behavioral Regimes of
Dynamical Systems................................................................................ 353
Abstract.................................................................................................. 353
Introduction............................................................................................ 354
Methods................................................................................................. 356
Results.................................................................................................... 365
Discussion.............................................................................................. 380
Acknowledgments.................................................................................. 383
References.............................................................................................. 384

Index...................................................................................................... 389

LIST OF CONTRIBUTORS

Ayesha Kamran Ul Haq


National University of Computer and Emerging Sciences, Islamabad, Pakistan

Amira Khattak
Prince Sultan University, Riyadh, Saudi Arabia

Noreen Jamil
National University of Computer and Emerging Sciences, Islamabad, Pakistan

M. Asif Naeem
National University of Computer and Emerging Sciences, Islamabad, Pakistan
Auckland University of Technology, Auckland, New Zealand

Farhaan Mirza
Auckland University of Technology, Auckland, New Zealand

Abdullah Z. Alruhaymi
Department of Electrical Engineering and Computer Science, Howard University,
Washington D.C., USA

Charles J. Kim
Department of Electrical Engineering and Computer Science, Howard University,
Washington D.C., USA

André Ribeiro
INESC-ID/Instituto Superior Técnico, Lisbon, Portugal

Afonso Silva
INESC-ID/Instituto Superior Técnico, Lisbon, Portugal

Alberto Rodrigues da Silva


INESC-ID/Instituto Superior Técnico, Lisbon, Portugal

Mui Kim Chu


Singapore Institute of Technology, 10 Dover Drive, Singapore
Kevin Ow Yong
Singapore Institute of Technology, 10 Dover Drive, Singapore

Guang Lan Zhang


Department of Computer Science, Metropolitan College, Boston University, Boston,
MA 02215, USA

Jing Sun
Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston,
MA 02115, USA

Lou Chitkushev
Department of Computer Science, Metropolitan College, Boston University, Boston,
MA 02215, USA

Vladimir Brusic
Department of Computer Science, Metropolitan College, Boston University, Boston,
MA 02215, USA

Sun Sunnie Chung


Department of Electrical Engineering and Computer Science, Cleveland State
University, Cleveland, USA

Danielle Aring
Department of Electrical Engineering and Computer Science, Cleveland State
University, Cleveland, USA

Haya Smaya
Mechanical Engineering Faculty, Institute of Technology, MATE Hungarian University
of Agriculture and Life Science, Gödöllő, Hungary

Alexandre Borba Salvador


Faculdade de Administração, Economia e Ciências Contábeis, Universidade de São
Paulo, São Paulo, Brazil

Ana Akemi Ikeda


Faculdade de Administração, Economia e Ciências Contábeis, Universidade de São
Paulo, São Paulo, Brazil

Pwint Phyu Khine


School of Information and Communication Engineering, University of Science and
Technology Beijing (USTB), Beijing, China

Wang Zhao Shun
School of Information and Communication Engineering, University of Science and
Technology Beijing (USTB), Beijing, China
Beijing Key Laboratory of Knowledge Engineering for Material Science, Beijing, China

Guanfang Qiao
WUYIGE Certified Public Accountants LLP, Wuhan, China

Ibrahim Ba’abbad
Department of Information Technology, Faculty of Computing and Information
Technology, King Abdulaziz University, Jeddah, KSA.

Thamer Althubiti
Department of Information Technology, Faculty of Computing and Information
Technology, King Abdulaziz University, Jeddah, KSA.

Abdulmohsen Alharbi
Department of Information Technology, Faculty of Computing and Information
Technology, King Abdulaziz University, Jeddah, KSA.

Khalid Alfarsi
Department of Information Technology, Faculty of Computing and Information
Technology, King Abdulaziz University, Jeddah, KSA.

Saim Rasheed
Department of Information Technology, Faculty of Computing and Information
Technology, King Abdulaziz University, Jeddah, KSA.

Wenke Xiao
School of Medical Information Engineering, Chengdu University of Traditional Chinese
Medicine, Chengdu 611137, China

Lijia Jing
School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu
611137, China

Yaxin Xu
School of Medical Information Engineering, Chengdu University of Traditional Chinese
Medicine, Chengdu 611137, China

Shichao Zheng
School of Medical Information Engineering, Chengdu University of Traditional Chinese
Medicine, Chengdu 611137, China

Yanxiong Gan
School of Medical Information Engineering, Chengdu University of Traditional Chinese
Medicine, Chengdu 611137, China

Chuanbiao Wen
School of Medical Information Engineering, Chengdu University of Traditional Chinese
Medicine, Chengdu 611137, China

Mustapha Ismail
Management Information Systems Department, Cyprus International University,
Haspolat, Lefkoşa via Mersin, Turkey

Mohammed Mansur Ibrahim


Management Information Systems Department, Cyprus International University,
Haspolat, Lefkoşa via Mersin, Turkey

Zayyan Mahmoud Sanusi


Management Information Systems Department, Cyprus International University,
Haspolat, Lefkoşa via Mersin, Turkey

Muesser Nat
Management Information Systems Department, Cyprus International University,
Haspolat, Lefkoşa via Mersin, Turkey

Yu Ding
School of Computer Science, Yangtze University, Jingzhou, China
Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze
University), Ministry of Education, Wuhan, China

Rui Deng
Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze
University), Ministry of Education, Wuhan, China
School of Geophysics and Oil Resource, Yangtze University, Wuhan, China

Chao Zhu
The Internet and Information Center, Yangtze University, Jingzhou, China

Xiang Fu
School of Physical Education, Guangdong Polytechnic Normal University, Guangzhou
510000, China

Ye Zhang
School of Physical Education, Guangdong Polytechnic Normal University, Guangzhou
510000, China

Ling Qin
School of Physical Education, Guangdong Polytechnic Normal University, Guangzhou
510000, China

R. Zhang
Department of Quantity Survey, School of Construction Management and Real Estate,
Chongqing University, Chongqing, China.

A. M. M. Liu
Department of Real Estate and Construction, Faculty of Architecture, The University of
Hong Kong, Hong Kong, China

I. Y. S. Chan
Department of Real Estate and Construction, Faculty of Architecture, The University of
Hong Kong, Hong Kong, China

Pin Wang
School of Mechanical and Electrical Engineering, Shenzhen Polytechnic, Shenzhen
518055, Guangdong, China

Peng Wang
Garden Center, South China Botanical Garden, Chinese Academy of Sciences,
Guangzhou 510650, Guangdong, China

En Fan
Department of Computer Science and Engineering, Shaoxing University, Shaoxing
312000, Zhejiang, China

Rick Quax
Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands

Gregor Chliamovitch
Department of Computer Science, University of Geneva, Geneva, Switzerland

Alexandre Dupuis
Department of Computer Science, University of Geneva, Geneva, Switzerland

Jean-Luc Falcone
Department of Computer Science, University of Geneva, Geneva, Switzerland

Bastien Chopard
Department of Computer Science, University of Geneva, Geneva, Switzerland

Alfons G. Hoekstra
Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands
ITMO University, Saint Petersburg, Russia

Peter M. A. Sloot
Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands
ITMO University, Saint Petersburg, Russia
Complexity Institute, Nanyang Technological University, Singapore

LIST OF ABBREVIATIONS

EMC: The U.S.-based EMC company


NLP: Natural language processing
ANN: Artificial neural network
BP: Backpropagation
ROC: Receiver operating characteristic
TPR: True positive rate
FPR: False positive rate
AUC: Area under the curve
RF: Random forest
BN: Bayesian network
EHR: Electronic health record
ICD: International classification of diseases
SNOMED CT: Systematized nomenclature of medicine clinical terms
CPT: Current procedural terminology
DRG: Diagnosis-related groups
MeSH: Medical subject headings
LOINC: Logical observation identifiers names and codes
UMLS: Unified medical language system
MDDB: Main drug database
SVM: Support vector machine
RNN: Recurrent neural network
ID3: Iterative Dichotomiser 3
KCHS: Korean community health survey
ADR: Adverse drug reactions
ECG: Electrocardiographic
FNN: Factorization machine-supported neural network
PREFACE

Over the past few decades, the development of information systems in larger enterprises
was accompanied by the development of data storage technology. Initially, the
information systems of individual departments were developed independently of each
other, so that, for example, the finance department had a separate information system
from the human resources department. The so-called ‘information islands’ were created,
among which the flow of information was not established. If a company has offices in
more than one country, until recently it was the practice for each country to have a
separate information system, which was necessary due to differences in legislation,
local customs and the problem of remote customer support. Such systems often had
different data structures. The problem arose with reporting, as there was no easy way
to aggregate data from diverse information systems to get a picture of the state of the
entire enterprise.
The main task of information engineering was to merge separate information systems
into one logical unit, from which unified data can be obtained. The first step in the
unification process is to create a company model. This consists of the following steps:
defining data models, defining process models, identifying participants, and determining
the flow of information between participants and systems (data flow diagram).
The problem of the unavailability of information is, in practice, bigger than it may seem.
Certain types of businesses, especially non-profit-oriented ones, can operate with such
information islands. However, a large company that sells its main product on the market
at the wrong price for a month, because of inaccurate information obtained by management
from a poor information system, will surely find itself in trouble. The organization's
dependence on quality information from the business system grows with its size and
the geographical dislocation of its offices. Full automation of all business processes is
now the practical standard in some industries. Examples are airline reservation systems,
or car manufacturer systems where it is possible to start the production of car models
with accessories directly from the showroom according to the customer’s wishes. The
fashion industry, for example, must effectively follow fashion trends (analyze sales of
different clothing models by region) in order to respond quickly to changes in consumer
habits.
This edition covers different topics from data analysis and information processing,
including: data analytics methods, big data methods, data mining methods, and
information processing methods.
Section 1 focuses on data analytics methods, describing data analytics in mental
healthcare, a case study on data analytics and machine learning accuracy, a survey
from a big data perspective on data modeling and data analytics, big data analytics for
business intelligence in accounting and audit, and a knowledge-based approach for big
data analytics in immunology.
Section 2 focuses on big data methods, describing integrated real-time big data stream
sentiment analysis service, the influence of big data analytics in the industry, big data
usage in the marketing information system, a review of big data for organizations, and
application research of big data technology in audit field.
Section 3 focuses on data mining methods, describing a short review of classification
algorithms accuracy for data prediction in data mining applications, different data mining
approaches based on medical text data, the benefits and challenges of data mining in
electronic commerce, and research on realization of petrophysical data mining based on
big data technology.
Section 4 focuses on information processing methods, describing application of spatial
digital information fusion technology in information processing of national traditional
sports, the effects of quality and quantity of information processing on design
coordination performance, a neural network optimization method and its application in
information processing, and information processing features that can detect behavioral
regimes of dynamical systems.

SECTION 1:
DATA ANALYTICS METHODS
Chapter 1

Data Analytics in Mental Healthcare

Ayesha Kamran Ul Haq1, Amira Khattak2, Noreen Jamil1, M. Asif Naeem1,3, and Farhaan Mirza3

1 National University of Computer and Emerging Sciences, Islamabad, Pakistan
2 Prince Sultan University, Riyadh, Saudi Arabia
3 Auckland University of Technology, Auckland, New Zealand

ABSTRACT
Worldwide, about 700 million people are estimated to suffer from mental
illnesses. In recent years, due to the extensive growth rate in mental disorders,
it is essential to better understand the inadequate outcomes from mental
health problems. Mental health research is challenging given the perceived
limitations of ethical principles such as the protection of autonomy, consent,
threat, and damage. In this survey, we aimed to investigate studies where big
data approaches were used in mental illness and treatment. Firstly, different
types of mental illness, for instance, bipolar disorder, depression, and

personality disorders, are discussed. The effects of mental health on users' behavior,
such as suicide and drug addiction, are highlighted. A description of the methodologies
and tools used to predict the mental condition of the patient with the help of artificial
intelligence and machine learning is presented.

Citation: Ayesha Kamran Ul Haq, Amira Khattak, Noreen Jamil, M. Asif Naeem, Farhaan Mirza, "Data Analytics in Mental Healthcare", Scientific Programming, vol. 2020, Article ID 2024160, 9 pages, 2020. https://doi.org/10.1155/2020/2024160.

Copyright: © 2020 by Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION
Recently, the term "big data" has become exceedingly popular all over the world.
Over the last few years, big data has started to set foot in the healthcare
system. In this context, scientists have been working on improving public
health strategies, medical research, and the care provided to patients
by analyzing big datasets related to their health.
Data comes from different sources, including providers (pharmacy and
patient history) and nonproviders (cell phones and internet searches). One
of the outstanding possibilities of utilizing huge data is evident
in the healthcare industry. Healthcare organizations have a large quantity
of information available to them, and a big portion of it is unstructured and
clinically applicable. The use of big data is expected to grow in the medical
field, and it will continue to pose lucrative opportunities for solutions that can
help save the lives of patients. Big data needs to be interpreted correctly in
order to predict future data so that the final result can be estimated. To solve this
problem, researchers are working on AI algorithms that have a high impact
on the analysis of huge quantities of raw data and extract useful information
from it. There are varieties of AI algorithms that are used to predict patient
disease by observing past data. A variety of wearable sensors have been
developed to deal with both physical and social interactions practically.
The mental health of a person is measured by a high grade of affective disorder,
which results in major depression and different anxiety disorders. There are
many conditions which are recognized as mental disorders, including anxiety
disorder, depressive disorder, mood disorder, and personality disorder.
There are many mobile apps and smart devices, such as smartwatches and smart
bands, which increase healthcare facilities in mobile mental healthcare
systems. Personalized psychiatry also plays an important role in predicting
bipolar disorder, improving diagnosis, and optimizing treatment. Most
of the smart techniques are not pursued due to a lack of resources, especially
in underdeveloped countries. In Pakistan, for example, only 0.1% of the government
health budget is spent on the mental health system. There is a need
for an affordable solution to detect depression in Pakistan so that everyone
is able to pay attention to it.
Researchers are working on many machine learning algorithms to
analyze raw data and deduce meaningful information. It is now impossible
to manage data in healthcare with traditional database management tools, as
data is now in terabytes and petabytes. In this survey, we analyze different
issues related to mental healthcare through the usage of big data. We analyze different
mental disorders such as bipolar disease, opioid use disorder, personality
disorder, different anxiety disorders, and depression. Social media is one of
the biggest and most powerful resources for data collection, as 9 out of
10 people use social networking sites nowadays. Twitter is the main focus of
interest for most researchers, as people write 500,000 tweets on average per
minute. Twitter is used for sentiment analysis and opinion mining in
the business field in order to check the popularity of a product by observing
customer tweets. We have a lot of structured and unstructured data; in order
to reach any decision, the data must be processed and stored in a manner
that follows the same structure. We analyzed and compared the working
of different storage models under different conditions, such as MongoDB
and Hadoop, which are two different approaches to storing large amounts of
data. Hadoop works on cloud computing, which helps to accomplish different
operations on distributed data in a systematic manner.

In this survey, we discuss mental health problems and big data in
four further sections. The second section describes related work regarding
mental healthcare and the latest research on it. The third section describes
different types of mental illness and their solutions within data science.
The fourth section describes the different illegal issues faced by mental
patients and the early detection of these types of activities. The fifth section
describes different approaches of data science towards mental healthcare
systems, such as training and testing methods on health data for
early prediction, including supervised and unsupervised learning methods and
the artificial neural network (ANN).

LITERATURE REVIEW
There are many mental disorders, such as bipolar disorder, depression, and different
forms of anxiety. Bauer et al. [1] conducted a paper-based survey in which
1222 patients from 17 countries participated to detect bipolar disorder
in adults. The survey was translated into 12 different languages, with the
limitation that it did not contain any question about technology usage by
older adults. According to Bauer et al. [1], digital treatment is not suitable
for older adults with bipolar disorder.
Researchers are also working on an interesting and unique method of
checking the personality of a person just by looking at
the way he or she uses a mobile phone. De Montjoye [2] collected a
dataset from a US research university and created a framework that analyzed
phone calls and text messages to check the personality of the user. Participants
who made 300 calls or texts per year failed to complete personality measures.
They chose an optimal sample size of 69, with mean age = 30.4, S.D. = 6.1,
and 1 missing value. Similarly, Bleidorn and Hopwood [3] adopted a
comprehensive machine learning approach to test the personality of the user
using social media and digital records. The researchers provided nine main
recommendations for how to incorporate machine learning techniques to
enhance big five personality assessments. Focusing on minor details of the
user helps to comprehend and validate the results. Digital mental health has been
revolutionized and its innovations are growing at a high rate. The National
Health Service (NHS) has recognized its importance in mental healthcare
and is looking for innovations that provide services at low cost. Hill et al. [4]
presented a study of challenges and considerations in innovations in digital
mental healthcare. They also suggested collaboration between clinicians,
industry workers, and service users so that these challenges can be overcome
and successful innovations in e-therapies and digital apps can be developed.
There are many mobile apps and smart devices, such as smartwatches, smart
bands, and shirts, which increase healthcare facilities in the mobile healthcare
system. A variety of wearable sensors have been developed to deal with both
physical and social interactions practically. Combining artificial intelligence
with healthcare systems extends healthcare facilities to the next
level. Dimitrov [5] conducted a systematic survey on mobile internet of
things devices, which allow businesses to emerge, spread productivity
improvements, lock down costs, and intensify the customer experience
in a positive way. Similarly, Monteith et al. [6] performed a paper-based
survey on clinical data mining to analyze different data sources for
psychiatry data and to identify opportunities for psychiatry.
One of the machine learning models, the artificial neural network
(ANN), is based on a three-layer architecture. Kellmeyer [7] introduced
a way to secure big brain data from clinical and consumer-directed
neurotechnological devices using an ANN. However, this model needs to be trained
on a huge amount of data to get accurate results. Jiang et al. [8] designed
and developed a wearable device with multisensing capabilities, including
audio sensing, behavior monitoring, and environmental and physiological
sensing, that evaluated speech information and automatically deleted raw
data. Tested students were split into two groups based on whether their scores
were excessive. Participants were required to wear the device to ensure
the authenticity of the data. However, one of the major challenges of enabling
IoT in the device is safe communication.
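
To make the three-layer (input, hidden, output) ANN architecture mentioned above concrete, the following minimal sketch trains a small network on synthetic data with scikit-learn. It is illustrative only: the data, the 32-unit hidden layer, and the other settings are assumptions, not the configuration used in the studies cited here.

# Minimal sketch of a three-layer ANN (input layer, one hidden layer, output layer).
# Illustrative only: synthetic data and hyperparameters are assumptions, not the
# models used in the studies cited in this chapter.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic "health record" features and a binary illness label
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 32 units gives the classic three-layer architecture
ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
ann.fit(X_train, y_train)
print("Held-out accuracy:", ann.score(X_test, y_test))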
Yang et al. [9] invented an IoT-enabled wearable device for mental well-being,
along with some external equipment to record speech data. This portable
device is able to recognize the motion, pressure, and physiological status
of a person. There are many technologies that produce tracking data, such
as smartphones, credit cards, websites, social media, and sensors, offering
benefits. Monteith and Glenn [10] elaborated on some kinds of data generated
by human-made algorithms, such as searching for disease symptoms, visiting
disease websites, sending and receiving healthcare e-mail, and sharing health
information on social media. Based on the perceived data, the system makes
automated decisions without the involvement of the user in order to maintain
security.
Considering all the above issues, there is a need for proper treatment of
a disordered person. The mood of the patient is one of the parameters used to
detect his or her mental health, and public mood is hugely reflected in social
media, as almost everyone uses social media in this modern era. Goyal [11]
introduced a procedure in which tweets are filtered for specific keywords from
saved databases regarding the food price crisis. The data is trained using two
algorithms, K-nearest neighbor and Naïve Bayes. Cloud storage is the best
option to store huge amounts of unstructured data, and Kumar and Bala [12]
proposed functionalities of Hadoop for the automatic processing and storage
of big data. Dhaka and Johari [13] presented an implementation of the big data
tool MongoDB for analyzing statistics related to world mental healthcare.
The data is further analyzed using genetic algorithms for different mental
disorders and deployed again in MongoDB for extracting the final data.
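
A rough sketch of this kind of pipeline, keyword filtering followed by Naïve Bayes and K-nearest-neighbor classification, is given below. The tweets, keywords, and labels are invented for illustration and are not the data or exact procedure of Goyal [11].

# Illustrative sketch: filter short texts by keyword, then classify them with
# Naive Bayes and K-nearest neighbors (scikit-learn). All data is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

tweets = [
    "food prices are rising again, cannot afford basics",
    "great harvest this year, prices stable",
    "the price of wheat doubled, very worried",
    "enjoyed a nice meal out today",
]
labels = [1, 0, 1, 0]  # 1 = concern about food prices, 0 = neutral (assumed labels)

keywords = ("price", "prices")
kept = [t for t in tweets if any(k in t for k in keywords)]  # keyword filter step
print("Tweets kept after keyword filter:", len(kept))

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)

nb = MultinomialNB().fit(X, labels)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)

new = vectorizer.transform(["bread price is too high this week"])
print("Naive Bayes:", nb.predict(new), "KNN:", knn.predict(new))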
However, all of the above methods are of little use without user involvement.
De Beurs et al. [14] introduced an expert-driven method, intervention mapping,
and scrum methods, which may help to increase the involvement of users.
This approach tried to develop user-focused design strategies for the growth
of web-based mental healthcare under finite resources. Turner et al. [15]
elaborated in their article that the amount of big data available for use in
automated decision-making is doubling in size every two years. Passos
et al. [16] believed that the long-established connection between doctor and
patient will change with the establishment of big data and machine learning
models. An ML algorithm can allow an affected person to observe his fitness
from time to time and inform the doctor about his current condition if it
worsens. Early consultation with the doctor could prevent a bigger
loss for the patient.
If a psychiatric disease is not predicted or handled early, it can push the
patient into harmful and illegal activities such as suicide, as most suicide
attempts are related to mental disorders. Kessler et al. [17] proposed a
meta-analysis that focused on suicide incidence within one year of self-harm
using machine learning algorithms. They analyzed the past reports of suicide
patients and concluded that no prediction could be made due to the short
duration of psychiatric hospitalizations. Although a number of AI algorithms
are used to estimate patient disease by observing past data, the focus of all
these studies was suicide prediction by setting up a threshold. Defining a
threshold is a crucial point and is sometimes even impossible. Cleland et al. [18]
reviewed many studies but were unable to discover principles for clarifying
the threshold. The authors used a random-effects model to generate a
meta-analytic ROC curve. On the basis of correlation results, it is stated that
depression prevalence is a mediating factor between economic deprivation
and antidepressant prescribing.
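
Because threshold selection is central to these risk-prediction studies, the sketch below shows one generic way to choose a decision threshold from an ROC curve, using synthetic data and Youden's J statistic as an assumed criterion. It is not the procedure of Kessler et al. [17] or Cleland et al. [18].

# Illustrative sketch: choose a classification threshold from an ROC curve
# using Youden's J statistic (TPR - FPR). Synthetic data; not the cited studies' method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)
best = np.argmax(tpr - fpr)  # index that maximizes Youden's J
print("AUC:", round(roc_auc_score(y_te, scores), 3),
      "chosen threshold:", round(float(thresholds[best]), 3))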
Another side effect of mental disease is drug addiction. Early drug
prediction is possible by analyzing user data. Opioids are a severe type of drug.
Hasan et al. [19] explored the Massachusetts All Payer Claims Data (MA
APCD) dataset and examined how naïve users develop opioid use disorder.
Popular machine learning algorithms were tested to predict the risk of such
dependency in patients. Perdue et al. [20] predicted the ratio of drug
abusers by comparing Google Trends data with Monitoring the Future (MTF)
data in a well-structured study. It was concluded that Google Trends
and MTF data together provided support for detecting drug abuse.

MENTAL ILLNESS AND ITS TYPE

Depression and Bipolar Disorder


Bipolar disorder is also known as the worst form of depression. In Table
1, Bauer et al. [1] conducted a survey to assess bipolar disorder in adults.
Data was collected from 187 older adults and 1021 younger adults, with
missing observations excluded. The survey contained 39 questions and took
20 minutes to complete. Older adults with bipolar disorder used the internet
less regularly than younger ones. As most healthcare services are now
available online and digital tools and devices continue to evolve, the survey
has the limitation that it did not contain any question about technology usage
among older adults. There is a need for proper treatment of a disordered
person, and the mood of the patient is one of the parameters used to assess
his or her mental health. Table 1 describes another approach to personality
assessment using machine learning that focused on other aspects, such as
systematic fulfillment, and argued for enhancing the validity of the machine
learning (ML) approach. Technological advancement in the medical field will
promote personalized treatments. A lot of work has been done in the field of
depression detection using social networks.

Table 1: Types of mental illness and role of big data

Authors: Bauer et al. [1]
Discipline(s) reviewed: Bipolar disorder
Keywords used to identify papers for review: Bipolar disorder, mental illness, and health literacy
Methodology: Paper-based survey
Number of papers reviewed: 68
Primary findings: 47% of older adults used the internet versus 87% of younger adults having bipolar disorder

Authors: Dhaka and Johari [13]
Discipline(s) reviewed: Mental disorder
Keywords used to identify papers for review: Mental health, disorders, and using MongoDB tool
Methodology: Genetic algorithm and MongoDB
Number of papers reviewed: 19
Primary findings: Analyzing and storing a large amount of data on MongoDB

Authors: Hill et al. [4]
Discipline(s) reviewed: Mental disorder
Keywords used to identify papers for review: Mental health, collaborative computing, and e-therapies
Methodology: (i) Online CBT platform; (ii) collaborative computing
Number of papers reviewed: 33
Primary findings: Developing a smartphone application for mental disorders and for improving e-therapies

Authors: Kumar and Bala [12]
Discipline(s) reviewed: Depression detection through social media
Keywords used to identify papers for review: Big data, Hadoop, sentiment analysis, social networks, and Twitter
Methodology: Sentiment analysis and saving data on Hadoop
Number of papers reviewed: 14
Primary findings: Analyzing Twitter users' views on a particular business product

Authors: Kellmeyer [7]
Discipline(s) reviewed: Big brain data
Keywords used to identify papers for review: Brain data, neurotechnology, big data, privacy, security, and machine learning
Methodology: (i) Machine learning; (ii) consumer-directed neurotechnological devices; (iii) combining expert knowledge with a bottom-up process
Number of papers reviewed: 77
Primary findings: (i) Maximizing medical knowledge; (ii) enhancing the security of devices and sheltering the privacy of personal brain data

Authors: De Montjoye [2]
Discipline(s) reviewed: Mobile phone and user personality
Keywords used to identify papers for review: Personality prediction, big data, big five personality prediction, carrier's log, and CDR
Methodology: (i) Entropy: detecting different categories; (ii) inter-event time: frequency of calls or texts between two users; (iii) AR coefficients: converting the list of calls and texts into a time series
Number of papers reviewed: 31
Primary findings: Analyzing phone calls and text messages under a five-factor model

Authors: Furnham [21]
Discipline(s) reviewed: Personality disorder
Keywords used to identify papers for review: Dark side, big five, facet analysis, dependence, and dutifulness
Methodology: Hogan 'dark side' measure (HDS); concept of dependent personality disorder (DPD)
Number of papers reviewed: 34
Primary findings: All of the personality disorders are strongly negatively associated with agreeableness (a type of kind, sympathetic, and cooperative personality)

Authors: Bleidorn and Hopwood [3]
Discipline(s) reviewed: Personality assessment
Keywords used to identify papers for review: Machine learning, personality assessment, big five, construct validation, and big data
Methodology: (i) Machine learning; (ii) prediction models; (iii) k-fold validation
Number of papers reviewed: 65
Primary findings: Focusing on other aspects like systematic fulfillment and arguing to enhance the validity of the machine learning (ML) approach

The main goal of personalized psychiatry is to predict bipolar disorder,
improve diagnosis, and optimize treatment. To achieve these goals, it is
necessary to combine the clinical variables of a patient, as Figure 1 describes
the integration of all these variables. It is now impossible to manage data in
mental healthcare with traditional database management tools, as data is now
in terabytes and petabytes. So, there is a high need to introduce big data
analytics tools and techniques to deal with such big data in order to improve
the quality of treatment, so that the overall cost of treatment can be reduced
throughout the world.

Figure 1: Goals of personalized treatment in bipolar disorder [22].


MongoDB is one of the tools used to handle big data. The data is further
analyzed using genetic algorithms for different mental disorders and
deployed again in MongoDB for extracting the final data. This approach
of mining data and extracting useful information reduces the overall cost of
treatment and provides strong support for clinical decisions. It helps doctors
give more accurate treatment for several mental disorders in less time and
at lower cost, using useful information extracted with the big data tool MongoDB
and a genetic algorithm.
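
As a rough illustration of how such records could be stored and summarized in MongoDB, the following sketch uses the pymongo driver. The connection string, database, collection, and field names are invented for the example and are not those of Dhaka and Johari [13].

# Illustrative sketch only: storing survey records in MongoDB and aggregating
# them with pymongo. Database, collection, and field names are invented.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB server
records = client["mental_health_demo"]["survey_records"]

records.insert_many([
    {"patient_id": 1, "age_group": "older", "disorder": "bipolar", "uses_internet": True},
    {"patient_id": 2, "age_group": "younger", "disorder": "bipolar", "uses_internet": True},
    {"patient_id": 3, "age_group": "older", "disorder": "depression", "uses_internet": False},
])

# Share of internet users per age group, computed inside MongoDB
pipeline = [
    {"$group": {"_id": "$age_group",
                "internet_rate": {"$avg": {"$cond": ["$uses_internet", 1, 0]}}}},
]
for row in records.aggregate(pipeline):
    print(row)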
Table 1 lists some of the techniques used to handle and store huge amounts
of data. Using the MongoDB tool, researchers are working to predict the
mental condition of a patient before it reaches a severe stage. Some devices
therefore introduce a complete detection process to assess the present
condition of the user by analyzing his or her daily life routine. There is a need
for reasonable solutions that detect the disabling stage of a mental patient
more precisely and quickly.

Personality Disorder
Dutifulness is a type of personality disorder in which patients are overstressed
about a disease that is not actually serious. People with this type of
disorder tend to work hard to impress others. A survey was conducted to
find the relationship between normal and dutiful personalities. Other
researchers are working on an interesting and unique method of
checking the personality of a person simply by looking at the way he or she
uses a mobile phone. This approach provides cost-effective and
questionnaire-free personality detection through mobile phone data,
performing personality assessment without conducting any digital survey on
social media. Performing all nine main aspects of construct validation in real
time is not easy for researchers. This examination, like several others, has
limitations: it is just a sample, which has implications for generalization
when the method is used in a near-real-time scenario, which may be tough
for researchers.

EFFECTS OF MENTAL HEALTH ON USER BEHAVIOR
Mental illness involves an upswing in feelings of helplessness, danger, fear,
and sadness. People do not understand their current situation, and this can
push psychiatric patients into illegal activities. Table 2 describes some issues
that arise because of mental disorders, such as suicide, drug abuse, and
opioid use, as follows.

Table 2: Side effects of mental illness and their solution through data science

Authors: Kessler et al. [17]
Side effects of mental disorder: Suicide and mental illness
Tools/techniques: Machine learning algorithm
Primary findings: Predicting suicide risk at hospitalization

Authors: Cleland et al. [18]
Side effects of mental disorder: Antidepressant usage
Tools/techniques: Clustering analysis based on behavior and disease
Primary findings: Identifying the correlation between antidepressant usage and deprivation

Authors: Perdue et al. [20]
Side effects of mental disorder: Drug abuse
Tools/techniques: (i) Google search history; (ii) Monitoring the Future (MTF)
Primary findings: Providing real-time data that may allow us to predict drug abuse tendency and respond more quickly

Authors: Hasan et al. [19]
Side effects of mental disorder: Opioid use disorder
Tools/techniques: (i) Feature engineering; (ii) logistic regression; (iii) random forest; (iv) gradient boosting
Primary findings: Suppressing the increasing rate of opioid addiction using machine learning algorithms

Suicide
Suicide is very common in underdeveloped countries. According to
researchers, someone dies by suicide every 40 seconds somewhere in the
world. There are some areas of the world where mental disorder and
suicide statistics are relatively higher than in other areas.
Psychiatrists say that 90% of people who died by suicide faced a mental
disorder. Electronic medical records and big data can be used to predict
suicide through machine learning algorithms. Machine learning algorithms
can be used to predict suicides in depressed persons; it is hard to estimate how
accurately they perform, but they may help a consultant to pretreat patients
based on early prediction. Various studies depict the fact that a range of
factors, such as high levels of antidepressant prescribing, cause such
prevalence of illness. Some people start antidepressant medicine to overcome
mental affliction. In Table 2, Cleland et al. [18] explored three main factors, i.e.,
economic deprivation, depression prevalence, and antidepressant prescribing,
and their correlations. Several statistical tools could be used, such as Jupyter
Notebook, Pandas, NumPy, Matplotlib, Seaborn, and ipyleaflet, for the creation
of a pipeline. Correlations are analyzed using Pearson's correlation
coefficients and p values. The analysis shows a strong correlation between
economic deprivation and antidepressant prescribing, whereas it shows a weak
correlation between economic deprivation and depression prevalence.
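
A minimal sketch of this kind of correlation analysis with Pandas and SciPy is shown below; the region-level figures are invented placeholders, not the data analyzed by Cleland et al. [18].

# Illustrative sketch: Pearson correlations between deprivation, depression
# prevalence, and antidepressant prescribing. The numbers are invented.
import pandas as pd
from scipy.stats import pearsonr

regions = pd.DataFrame({
    "economic_deprivation":       [12.0, 30.5, 22.1, 40.8, 18.3],
    "depression_prevalence":      [8.1, 9.0, 8.7, 9.4, 8.5],
    "antidepressant_prescribing": [55.2, 80.1, 66.3, 95.7, 60.4],
})

print(regions.corr(method="pearson"))  # full correlation matrix

r, p = pearsonr(regions["economic_deprivation"],
                regions["antidepressant_prescribing"])
print(f"deprivation vs prescribing: r = {r:.2f}, p = {p:.3f}")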

Drug Abuse
People take drugs voluntarily, but most of them become addicted in
order to get rid of all their problems and feel relaxed. Adderall, Salvia divinorum,
Snus, synthetic marijuana, and bath salts are among the novel drugs. Opioids are a
category of drug that includes the illegal drug heroin. Hasan et al. [19]
compared four machine learning algorithms, logistic regression, random
forest, decision tree, and gradient boosting, to predict the risk of opioid
use disorder. Random forest is one of the best classification methods
among machine learning algorithms. It was found that in such situations
random forest models outperform the other three algorithms, especially for
determining the important features. There is another approach to predicting drug
abusers using the search history of the user. Perdue et al. [20] predicted the ratio
of drug abusers by comparing Google Trends data with Monitoring the Future (MTF)
data in a well-structured study. It was concluded that Google Trends
and MTF data together provided support for detecting drug abuse.
Google Trends appears to be a particularly useful data source for novel
drugs because Google is the first place where many users, especially adults,
go for information on topics with which they are unfamiliar. Google Trends
does not predict heroin abuse well; the reason may be that heroin is uniquely
dangerous compared with other drugs. According to Granka [23], internet
searches can be understood as behavioral measures of an individual's interest
in an issue. Unfortunately, this technique is not always convenient, as drug
abuse researchers are unable to predict drug abuse successfully when the
search data are sparse.
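The kind of four-model comparison described for opioid risk prediction can be sketched as below. This is not the framework of Hasan et al. [19]; the file name, feature columns, and the binary label "oud_risk" are hypothetical placeholders, and scikit-learn is used purely for illustration.

# Minimal sketch of comparing the four classifiers mentioned above on a
# tabular risk-prediction task. Data set and label name are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("opioid_claims.csv")
X, y = df.drop(columns=["oud_risk"]), df["oud_risk"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: accuracy = {acc:.3f}")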

HOW DATA SCIENCE HELPS TO PREDICT MENTAL ILLNESS?
Currently, numerous mobile clinical devices are embedded in patients' personal
body networks and medical equipment. They receive and transmit massive amounts
of heterogeneous fitness records to healthcare information systems for patient
evaluation. In this context, machine learning and data mining strategies have become
extremely important in many real-life problems, and many of these techniques have
been developed for health data collection and processing on mobile devices.

There is a lot of data in the world of medicine, coming from different sources
such as pharmacies, patient histories, and nonproviders (cell phones and internet
searches). Big data needs to be interpreted in order to predict future outcomes, test
hypotheses, and draw conclusions. Psychiatrists should be able to evaluate results
from research studies and commercial analytical products that are based on big data.

Artificial Intelligence and Big Data


Big data collected from wearable tracking devices and electronic records
accumulates into extensive stores of information. Smart mobile apps support
fitness and health education, heart attack prediction, ECG calculation, emotion
detection, symptom tracking, and disease management. Mobile apps can improve
the connection between patients and doctors. Once a patient's data from different
resources is organized into a proper structure, an artificial intelligence (AI)
algorithm can be used. AI then recognizes patterns, finds similarities between
them, and makes predictive recommendations based on what happened to others
in the same condition.

Techniques used for healthcare data processing can be broadly categorized
into two classes: nonartificial intelligence systems and artificial intelligence
systems. Although non-AI techniques are less complex, they suffer from a lack of
convergence and give less accurate results than AI techniques; AI methods are
therefore preferable to non-AI techniques. In Table 3, Dimitrov [5] combined
artificial intelligence with IoT technology in existing healthcare apps so that the
connection between doctors and patients remains balanced. Disease prediction is
also possible through machine learning. Figure 2 shows the hierarchical structure
of AI, ML, and neural networks.

Table 3: Data analytics and predicting mental health

Authors | Tool/technology | Methodology | Purpose of finding | Strength | Weakness
Dimitrov [5] | (i) Sensing technology; (ii) artificial intelligence | Emergence of the medical internet of things (mIoT) in existing mobile apps | Providing benefits to the customers: (i) avoiding chronic and diet-related illness; (ii) improving cognitive function | (i) Achieving improved mental health; (ii) improving lifestyles in real time decision-making | Adding up garbage data to the sensors
Monteith et al. [6] | Survey based approach | Clinical data mining | Analyzing different data sources to get psychiatry data | Optimized precedence opportunities for psychiatry | N/A
Kellmeyer [7] | Neurotechnology | (i) Machine learning; (ii) consumer-directed neurotechnological devices; (iii) combining expert knowledge with a bottom-up process | Enhancing the security of devices and sheltering the privacy of personal brain data | Maximizing medical knowledge | Model needs a huge amount of training data while brain disease data is rarely captured
Yang et al. [9] | Long-term monitoring wearable device with internet of things | Well-being questionnaires with a group of students | Developing app-based devices linked to android phones and servers for data visualization, monitoring, and environment sensing | Works well on long-term data | Offline data transfer instead of real time
Monteith and Glenn [10] | Automated decision-making | Hybrid algorithm that combines statistical focus and data mining | Tracking day-to-day behavior of the user by automatic decision-making | Automatically detecting human decisions without any input | How to ignore irrelevant information is a key headache
De Beurs et al. [14] | Online intervention | Expert-driven method: intervention mapping and Scrum | Increasing user involvement under limited resources | Standardizing the level of user involvement in the web-based healthcare system | Deciding the threshold for user involvement is problematic
Kumar and Bala [12] | Hadoop | Doing sentiment analysis and saving data on Hadoop | Analyzing twitter users' views on a particular business product | Checking out the popularity of a particular service | Usage of two programming languages needs experts
Goyal [11] | KNN and Naïve Bayes classifier | Text mining and hybrid approach combining KNN and Naïve Bayes | Opinion mining of tweets related to food price crisis | Cost-effective way to predict prices | Data needs to be cleaned before training

Figure 2: AI and ML [24].


One of the machine learning algorithms, the artificial neural network (ANN),
is based on a three-layer architecture. Kellmeyer [7] introduced a way to secure
big brain data from neurotechnological devices using an ANN. The algorithm
works on a huge amount of training data to predict accurate results, but patients'
brain diseases are rare, so training models on small data may produce imprecise
results. Machine learning models are data hungry: to obtain accurate results as
output, they need to be trained on more data with distinct features as input. These
new methods cannot always be applied to clinical data because of limited
economic resources.
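A three-layer network of this kind can be illustrated with scikit-learn's MLPClassifier, as in the sketch below. This is not the model used in [7]; the data is synthetic and generated purely for illustration.

# Minimal sketch of a three-layer neural network (input, one hidden layer,
# output). Synthetic data stands in for real clinical or brain data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One hidden layer of 32 units between the input and output layers
ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
ann.fit(X_tr, y_tr)
print("test accuracy:", ann.score(X_te, y_te))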

Prediction through Smart Devices


Various monitoring wearable devices (Table 3) are available that continuously
capture the finer details of behavior and provide important cues about fear
and autism. This information helps to recognize mental issues in the users of
those devices. Participants were monitored continuously for a month.
High-level computation on the voice requires high-complexity data as well as
high computational power, which puts a heavy load on the small chip; in order
to overcome the power issues, a relatively low sampling frequency was chosen.
Yang et al. [9] developed an audio well-being device and conducted a
survey in which participants had to speak for more than 10 minutes in a quiet
room. The first step is to check the validity of the sample by having the
participants complete some questionnaires (including the STAI, NEO-FFI,
and AQ). In order to determine whether they were suitable for the experiment,
a test based on the AQ questions was conducted, and a classification algorithm
was applied to the AQ data. This type of device has one advantage: it works
well on long-term rather than short-term data, but the authors used offline
data transfer instead of real time.

Although the device has different sensors, garbage data can easily be added
to the sensors. This is an application that offers on-hand record management
using mobile/tablet technology once security and privacy are confirmed. To
increase the reliability of IoT devices, there is a need to increase the sample
size with different age groups in a real time environment to check the validity
of the experiment.
There are many technologies that generate tracking data, such as smartphones,
credit cards, social media, and sensors. This paper discusses some of the existing
work to tackle such data. In Table 3, one of the approaches is a human-made
algorithm; searching for disease symptoms, visiting disease websites,
sending/receiving healthcare e-mail, and sharing health information on social
media all generate this kind of data. These are some examples of activities that
play key roles in producing medical data.

Role of Social Media to Predict Mental Illness


The constant mood of the patient is one of the parameters used to detect his/her
mental health. According to Lenhart et al. [25], almost four out of five internet
users use social media. In Table 3, researchers used twitter data to get online
user reviews that help the seeker check the popularity of a particular service or
product. In order to collect people's opinions on Airtel, they analyzed the related
tweets. Keyword filtering is done using filter-by-content and filter-by-location;
the location filter works on a specific bounding area. First of all, special
characters, URLs, spam, and short words are removed from the tweets. Secondly,
the remaining words from the tweets are tokenized and a TF-IDF score is
calculated for all the keywords. After cleaning the data, the K nearest neighbor
and Naïve Bayes classification algorithms are applied to the text in order to
extract features. Although the hybrid recommendation system provides 76.31%
accuracy, Naïve Bayes alone provides 66.66%. In the end, an automated system
is designed for opinion mining.
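A minimal sketch of this cleaning, TF-IDF, and classification pipeline is shown below. It is not the system described in [11] or [12]; the example tweets, sentiment labels, and regular expression are made-up placeholders used only to illustrate the steps named above.

# Minimal sketch of the tweet-cleaning and classification pipeline described
# above. Example tweets and labels are made-up placeholders.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

tweets = ["Airtel network is great today!", "Worst service ever @airtel http://t.co/x"]
labels = [1, 0]  # 1 = positive opinion, 0 = negative opinion

def clean(text):
    text = re.sub(r"http\S+|@\w+|[^A-Za-z ]", " ", text)   # strip URLs, mentions, symbols
    return " ".join(w for w in text.lower().split() if len(w) > 2)  # drop short words

X = TfidfVectorizer().fit_transform(clean(t) for t in tweets)  # tokenize + TF-IDF

knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
nb = MultinomialNB().fit(X, labels)
print(knn.predict(X), nb.predict(X))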
Another point of consideration is that Twitter data is unstructured, so handling
such a huge amount of unstructured data is a tedious task. Due to the lack of a
schema, it is difficult to handle and store unstructured data, and storage is needed
to hold a significant amount of data for processing; cloud storage is the best option
for such material. The entire program is designed in Python so that it can catch all
possible outcomes. Hadoop works on cloud computing and helps to carry out
different operations on distributed data in a systematic manner. The success rate
of the above approach was around 70%, but the authors carried out these tasks
using two programming languages: Python code to extract the tweets and Java
to train the data, which requires expert programmers in each language. Such a
system can help doctors to give more accurate treatment for several mental
disorders in less time and at low cost. In effect, this approach provides early
detection of depression, which may spare the patient the worst stages of mental
illness.

Key Challenges to Big Data Approach


(i) Big data has many ethical issues related to privacy, reuse without
permission, and involvement of rival organizations.
(ii) To work in diverse areas, big data requires collaboration with experts in
the relevant fields, including physicians, biologists, and developers, which is a
crucial part of it. Data mining algorithms can be used to observe or predict data
more precisely than traditional clinical trials.
(iii) People may feel hesitant to describe everything to their doctors. One
solution for estimating severe mental illness ahead of time is automated
decision-making without human input, as shown in Table 3. It collects data from
our everyday behavior that feeds the digital economy. The key role of digital
provenance must be understood in order to appreciate the difficulties that
technology may create for people with mental illness.
(iv) There are many security issues when discussing sensitive information
online, as data may be revealed, so a new approach that provides privacy
protection as well as decision-making from big data through new technologies
needs to be introduced.
(v) Also, if online data is used to predict user personality, then keeping the
data secured and protected from hackers is a big challenge. A lot of cheap
solutions exist, but they are not reliable, especially from a user's perspective.
(vi) A major challenge for enabling IoT in devices is communication; all of the
above methods are useless without user involvement. The user is one of the main
parts of the experiment, especially if the user's personal or live data is required.
Although many web-based inventions related to mental health are being released,
active participation by end users remains limited. In Table 3, an expert-driven
method is introduced that is based on intervention mapping and Scrum methods.
It may help to increase the involvement of the users, but if all the users are
actively involved in the web-based healthcare system, then it becomes problematic.
(vii) When deciding on the level of user involvement, there is a need to weigh
user input against the accessibility of resources. It requires an active role from
technology companies and efficient use of time. Further research should provide
direction on how to select the best and most optimized user-focused design
strategies for the development of web-based mental health under limited resources.

CONCLUSIONS
Big data are being used for mental health research in many parts of the world
and for many different purposes. Data science is a rapidly evolving field that
offers many valuable applications to mental health research, examples of
which we have outlined in this perspective.
We discussed different types of mental disorders and their reasonable,
affordable, and feasible solutions for enhancing mental healthcare facilities.
Currently, the digital mental health revolution is expanding faster than the pace
of scientific evaluation, and it is clear that clinical communities need to catch up.
Various smart healthcare systems and devices have been developed that reduce
the death rate of mental patients and, through early prediction, prevent patients
from engaging in illegal activities.
This paper examines different prediction methods. Various machine
learning algorithms are popular for training on data in order to predict future
data; random forest, Naïve Bayes, and k-means clustering are popular ML
algorithms. Social media is one of the best sources of data gathering, as the mood
of the user also reveals his/her psychological behavior. In this survey, various
advances in data science and their impact on the smart healthcare system are the
points of consideration. It is concluded that there is a need for a cost-effective way
to predict mental condition instead of relying on costly devices. Twitter data is
utilized through saved and live tweets accessible via the application program
interface (API). In the future, connecting the twitter API with Python and then
applying sentiment analysis to the 'posts,' 'liked pages,' 'followed pages,' and
'comments' of a twitter user will provide a cost-effective way to detect depression
in target patients.

ACKNOWLEDGMENTS
The authors are thankful to Prince Sultan University for the financial support
towards the publication of this paper.

REFERENCES
1. R. Bauer, T. Glenn, S. Strejilevich et al., “Internet use by older adults
with bipolar disorder: international survey results,” International
Journal of Bipolar Disorders, vol. 6, no. 1, p. 20, 2018.
2. Y.-A. De Montjoye, J. Quoidbach, F. Robic, and A. Pentland,
“Predicting personality using novel mobile phone-based metrics,” in
Proceedings of the International Conference on Social Computing,
Behavioral-Cultural Modeling, and Prediction, pp. 48–55, Berlin,
Heidelberg, April 2013.
3. W. Bauer and C. J. Hopwood, “Using machine learning to advance
personality assessment and theory,” Personality and Social Psychology
Review, vol. 23, no. 2, pp. 190–203, 2019.
4. C. Hill, J. L. Martin, S. Thomson, N. Scott-Ram, H. Penfold, and C.
Creswell, “Navigating the challenges of digital health innovation:
considerations and solutions in developing online and smartphone-
application-based interventions for mental health disorders,” British
Journal of Psychiatry, vol. 211, no. 2, pp. 65–69, 2017.
5. D. V. Dimitrov, “Medical internet of things and big data in healthcare,”
Healthcare Informatics Research, vol. 22, no. 3, pp. 156–163, 2016.
6. S. Monteith, T. Glenn, J. Geddes, and M. Bauer, “Big data are coming
to psychiatry: a general introduction,” International Journal of Bipolar
Disorders, vol. 3, no. 1, p. 21, 2015.
7. P. Kellmeyer, “Big brain data: on the responsible use of brain data
from clinical and consumer-directed neurotechnological devices,”
Neuroethics, vol. 11, pp. 1–16, 2018.
8. L. Jiang, B. Gao, J. Gu et al., “Wearable long-term social sensing for
mental wellbeing,” IEEE Sensors Journal, vol. 19, no. 19, 2019.
9. S. Yang, B. Gao, L. Jiang et al., “IoT structured long-term wearable
social sensing for mental wellbeing,” IEEE Internet of Things Journal,
vol. 6, no. 2, pp. 3652–3662, 2018.
10. S. Monteith and T. Glenn, “Automated decision-making and big data:
concerns for people with mental illness,” Current Psychiatry Reports,
vol. 18, no. 12, p. 112, 2016.
11. S. Goyal, “Sentimental analysis of twitter data using text mining and
hybrid classification approach,” International Journal of Advance
Research, Ideas and Innovations in Technology, vol. 2, no. 5, pp.
2454–132X, 2016.

12. M. Kumar and A. Bala, “Analyzing twitter sentiments through big
data,” in Proceedings of the 2016 3rd International Conference on
Computing for Sustainable Global Development (INDIACom), pp.
2628–2631, New Delhi, India, March 2016.
13. P. Dhaka and R. Johari, “Big data application: study and archival of
mental health data, using MongoDB,” in Proceedings of the 2016
International Conference on Electrical, Electronics, and Optimization
Techniques (ICEEOT), pp. 3228–3232, Chennai, India, March 2016.
14. D. De Beurs, I. Van Bruinessen, J. Noordman, R. Friele, and S. Van
Dulmen, “Active involvement of end users when developing web-
based mental health interventions,” Frontiers in Psychiatry, vol. 8, p.
72, 2017.
15. V. Turner, J. F. Gantz, D. Reinsel, and S. Minton, “The digital universe
of opportunities: rich data and the increasing value of the internet of
things,” IDC Analyze the Future, vol. 16, 2014.
16. I. C. Passos, P. Ballester, J. V. Pinto, B. Mwangi, and F. Kapczinski, “Big
data and machine learning meet the health sciences,” in Personalized
Psychiatry, pp. 1–13, Springer, Cham, Switzerland, 2019.
17. R. C. Kessler, S. L. Bernecker, R. M. Bossarte et al., “The role of big
data analytics in predicting suicide,” in Personalized Psychiatry, pp.
77–98, Springer, Cham, Switzerland, 2019.
18. B. Cleland, J. Wallace, R. Bond et al., “Insights into antidepressant
prescribing using open health data,” Big Data Research, vol. 12, pp.
41–48, 2018.
19. M. M. Hasan, M. Noor-E-Alam, M. R. Patel, A. S. Modestino, G.
Young, and L. D. Sanchez, “A novel big data analytics framework to
predict the risk of opioid use disorder,” 2019.
20. R. T. Perdue, J. Hawdon, and K. M. Thames, “Can big data predict the
rise of novel drug abuse?” Journal of Drug Issues, vol. 48, no. 4, pp.
508–518, 2018.
21. A. Furnham, “A big five facet analysis of sub-clinical dependent
personality disorder (dutifulness),” Psychiatry Research, vol. 270, pp.
622–626, 2018.
22. E. Salagre, E. Vieta, and I. Grande, “Personalized treatment in bipolar
disorder,” in Personalized Psychiatry, pp. 423–436, Academic Press,
Cambridge, MA, USA, 2020.

23. L. Granka, “Inferring the public agenda from implicit query data,”
in Proceedings of the 32nd International ACM SIGIR Conference on
Research and Development in Information Retrieval, Boston, MA,
USA, July 2009.
24. V. Sinha, 2019, https://www.quora.com/What-are-the-main-
differences-between-artificial-intelligence-and-machine-learning-Is-
machine-learning-a-part-of-artificial-intelligence.
25. A. Lenhart, K. Purcell, A. Smith, and K. Zickuhr, “Social media &
mobile internet use among teens and young adults,” Pew Internet &
American Life Project, Washington, DC, USA, 2010.
Chapter 2

Case Study on Data Analytics and Machine Learning Accuracy

Abdullah Z. Alruhaymi, Charles J. Kim


Department of Electrical Engineering and Computer Science, Howard University,
Washington D.C, USA.

ABSTRACT
The information gained from data analysis is vital for implementing its
outcomes to optimize processes and systems for more straightforward
problem-solving. Therefore, the first step of data analytics deals with
identifying data requirements, mainly how the data should be grouped or
labeled. For example, for data about cybersecurity in organizations, grouping
can be done into categories such as denial of service (DoS), unauthorized
access from a local or remote machine, and surveillance and other probing.
Next, after identifying the groups, the researcher or whoever is carrying out
the data analytics goes out into the field and collects the data. The collected
data is then organized in an orderly fashion to enable easy analysis. We aim
to study different articles and compare the performance of each algorithm in
order to choose the most suitable classifiers.

Citation: Alruhaymi, A. and Kim, C. (2021), “Case Study on Data Analytics and Ma-
chine Learning Accuracy”. Journal of Data Analysis and Information Processing, 9,
249-270. doi: 10.4236/jdaip.2021.94015.
Copyright: © 2021 by authors and Scientific Research Publishing Inc. This work
is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0

Keywords: Data Analytics, Machine Learning, Accuracy, Cybersecurity,


Performance

INTRODUCTION
Data analytics is a branch of data science that involves the extraction of
insights from data to gain a better understanding. It entails all the techniques,
data tools, and processes involved in identifying trends and measurements
that would otherwise be lost in the enormous amount of information available
and constantly being generated in the world today. Grouping the dataset into
categories is an essential step of the analysis. Then, we clean up the data by
removing any instances of duplication and errors made during its collection.

In this step, there is also the identification of complete or incomplete
data and the implementation of the best technique to handle incomplete data.
Missing values lead to an incomplete dataset, which degrades machine learning
(ML) algorithms' performance and causes inaccuracy and misinterpretation.
Machine learning has emerged as a problem-solver for many existing
problems. Advancement in this field supports artificial intelligence (AI) in
many applications we use daily in real life. However, statistical models and
other technologies have fallen short of modern needs: they are unsuccessful at
handling categorical data, dealing with missing values, and coping with large
numbers of data points [1]. All these reasons raise the importance of machine
learning technology. Moreover, ML plays a vital role in many applications,
e.g., cyber detection, data mining, natural language processing, and even
disease diagnostics and medicine. In all these domains, we look for clues for
which ML offers possible solutions.
Since ML algorithms train with part of a dataset and test with the rest of it,
unless the missingness is entirely random, which rarely happens, missing
elements, especially in the training dataset, can lead to insufficient capture of
the entire population of the complete dataset. This, in turn, would lead to lower
performance on the test dataset. However, if reasonably close values replace
the missing elements, the performance on the imputed dataset would be
restored to nearly the level of the intact, complete dataset. Therefore, this
research intends to investigate the performance variation under different
numbers of missing elements and under two missingness mechanisms:
missing completely at random (MCAR) and missing at random (MAR).

Therefore, the objectives of this research dissertation are:


1) Investigation of the data analytic algorithms’ performance under
full dataset and incomplete dataset.
2) Imputation of a missing element in the incomplete dataset by
multiple imputation approaches to make imputed datasets.
3) Evaluation of the algorithms’ performance with imputed datasets.
A general distinction among most ML applications is deep learning, and the
data itself is the fuel of the process. Data analytics has evolved dramatically
over the last two decades, and hence more research in this area is inevitable.
Given its importance, the outlook for the analytics field and ML is promising.
Much research has been conducted on ML accuracy measurements and data
analysis, but little has been done on incomplete data, imputed data, and the
comparison of their outcomes. We aim to highlight the results drawn from
different dataset versions and the missingness observed in the dataset.
To create an incomplete dataset, we shall impose two types of missingness,
namely MCAR and MAR, on the complete dataset. MCAR missingness will be
created by inputting N/A into some variable entries to create an impression of
data missing by design. MAR missingness will also be generated by inputting
N/A into some cells of variables in the dataset; this will create an impression
that these incomplete variables are related to some other variables within the
dataset that are complete, and hence this will bring about the MAR missingness.
These incomplete datasets will then be imputed using multiple imputation by
chained equations (MICE) to make them complete by removing the two types
of missingness. The imputed dataset will then be used to train and test machine
learning algorithms. Lastly, the performance of the algorithms with the imputed
datasets will be duly compared to the performance of the same algorithms with
the initially complete dataset.
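The sketch below illustrates, under stated assumptions, how MCAR and MAR missingness could be injected into a numeric table. The dissertation performs this step in R; this Python illustration uses a toy DataFrame with hypothetical columns "src_bytes" and "count".

# Minimal sketch of injecting MCAR and MAR missingness into a numeric table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"src_bytes": rng.integers(0, 1000, 100),
                   "count": rng.integers(0, 50, 100)}).astype(float)

# MCAR: every cell of "src_bytes" has the same 10% chance of being removed.
mcar = df.copy()
mcar.loc[rng.random(len(df)) < 0.10, "src_bytes"] = np.nan

# MAR: "src_bytes" is removed with a probability that depends on the fully
# observed variable "count" (higher count -> more likely to be missing).
mar = df.copy()
p = 0.20 * (df["count"] / df["count"].max())
mar.loc[rng.random(len(df)) < p, "src_bytes"] = np.nan

print(mcar["src_bytes"].isna().mean(), mar["src_bytes"].isna().mean())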

RESEARCH METHODOLOGY
This article is one chapter of the whole dissertation work, and to achieve the
objectives of the dissertation, the following tasks are performed:
1) cyber-threat dataset selection
2) ML algorithms selection
3) investigation and literature review on the missingness mechanisms
of MCAR and MAR

4) investigation of imputation approaches


5) application of multiple imputation approaches
6) evaluation of the algorithms under different missing levels (10%,
20%, and 30%) and two missingness mechanisms.
The first two items are discussed below, and the other items are detailed
in the succeeding chapters. The proposed workflow of the research is as
shown in the following Figure 1.
The methodology for this paper is to analyze many online articles and compare
their accuracy results for later use against the selected dataset's performance;
from the many algorithms used, we select the ones we consider best suited to
cybersecurity dataset analysis. Four major machine learning algorithms will be
utilized to train and test using portions of the imputed dataset to measure its
performance in comparison with the complete dataset.

Figure 1: Dissertation workflow.


These will include decision tree, random forest, support vector machines, and
naïve bayes, which will be discussed in depth later in the dissertation. To create
the imputed dataset, the multiple imputation by chained equations (MICE)
method will be used, as we consider it a robust method for handling
missingness and therefore appropriate for this research. This method fills in the
missing data values by iterating prediction models, where in each iteration a
missing value is imputed by using the already complete variables to predict it.
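The dissertation applies the MICE package in R. As an illustrative analogue only, scikit-learn's IterativeImputer follows the same chained-equations idea, regressing each feature with missing values on the other features in turn; the small matrix below is synthetic.

# Illustrative analogue of MICE-style imputation (not the R "mice" package).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, np.nan],
              [8.0, 8.0, 9.0]])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)   # missing cells replaced by predictions
print(X_imputed)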

CYBER-THREAT DATASET SELECTION


The dataset selected for this research is the KDDsubset dataset from the cyber
security domain, a sample from KDDCUP'99 that consists of 494,021
records (or instances). As shown in Figure 2, attack types represent more
than 80% of the cyber dataset, and denial of service (DoS) is the most dangerous
kind. The data has 42 features, and the last feature (the 42nd)
is labeled as either normal or attack for the different sub-types of attacks
(22 in total), as shown in Figure 3 below.

Smurf has the highest count compared with other attacks. Since each instance
is labelled with only one specific attack, we can visualize from the figure the
classes that have different numbers of attacks and observe that smurf
is the most frequent attack. The simulated attacks fall into one of four
categories: denial of service (DoS), Probe, U2R, or R2L. The label column is
the connection type and is either attack or normal. The dataset is publicly
available and widely used in academic research; researchers often use the
KDDsubset as a sample of the whole KDDCUP'99 dataset, which consists of
nearly five million records, because it covers all attack types and is much easier
to experiment with. The 41 features are divided into four categories: basic, host,
traffic, and content. Feature number 2, for example, named protocol_type, has
only 3 kinds, and the most used protocol type is ICMP; most of its records
are of the attack type, as shown below in Figure 4.

Figure 2: KDDsubset count of attack and normal.



Figure 3: Attack sub-types of count.

Figure 4: Protocol type has a lot of attacks in the ICMP.


The main problem with the KDDsubset is that it may contain redundant
records, which is not ideal when building a data analysis algorithm
because it makes the model biased. Cyber security experts have developed
advanced techniques to explore technologies that detect cyber-attacks using
all of the DARPA 1998 dataset used for intrusion detection. Improved versions
of this are the 10% KDDCUP'99, NSL-KDD Cup, and Gure KDDCUP
databases. The KDDCUP'99 dataset was used in the Third International
Knowledge Discovery and Data Mining Tools Competition, which was held
in conjunction with KDDCUP'99 and the Fifth International Conference on
Knowledge Discovery and Data Mining. The competition task was to build
a network intrusion detector, a predictive model capable of distinguishing
between “bad” connections, called intrusions or attacks, and “good” normal
connections. This database contains a standard set of data to be audited,
which includes a wide variety of intrusions simulated in a military network
environment [2]. The KDDCUP'99 dataset contains around 4,900,000 single
connection vectors, each of which includes 41 attributes and a category label
such as attack or normal [3], with precisely one specified attack-type from
four main types of attacks:
1) Denial of service (DoS): The use of excess resources denies legit
requests from legal users on the system.
2) Remote to local (R2L): Attacker having no account gains a legal
user account on the victim’s machine by sending packets over the
networks.
3) User to root (U2R): Attacker tries to access restricted privileges
of the machine.
4) Probe: Attacks that can automatically scan a network of computers
to gather information or find any unknown vulnerabilities.
All the 41 features are also grouped into four listed types:
1) Basic features: These characteristics tend to be derived from
packet headers without analyzing the payload.
2) Content features: Analyzing the actual TCP packet payload using
domain knowledge; this encompasses features such as the number
of unsuccessful login attempts.
3) Time-based traffic features: These features are created to capture
properties accruing over a 2-second temporal window.
4) Host-based traffic features: These make use of a historical window
calculated over numerous connections. Thus, host-based attributes
are created to analyze attacks with a timeframe longer than 2 seconds
[4].
Most of the features are of a continuous variable type, for which MICE uses
multiple regression; to account for the uncertainty of the missing data, a
standard error is added to the linear regression, and in calculating the stochastic
regression a related MICE method called predictive mean matching is used.
However, some variables (is_guest_login, flag, land, etc.) are of binary type
or unordered categorical (discrete) variables, for which MICE uses a logistic
regression algorithm to squash the predicted values between 0 and 1 by means
of the sigmoid function:

σ(x) = 1 / (1 + e^(−x))   (1)
Table 1 below lists the 41 features of the KDDCUP'99 dataset, taken from the
source (https://kdd.ics.uci.edu/databases/kddcup99/task.html). As shown in
Table 1, we remove two features, numbers 20 and 21, because their values in
the data are all zeros.
The dataset used for testing the proposed regression model is the KDDsubset
network intrusion cyber database, and since this dataset is quite large and
causes time delays and slow execution of the R code on limited hardware
equipment, the data is therefore cleaned.

Table 1: Attributes of the cyber dataset, 41 features plus the connection label

Nr | Name | Description
1 | duration | Duration of connection
2 | protocol_type | Connection protocol (tcp, udp, icmp)
3 | service | Dst port mapped to service
4 | flag | Normal or error status flag of connection
5 | src_bytes | Number of data bytes from src to dst
6 | dst_bytes | Number of data bytes from dst to src
7 | land | 1 if connection is from/to the same host/port; else 0
8 | wrong_fragment | Number of “wrong” fragments (values 0, 1, 3)
9 | urgent | Number of urgent packets
10 | hot | Number of “hot” indicators
11 | num_failed_logins | Number of failed login attempts
12 | logged_in | 1 if successfully logged in; else 0
13 | num_compromised | Number of “compromised” conditions
14 | root_shell | 1 if root shell is obtained; else 0
15 | su_attempted | 1 if “su root” command attempted; else 0
16 | num_root | Number of “root” accesses
17 | num_file_creations | Number of file creation operations
18 | num_shells | Number of shell prompts
19 | num_access_files | Number of operations on access control files
20 | num_outbound_cmds | Number of outbound commands in an ftp session
21 | is_hot_login | 1 if login belongs to “hot” list; else 0
22 | is_guest_login | 1 if login is “guest” login; else 0
23 | count | Number of connections to same host as current connection in the past two seconds
24 | srv_count | Number of connections to same service as current connection in the past two seconds
25 | serror_rate | % of connections that have “SYN” errors
26 | srv_serror_rate | % of connections that have “SYN” errors
27 | rerror_rate | % of connections that have “REJ” errors
28 | srv_rerror_rate | % of connections that have “REJ” errors
29 | same_srv_rate | % of connections to the same service
30 | diff_srv_rate | % of connections to different services
31 | srv_diff_host_rate | % of connections to different hosts
32 | dst_host_count | Count of connections having same dst host
33 | dst_host_srv_count | Count of connections having same dst host and using same service
34 | dst_host_same_srv_rate | % of connections having same dst host and using the same service
35 | dst_host_diff_srv_rate | % of different services on current host
36 | dst_host_same_src_port_rate | % of connections to current host having same src port
37 | dst_host_srv_diff_host_rate | % of connections to same service coming from diff hosts
38 | dst_host_serror_rate | % of connections to current host that have an S0 error
39 | dst_host_srv_serror_rate | % of connections to current host and specified service that have an S0 error
40 | dst_host_rerror_rate | % of connections to current host that have an RST error
41 | dst_host_srv_rerror_rate | % of connections to current host and specified service that have an RST error
42 | connection_type | N or A

N = normal, A = attack, c = continuous, d = discrete. Features numbered 2, 3,
4, 7, 12, 14 and 15 are discrete types, and the others are of continuous type.
The cleaned dataset used has 145,585 connections and 40 numerical
features; three features are categorical and are converted to numeric. For the
other 39 features describing the various connections, the dataset is scaled and
normalized by letting the mean be zero and the standard deviation be equal to
one. But the dataset is skewed to the right because of the categorical variables,
as shown in Figure 5 below.

To keep the data compact, we exclude the label variable and add it back later
for testing by the machine learning algorithms. The dataset is utilized to test and
evaluate intrusion detection with both normal and malicious connections labeled
as either attack or normal. After much literature review on the same dataset and
on how the training and testing data are divided, we found that the best split is to
let the training data be 66% and the testing data be 34%. The cleaned dataset
therefore contains 94,631 connections as training data and 50,954 connections as
testing data. The MCAR and MAR missingness mechanisms are applied to the
training data. The missingness is imposed cell-wise, so the total number of cells
is 3,690,609 and sampling from them is random. The labels were excluded from
the training data while imposing the two kinds of missingness; then, when the
data is fed to the classifier models, we return the labels. The test data is taken
from the original clean data and its labels are left unchanged, so the same test
data is used for all experiments with the machine learning algorithms. The
accuracies on the cleaned data are recorded before any treatment, so that the
missing data and the data imputed with MICE can be compared against this
baseline accuracy. We then feed the missing data and the imputed data to the
classifiers and analyze the performance of the four best-chosen classifiers on
the dataset. The classifiers selected from the literature were considered the best
in the evaluation of performance.
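A minimal sketch of the standardization and 66%/34% split described above is given below. The file name and the label column "connection_type" are placeholders, not the exact identifiers used in the dissertation.

# Minimal sketch of the scaling and 66%/34% train/test split described above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("kdd_subset_clean.csv")
y = data["connection_type"]                 # "attack" or "normal"
X = data.drop(columns=["connection_type"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=42, stratify=y)

scaler = StandardScaler().fit(X_train)      # zero mean, unit standard deviation
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)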

ML ALGORITHMS SELECTION
A substantial amount of data is available to organizations from diverse logs,
application logs, intrusion detection systems, and other sources. Year over year,
the data volume increases significantly; data is produced by every person every
second, and the number of devices connected to the internet is three times the
world population.

Figure 5: KDDsubset normal distribution.


A large part of this data is generated within the framework of the internet of
things (IoT), which results in more than 600 ZB of data each year.
These facts show the significant development witnessed by data in terms of
type, size, and speed. This vast data is attributed to several reasons, the most
important being the digitization processes carried out by companies and
institutions in recent years and the widespread use of social media, messaging
applications, and the internet of things. This growth in various technology fields
has made the internet a tempting target for misuse and anomaly intrusion, and
many researchers are thus engaged in analyzing KDDCUP'99 for detecting
intrusions. In analyzing the KDDsubset to test the accuracy and misclassification
of the data, we find that, of the 24 articles reviewed, 13 worked directly with
this dataset, while the others dealt with modified versions such as NSL-KDD
and GureKDD. Some results that relate directly to the KDDsubset were
summarized in an Excel sheet and are reviewed below.
Not all algorithms fail when there is missing data. Some algorithms
use the missing value as a unique and different value when building the
predictive model, such as classification and regression trees.

Below is a summary review of 10 articles, undertaken to find the most popular
algorithms to apply in the analysis.
Article 1. Summarized as shown in Table 2 below.
No accuracy was reported for this article, which uses fuzzy inference. An
effective set of fuzzy rules for the inference approach is identified automatically
by making use of a fuzzy rule learning strategy and appears to be effective for
detecting intrusions in a computer system. The rules are then given to a Sugeno
fuzzy system, which classifies the test data.
Article 2. Summarized as shown in Table 3 below.
Accuracy for this study was low, with a classifier called rule based enhanced
genetic (RBEGEN), an enhancement of the genetic algorithm.
Article 3. Summarized as shown in Table 4 below.
In this study, 11 algorithms were used, and we can note that the decision tree
and random forest achieved high accuracies.
Article 4. Summarized as shown in Table 5 below.

Table 2: Fuzzy inference system

Article | Classification technique | Results
1: Intrusion detection system using fuzzy inference system | Sugeno fuzzy inference system for generation of fuzzy rules and best first methods under select | Accuracy and other metrics not reported

Table 3: Enhanced algorithm

Article | Classification technique | Accuracy | Precision
2: Intrusion detection over networking KDD dataset using enhanced mining algorithm | Rule based enhanced genetic (RBEGEN) | 86.30% | 83.21%

Recall and the remaining metrics (TT, prediction rate, TP, TN, FP, FN, error rate, DR, FAR) were not reported.

Table 4: Feature extraction

Article 3: Detecting anomaly based network intrusion using feature extraction and classification techniques

Technique | Accuracy | TT (sec) | Prediction rate | TP | TN | FP | FN | Error rate | Precision | Recall
Decision Tree | 95.09% | 1.032 | 0.003 | 4649 | 2702 | 279 | 100 | 4.9 | 94.34% | 97.34%
MLP | 92.46% | 20.59 | 0.004 | 4729 | 2419 | 562 | 29 | 7.54 | 89.38% | 99.56%
KNN | 92.78% | 82.956 | 13.24 | 4726 | 2446 | 535 | 23 | 7.22 | 89.83% | 99.52%
Linear SVM | 92.59% | 78.343 | 2.11 | 4723 | 2434 | 547 | 26 | 7.41 | 89.62% | 99.45%
Passive aggressive | 90.34% | 0.275 | 0.001 | 4701 | 2282 | 699 | 48 | 9.66 | 89.62% | 99.45%
RBF SVM | 91.67% | 99.47 | 2.547 | 4726 | 2960 | 621 | 23 | 8.33 | 89.39% | 99.52%
Random Forest | 93.62% | 1.189 | 0.027 | 4677 | 2560 | 621 | 23 | 6.38 | 91.74% | 98.48%
AdaBoost | 93.52% | 29.556 | 0.225 | 4676 | 2553 | 428 | 73 | 6.48 | 91.61% | 98.46%
Gaussian NB | 94.35% | 244 | 0.006 | 4642 | 2651 | 330 | 107 | 5.65 | 93.36% | 97.75%
Multinomial NB | 91.71% | 0.429 | 0.001 | 4732 | 2357 | 624 | 17 | 8.29 | 88.35% | 99.64%
Quadratic Discriminant Analysis | 93.23% | 1.305 | 0.0019 | 4677 | 2530 | 451 | 72 | 6.77 | 91.20% | 84.87%

DR and FAR were not reported.

Table 5: Outlier detection

Article 4: Feature classification and outlier detection to increase accuracy in intrusion detection systems

Technique | Accuracy | TT (sec)
C4.5 | 99.94% | 199.33 before, 23.14 after
KNN | 99.90% | 0.37 before, 0.23 after
Naïve Bayes | 96.16% | 5.63 before, 1.36 after
Random forest | 99.94% | 554.63 before, 205.97 after
SVM | 99.94% | 699.07 before, 186.53 after

TT is reported before and after dimensionality reduction; other metrics were not reported.

Three datasets were used in this study, and one of them is KDDCUP'99;
accuracy and execution time are compared before and after dimensionality
reduction.
Article 5. Summarized as shown in Table 6 below. Five algorithms were used.
Article 6. Summarized as shown in Table 7 below.
Singular value decomposition (SVD) is an eigenvalue method used to reduce
a high-dimensional dataset into fewer dimensions while retaining important
information; the study uses an improved version of the algorithm (ISVD).
Article 7. Summarized as shown in Table 8 below.
In article 7, two classifiers were used, and J48 gives high accuracy results.

Table 6: Tree-based data mining

Article 5: Intrusion detection with tree-based data mining classification techniques using the KDD dataset

Technique | Accuracy | Error rate
Hoeffding Tree | 97.05% | 2.9499
J48 | 98.04% | 1.9584
Random Forest | 98.08% | 1.1918
Random Tree | 98.03% | 1.9629
REP Tree | 98.02% | 1.9738

Other metrics were not reported.

Table 7: Data reduction

Article 6: Using an imputed data detection method in an intrusion detection system

Technique | Accuracy | TT (sec) | Prediction rate | TP | TN | FP | FN | Error rate
SVD | 43.73% | 45.154 | 10.289 | 43.67% | 56.20% | 53.33% | 43.8 | 0.5
ISVD | 94.34% | 189.232 | 66.72 | 92.86 | not reported | not reported | 95.82 | 0.55

Table 8: Comparative analysis

Article 7: Comparative analysis of classification algorithms on the KDD'99 dataset

Technique | Accuracy | TT (sec) | Precision | Recall
J48 | 99.80% | 1.8 | 99.80% | 99.80%
Naïve Bayes | 84.10% | 47 | 97.20% | 77.20%

Articles 8 and 9. Summarized as shown in Table 9 below.
Ten classifiers were used to take measurement metrics for the dataset in
preprocessed and non-preprocessed form, and the results are better with the
preprocessed dataset.

Accuracy percentages for the above articles: as shown in Figure 6 below, the
percentage accuracy for each classifier is highlighted.
The following tables represent a paper that groups the major attack types and
separates the 10% KDDCUP'99 into five files according to attack type (DoS,
Probe, R2L, U2R, and normal). It is summarized as shown in Table 10 below.
Based on the attack type, DoS and Probe attacks involve several records,
while R2L and U2R are embedded in the data portions of packets and usually
involve only a single instance.
The above articles were summarized with their measurement metrics and
yielded nearly 61 classifier results with the best applicable algorithms.

Table 9: Problems in dataset

Articles 8 and 9: Problems of the KDD Cup 99 dataset and data preprocessing

Technique | Article 8: Accuracy | TT (sec) | Prediction rate | Article 9: Accuracy | TT (sec) | Prediction rate
Naïve Bayes | 96.31% | 6.16 | 381.45 | 90.45% | 3.21 | 116.06
Bayes Net | 99.65% | 43.08 | 57.38 | 99.13% | 26.65 | 35.04
Liblinear | 99.00% | 3557.76 | 6.1 | 98.95% | 708.36 | 4.27
MLP | 92.92% | 22,010.2 | 25.4 | 99.66% | 11,245.8 | 53.31
IBK | 100.00% | 0.39 | 79,304.62 | 100.00% | 0.19 | 48,255.78
Vote | 56.84% | 0.17 | 3 | 56.84% | 0.13 | 3.45
OneR | 98.14% | 5.99 | 6.38 | 98.98% | 3.76 | 5.35
J48 | 99.98% | 99.43 | 5.8 | 99.97% | 99.56 | 5.26
Random Forest | 100.00% | 122.29 | 4.23 | 99.99% | 10.89 | 5.63
Random Tree | 100.00% | 14.79 | 19.09 | 100.00% | 10.45 | 4.26

Other metrics (TP, TN, FP, FN, error rate, precision, recall, DR, FAR) were not reported.

Table 10: Dataset grouping

Article 10: Application of data mining to network intrusion detection: classifier selection model

Classification technique | Class | AA | TT (sec) | TP | FP
Bayes Net | DoS | 90.62% | 628 | 94.60% | 0.20%
Bayes Net | Probe | | | 88.80% | 0.12%
Bayes Net | U2R | | | 30.30% | 0.30%
Bayes Net | R2L | | | 5.20% | 0.60%
Naïve Bayes | DoS | 78.32% | 557 | 79.20% | 1.70%
Naïve Bayes | Probe | | | 94.80% | 13.30%
Naïve Bayes | U2R | | | 12.20% | 0.90%
Naïve Bayes | R2L | | | 0.10% | 0.30%
J48 | DoS | 92.06% | 1585 | 96.80% | 1.00%
J48 | Probe | | | 75.20% | 0.20%
J48 | U2R | | | 12.20% | 0.10%
J48 | R2L | | | 0.10% | 0.50%
NB Tree | DoS | 92.28% | 295.88 | 97.40% | 1.20%
NB Tree | Probe | | | 73.30% | 1.10%
NB Tree | U2R | | | 1.20% | 0.10%
NB Tree | R2L | | | 0.10% | 0.50%
Decision Table | DoS | 91.66% | 6624 | 97% | 10.70%
Decision Table | Probe | | | 57.60% | 40%
Decision Table | U2R | | | 32.80% | 0.30%
Decision Table | R2L | | | 0.30% | 0.10%
JRip | DoS | 92.30% | 207.47 | 97.40% | 0.30%
JRip | Probe | | | 83.80% | 0.10%
JRip | U2R | | | 12.80% | 0.10%
JRip | R2L | | | 0.10% | 0.40%
OneR | DoS | 89.31% | 375 | 94.20% | 6.80%
OneR | Probe | | | 12.90% | 0.10%
OneR | U2R | | | 10.70% | 2.00%
OneR | R2L | | | 10.70% | 0.10%
MLP | DoS | 92.03% | 350.15 | 96.90% | 1.47%
MLP | Probe | | | 74.30% | 0.10%
MLP | U2R | | | 20.10% | 0.10%
MLP | R2L | | | 0.30% | 0.50%

AA = average accuracy, TP = true-positive rate, FP = false-positive rate; other metrics were not reported.

As a result of the literature review and observations, a conclusion was made
to select the following four most influential and widely used algorithms to
apply in the dissertation study: decision tree, random forest, support vector
machine, and naïve bayes.
1) Decision tree: [5] This classifier works on a flowchart-like tree
structure that is divided into subparts by identifying lines. It uses
entropy and information gain to build the decision tree by selecting
the features that increase the information gain (IG) and reduce
the entropy. Classification and Regression Trees, abbreviated
CART, is a useful tool that works for classification or regression
in predictive modeling problems.

To build a decision tree model, we follow these steps:
a) Calculate the entropy of the target column before splitting.
b) Select a feature, and calculate the information gain (IG) and the
entropy with respect to the target column.
c) Then, the feature with the largest information gain is selected.
d) The selected feature is set as the root of the tree, which then
splits on the rest of the features.
e) The algorithm repeats steps b) to d) until each leaf has a decision
target.

Figure 6: Accuracies for different algorithms depicted.



In short, entropy measures the homogeneity of a sample of data. The
value is between zero and one: zero only when the data is completely
homogeneous, and one only when the data is maximally non-homogeneous.

E(D) = −Σ_i p_i log2(p_i)   (2)

Information gain measures the reduction in entropy or surprise obtained by
splitting a dataset according to a given value of a random variable. A larger
information gain suggests a lower entropy group or groups of samples, and
hence less surprise.

IG(Dp, f) = I(Dp) − (Nleft/N)·I(Dleft) − (Nright/N)·I(Dright)   (3)

where:
f: feature split on
Dp: dataset of the parent node
Dleft: dataset of the left child node
Dright: dataset of the right child node
I: impurity criterion (Gini index or entropy)
N: total number of samples
Nleft: number of samples at the left child node
Nright: number of samples at the right child node [6].
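As a small numeric sketch of Equations (2) and (3), the code below computes the entropy of a label array and the information gain of one candidate split; the toy "attack"/"normal" labels are invented for illustration only.

# Small numeric sketch of Equations (2) and (3).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(parent, left, right):
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

parent = np.array(["attack"] * 5 + ["normal"] * 5)   # toy labels
left, right = parent[:4], parent[4:]                 # a candidate split
print(entropy(parent), info_gain(parent, left, right))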
Figure 7 below indicates how the Decision Tree works.
2) Random Forest: [7] The random forest algorithm is a supervised
classification algorithm like the decision tree, but instead of one
tree, this classifier uses multiple trees and merges them to obtain
better accuracy and prediction. In random forests, each tree in the
ensemble is built from a sample drawn with replacement from
the training set, a procedure called bagging (bootstrap aggregation),
and this improves the stability of the model. Figure 8 below shows
the mechanism of this algorithm.

Figure 7: Decision Tree algorithm.


3) Support vector machines (SVM): [8] SVM is memory efficient
and uses a subset of the training points in the decision function. It is
a set of supervised machine learning procedures used for
classification, regression, and outlier detection, and different kernel
functions can be specified for the decision function. SVMs are
supervised learning algorithms that belong to both the regression
and classification categories of machine learning. This classifier
does not suffer from the usual limitations of data dimensionality
and limited samples.
It offers several kernel functions, such as linear, polynomial, and sigmoid,
and the user can select any one of them to classify the data.
Kernel Functions

Figure 8: Three decision trees making up a Random Forest model.


One of the most important features of SVM is that it utilizes the kernel
trick to extend the decision functions to linear and non-linear cases.
Linear kernel: the simplest kernel function, computed as a dot product between
any two observations:

K(x, x_i) = x · x_i   (4)

Polynomial kernel: a more complex function that can be used to distinguish
non-linear inputs. It can be represented as:

K(x, x_i) = (x · x_i + 1)^p   (5)

where p is the polynomial degree. The radial basis function is a kernel function
that helps in non-linear cases, as it computes a similarity that depends on the
distance from the origin or from some point:
RBF (Gaussian):

K(x, x_i) = exp(−||x − x_i||² / (2σ²))   (6)

Figure 9 below shows how SVM works.
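As a minimal sketch, the snippet below shows how the kernels corresponding to Equations (4)-(6) can be selected with scikit-learn's SVC; the data is synthetic and the parameter values are illustrative only, not the configuration used in the dissertation experiments.

# Minimal sketch of selecting linear, polynomial, and RBF kernels with SVC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=3, gamma="scale")  # degree only used by "poly"
    clf.fit(X_tr, y_tr)
    print(kernel, "accuracy:", round(clf.score(X_te, y_te), 3))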


4) Naïve Bayes: [10] Naïve Bayes classifiers are a family of simple
“probabilistic classifiers” based on applying Bayes' theorem with
strong (naïve) independence assumptions between the features.
This classifier provides a simple approach, with precise semantics,
to represent and learn probabilistic knowledge.
Naïve Bayes works on the principle of conditional probability as given
by Bayes' theorem and formulates its predictions using probability
distributions.
The posterior probability of the model parameters is conditional upon the
training inputs and outputs, and the aim is to determine the posterior distribution
for the model parameters. The Bayes rule is shown in Equation (7) below:

P(θ | X, y) = P(y | X, θ) P(θ) / P(y | X)   (7)

(8)

(9)
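A minimal illustration of a Naïve Bayes classifier (the Gaussian variant) is sketched below on synthetic data; this is not the exact configuration used in the dissertation, and the numbers printed are purely illustrative.

# Minimal illustration of a Gaussian Naïve Bayes classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=8, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34, random_state=2)

nb = GaussianNB()
nb.fit(X_tr, y_tr)                       # learns P(x_i | y) and the class priors P(y)
print("posterior class probabilities:", nb.predict_proba(X_te[:3]))
print("test accuracy:", nb.score(X_te, y_te))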



Figure 9: Support Vector Machine algorithm [9].

ACCURACY OF MACHINE LEARNING


The aim is not only to form hypotheses and prove them but also to answer
the following scientific questions: how do the classifiers perform when data is
missing, and which one works better? How does missingness impact machine
learning performance? Does the accuracy increase or decrease if we impute
data correctly with the correct number of imputations? Can the restoration of
accuracy after imputing the missing data be one metric for evaluating
performance, i.e., the number of instances a model can predict correctly?
Accuracy is an essential measurement tool that indicates how close the values
from the experiment are to the actual value, i.e., how close a measured value is
to the “true” value, while precision indicates how close the measured values
are to each other.
Although some models may achieve high accuracy, that alone is not
significant enough, and we need to check other factors, because a model might
give good results and still be biased toward some frequent records, and some
classification algorithms may also consume many hours or more to produce the
required results. Certainly, classification accuracy alone can be misleading if
we have an unequal number of observations in each class or more than two
classes in the dataset. Suppose we import specific libraries to solve the problem
mentioned above by balancing the classes, which is called oversampling or
downsampling; in that case, the library will create balanced samples that are
fed to the ML algorithm to classify. Also, one of the best evaluation tools for
such cases is the confusion matrix, shown in Figure 10 below: a confusion
matrix is a technique for summarizing the performance of a classification
algorithm.
Accuracy = (correctly predicted class/total testing class) × 100%, the
accuracy can be defined as the percentage of correctly classified instances
Acc = (TP + TN)/(TP + TN + FP + FN).
Where TP, FN, FP, and TN represent the number of true positives, false
negatives, false positives, and true negatives, respectively.
Also, you can use the standard performance measures:
Sensitivity (Recall, R) = TP/(TP + FN)
Specificity = TN/(TN + FP)
Precision (P) = TP/(TP + FP)
True-Positive Rate (TPR) = TP/(TP + FN)
False-Positive Rate (FPR) = FP/(FP + TN)
True-Negative Rate (TNR) = TN/(TN + FP)
False-Negative Rate (FNR) = FN/(FN + TP)
(T stands for true, F for false, N for negative and P for positive.)
For good classifiers, TPR and TNR should both be close to 100%; the same holds for the precision and accuracy parameters. On the contrary, FPR and FNR should both be as close to 0% as possible.
Detection Rate (DR) = number of instances correctly notified/total number of instances estimated.
In other words, FPR = FP/(all actual negatives) and TPR = TP/(all actual positives).
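To make the relationship between these measures explicit, the following minimal Python sketch (the confusion-matrix counts are assumed toy values, not results from the study) derives them from the four cells TP, FN, FP and TN:

# Assumed toy counts for the four confusion-matrix cells
TP, FN, FP, TN = 90, 10, 5, 95

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)      # recall / true-positive rate
specificity = TN / (TN + FP)      # true-negative rate
precision   = TP / (TP + FP)
fpr         = FP / (FP + TN)      # false-positive rate
fnr         = FN / (FN + TP)      # false-negative rate

print(f"Acc={accuracy:.2f} TPR={sensitivity:.2f} TNR={specificity:.2f} "
      f"P={precision:.2f} FPR={fpr:.2f} FNR={fnr:.2f}")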

Figure 10: Confusion Matrix for attack (A) and normal (N).

On Missing Data
The more accurate the measurement, the better it will be. Missing values used to be imputed with best guesses or, if the proportion of missing values was small and the dataset large enough, the records with missing values were simply dropped. However, this was the practice before the multivariate approach, and it is not the case anymore.
Regarding accuracy with missing data: because all the rows and columns hold numerical values, when we make the data incomplete with R code the empty cells are filled with NA. Since the classifiers may not work with NA, we can instead substitute the mean, median or mode of each variable, or just a constant number such as -9999; we run the code and it works with a constant number.

We conclude that the resulting accuracy may decrease, or that some algorithms will not respond at all.
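As an illustration of these simple substitutions, the following pandas sketch (toy data with assumed column names; the study itself used R) fills NA cells with the mean, the median, or the constant -9999:

import numpy as np
import pandas as pd

# Toy dataset with assumed column names and two missing cells
df = pd.DataFrame({"duration": [1.0, np.nan, 3.0, 4.0],
                   "src_bytes": [100.0, 250.0, np.nan, 80.0]})

mean_filled     = df.fillna(df.mean())     # per-column mean
median_filled   = df.fillna(df.median())   # per-column median
constant_filled = df.fillna(-9999)         # constant placeholder
print(constant_filled)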

On Imputed Data
We assume that, with reasonable multivariate imputation procedures, the accuracy will be close enough to the baseline accuracy of the original dataset before it was made incomplete; we then impute for both missingness mechanisms. The results are shown in chapter 5.
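For illustration only, the following Python sketch uses scikit-learn's IterativeImputer, a multivariate, chained-equations style imputer comparable in spirit to the MICE procedure used in the study (it is not the R MICE package itself); the data are toy values:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numerical matrix with two missing cells
X = np.array([[1.0, 100.0],
              [2.0, np.nan],
              [np.nan, 300.0],
              [4.0, 400.0]])

# Each feature is modelled as a function of the others, in a chained fashion
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))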

CONCLUSION
This paper provides a survey of different machine learning techniques for
measuring accuracy for different ML classifiers used to detect intrusions
in the KDD subset dataset. Many algorithms have shown promising results because they identify the attributes accurately. The best algorithms were chosen to test our dataset, and the results are posted in a different chapter. The
performance of four machine learning algorithms has been analyzed using
complete and incomplete versions of the KDDCUP’99 dataset. From this
investigation it can be concluded that the accuracy of these algorithms is
greatly affected when the dataset containing missing data is used to train
these algorithms and they cannot be relied upon in solving real world
problems. However, after using multiple imputation by chained equations (MICE) to remove the missingness from the dataset, the accuracy of the four algorithms increases greatly and is almost equal to that of the original complete dataset. This is clearly indicated by the confusion matrix
in section 5 where TNR and the TPR are both close to 100% while the
FNR and FPR are both close to zero. This paper has clearly indicated that
the performance of machine learning algorithms decreases greatly when a
dataset contains missing data, but this performance can be increased by using
MICE to get rid of the missingness. Some classifiers have better accuracy
than others, so we should be careful to choose the suitable algorithms for
each independent case. We conclude from the survey and observation that
the chosen classifiers work best with cybersecurity systems, while others do not and may be more helpful in different domains. A survey of many articles provides a beneficial opportunity for analyzing attack detection and for improved decision-making about which model is best to use.

ACKNOWLEDGEMENTS
The authors would like to thank the anonymous reviewers for their valuable
suggestions and notes; thanks are also extended to Scientific Research/Journal of
Data Analysis and Information Processing.

REFERENCES
1. Fatima, M. and Pasha, M. (2017) Survey of Machine Learning
Algorithms for Disease Diagnostic. Journal of Intelligent Learning
Systems and Applications, 9, 1-16. https://doi.org/10.4236/
jilsa.2017.91001
2. Kim, D.S. and Park, J.S. (2003) Network-Based Intrusion Detection
with Support Vector Machines. In: International Conference on
Information Networking. Springer, Berlin, Heidelberg, 747-756.
https://doi.org/10.1007/978-3-540-45235-5_73
3. Tavallaee, M., Bagheri, E., Lu, W. and Ghorbani, A.A. (2009)
A Detailed Analysis of the KDD CUP 99 Data Set. 2009 IEEE
Symposium on Computational Intelligence for Security and Defense
Applications, Ottawa, 8-10 July 2009, 1-6. https://doi.org/10.1109/
CISDA.2009.5356528
4. Sainis, N., Srivastava, D. and Singh, R. (2018) Feature Classification
and Outlier Detection to Increased Accuracy in Intrusion Detection
System. International Journal of Applied Engineering Research, 13,
7249-7255.
5. Sharma, H. and Kumar, S. (2016) A Survey on Decision Tree
Algorithms of Classification in Data Mining. International Journal of
Science and Research (IJSR), 5, 2094-2097. https://doi.org/10.21275/
v5i4.NOV162954
6. Singh, S. and Gupta, P. (2014) Comparative Study ID3, Cart and C4. 5
Decision Tree Algorithm: A Survey. International Journal of Advanced
Information Science and Technology (IJAIST), 27, 97-103.
7. Chen, J., Li, K., Tang, Z., Bilal, K., Yu, S., Weng, C. and Li, K. (2016)
A Parallel Random Forest Algorithm for Big Data in a Spark Cloud
Computing Environment. IEEE Transactions on Parallel and Distributed
Systems, 28, 919-933. https://doi.org/10.1109/TPDS.2016.2603511
8. Suthaharan, S. (2016) Support Vector Machine. In: Machine Learning
Models and Algorithms for Big Data Classification. Springer, Boston,
207-235. https://doi.org/10.1007/978-1-4899-7641-3_9
9. Larhman (2018) Linear Support Vector Machines. https://en.wikipedia.
org/wiki/Support-vector_machine
10. Chen, S., Webb, G.I., Liu, L. and Ma, X. (2020) A Novel Selective
Naïve Bayes Algorithm. Knowledge-Based Systems, 192, Article ID:
105361. https://doi.org/10.1016/j.knosys.2019.105361
Chapter 3

Data Modeling and Data Analytics: A Survey from a Big Data Perspective

André Ribeiro, Afonso Silva, Alberto Rodrigues da Silva


INESC-ID/Instituto Superior Técnico, Lisbon, Portugal

ABSTRACT
These last years we have been witnessing a tremendous growth in the volume
and availability of data. This fact results primarily from the emergence of
a multitude of sources (e.g. computers, mobile devices, sensors or social
networks) that are continuously producing either structured, semi-structured
or unstructured data. Database Management Systems and Data Warehouses
are no longer the only technologies used to store and analyze datasets, namely
due to the volume and complex structure of nowadays data that degrade
their performance and scalability. Big Data is one of the recent challenges,
since it implies new requirements in terms of data storage, processing and
visualization. Despite that, analyzing properly Big Data can constitute
great advantages because it allows discovering patterns and correlations in
datasets. Users can use this processed information to gain deeper insights
and to get business advantages. Thus, data modeling and data analytics are

Citation: Ribeiro, A. , Silva, A. and da Silva, A. (2015), “Data Modeling and Data
Analytics: A Survey from a Big Data Perspective”. Journal of Software Engineering
and Applications, 8, 617-634. doi: 10.4236/jsea.2015.812058.
Copyright: © 2015 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0

evolving in a way that allows us to process huge amounts of data without compromising performance and availability, namely by “relaxing” the usual ACID properties. This paper provides a broad view and discussion of
the current state of this subject with a particular focus on data modeling and
data analytics, describing and clarifying the main differences between the
three main approaches in what concerns these aspects, namely: operational
databases, decision support databases and Big Data technologies.

Keywords: Data Modeling, Data Analytics, Modeling Language, Big Data

INTRODUCTION
We have been witnessing an exponential growth in the volume of data produced and stored. This can be explained by the evolution of technology
that results in the proliferation of data with different formats from the most
various domains (e.g. health care, banking, government or logistics) and
sources (e.g. sensors, social networks or mobile devices). We have witnessed a
paradigm shift from simple books to sophisticated databases that keep being
populated every second at an immensely fast rate. Internet and social media
also highly contribute to the worsening of this situation [1] . Facebook, for
example, has an average of 4.75 billion pieces of content shared among
friends every day [2] . Traditional Relational Database Management Systems
(RDBMSs) and Data Warehouses (DWs) are designed to handle a certain
amount of data, typically structured, which is completely different from
the reality that we are facing nowadays. Business is generating enormous
quantities of data that are too big to be processed and analyzed by the
traditional RDBMSs and DWs technologies, which are struggling to meet
the performance and scalability requirements.
Therefore, in the recent years, a new approach that aims to mitigate
these limitations has emerged. Companies like Facebook, Google, Yahoo
and Amazon are the pioneers in creating solutions to deal with these “Big
Data” scenarios, namely recurring to technologies like Hadoop [3] [4] and
MapReduce [5] . Big Data is a generic term used to refer to massive and
complex datasets, which are made of a variety of data structures (structured,
semi- structured and unstructured data) from a multitude of sources [6] . Big
Data can be characterized by three Vs: volume (amount of data), velocity
(speed of data in and out) and variety (kinds of data types and sources) [7] .
Still, other Vs have been added for variability, veracity and value [8].

Adopting Big Data-based technologies not only mitigates the problems


presented above, but also opens new perspectives that allow extracting value
from Big Data. Big Data-based technologies are being applied with success in
multiple scenarios [1] [9] [10] such as: (1) e-commerce and marketing, where counting the clicks that crowds make on the web allows identifying trends that improve campaigns and evaluating the personal profile of a user, so that the content shown is the one he will most likely enjoy; (2) government and public health,
allowing the detection and tracking of disease outbreaks via social media or
detect frauds; (3) transportation, industry and surveillance, with real-time
improved estimated times of arrival and smart use of resources.
This paper provides a broad view of the current state of this area based
on two dimensions or perspectives: Data Modeling and Data Analytics.
Table 1 summarizes the focus of this paper, namely by identifying three
representative approaches considered to explain the evolution of Data
Modeling and Data Analytics. These approaches are: Operational databases,
Decision Support databases and Big Data technologies.
This research work has been conducted in the scope of the DataStorm
project [11] , led by our research group, which focuses on addressing the
design, implementation and operation of the current problems with Big
Data-based applications. More specifically, the goal of our team in this
project is to identify the main concepts and patterns that characterize such
applications, in order to define and apply suitable domain-specific languages
(DSLs). Then these DSLs will be used in a Model-Driven Engineering
(MDE) [12] -[14] approach aiming to ease the design, implementation and
operation of such data-intensive applications.
To ease the explanation and better support the discussion throughout
the paper, we use a very simple case study based on a fictitious academic
management system described below:

The outline of this paper is as follows: Section 2 describes Data Modeling


and some representative types of data models used in operational databases,
decision support databases and Big Data technologies. Section 3 details
the type of operations performed in terms of Data Analytics for these three
approaches. Section 4 compares and discusses each approach in terms of
the Data Modeling and Data Analytics perspectives. Section 5 discusses our

research in comparison with the related work. Finally, Section 6 concludes


the paper by summarizing its key points and identifying future work.

DATA MODELING
This section gives an in-depth look of the most popular data models used to
define and support Operational Databases, Data Warehouses and Big Data
technologies.

Table 1: Approaches and perspectives of the survey

Databases are widely used either for personal or enterprise use, namely due to their strong ACID (atomicity, consistency, isolation and durability) guarantees and the maturity level of the Database Management Systems (DBMSs) that support them [15].
The data modeling process may involve the definition of three data
models (or schemas) defined at different abstraction levels, namely
Conceptual, Logical and Physical data models [15] [16] . Figure 1 shows
part of the three data models for the AMS case study. All these models define
three entities (Person, Student and Professor) and their main relationships
(teach and supervise associations).
Conceptual Data Model. A conceptual data model is used to define,
at a very high and platform-independent level of abstraction, the entities
or concepts, which represent the data of the problem domain, and their
relationships. It leaves further details about the entities (such as their
attributes, types or primary keys) for the next steps. This model is typically
used to explore domain concepts with the stakeholders and can be omitted
or used instead of the logical data model.
Logical Data Model. A logical data model is a refinement of the previous
conceptual model. It details the domain entities and their relationships, but
standing also at a platform-independent level. It depicts all the attributes
that characterize each entity (possibly also including its unique identifier,
the primary key) and all the relationships between the entities (possibly

including the keys identifying those relationships, the foreign keys). Despite
being independent of any DBMS, this model can easily be mapped on to a
physical data model thanks to the details it provides.
Physical Data Model. A physical data model visually represents the
structure of the data as implemented by a given class of DBMS. Therefore,
entities are represented as tables, attributes are represented as table columns
and have a given data type that can vary according to the chosen DBMS,
and the relationships between each table are identified through foreign keys.
Unlike the previous models, this model tends to be platform-specific, because
it reflects the database schema and, consequently, some platform-specific
aspects (e.g. database-specific data types or query language extensions).
Summarizing, the complexity and detail increase from a conceptual to
a physical data model. First, it is important to perceive at a higher level of
abstraction, the data entities and their relationships using a Conceptual Data
Model. Then, the focus is on detailing those entities without worrying about
implementation details using a Logical Data Model. Finally, a Physical Data
Model allows to represent how data is supported by a given DBMS [15]
[16] .

Operational Databases
Databases had a great boost with the popularity of the Relational Model
[17] proposed by E. F. Codd in 1970. The Relational Model overcame the
problems of predecessors data models (namely the Hierarchical Model and
the Navigational Model [18] ). The Relational Model caused the emergence
of Relational Database Management Systems (RDBMSs), which are the most
used and popular DBMSs, as well as the definition of the Structured Query
Language (SQL) [19] as the standard language for defining and manipulating
data in RDBMSs. RDBMSs are widely used for maintaining data of daily
operations. Considering the data modeling of operational databases there are
two main models: the Relational and the Entity-Relationship (ER) models.
Relational Model. The Relational Model is based on the mathematical
concept of relation. A relation is defined as a set (in mathematics terminology)
and is represented as a table, which is a matrix of columns and rows, holding
information about the domain entities and the relationships among them.
Each column of the table corresponds to an entity attribute and specifies
the attribute’s name and its type (known as domain). Each row of the table
(known as tuple) corresponds to a single element of the represented domain
entity.


Figure 1: Example of three data models (at different abstraction levels) for the
Academic Management System.
In the Relational Model each row is unique and therefore a table has
an attribute or set of attributes known as primary key, used to univocally
identify those rows. Tables are related with each other by sharing one or
more common attributes. These attributes correspond to a primary key in the
referenced (parent) table and are known as foreign keys in the referencing
(child) table. In one-to-many relationships, the referenced table corresponds
to the entity of the “one” side of the relationship and the referencing table
corresponds to the entity of the “many” side. In many-to-many relationships,

an additional association table is used that associates the entities involved


through their respective primary keys. The Relational Model also features
the concept of View, which is like a table whose rows are not explicitly
stored in the database, but are computed as needed from a view definition.
Instead, a view is defined as a query on one or more base tables or other
views [17] .
Entity-Relationship (ER) Model. The Entity Relationship (ER) Model
[20] , proposed by Chen in 1976, appeared as an alternative to the Relational
Model in order to provide more expressiveness and semantics into the
database design from the user’s point of view. The ER model is a semantic
data model, i.e. aims to represent the meaning of the data involved on some
specific domain. This model was originally defined by three main concepts:
entities, relationships and attributes. An entity corresponds to an object in the
real world that is distinguishable from all other objects and is characterized
by a set of attributes. Each attribute has a range of possible values, known
as its domain, and each entity has its own value for each attribute. Similarly
to the Relational Model, the set of attributes that identify an entity is known
as its primary key.
Entities can be thought of as nouns and correspond to the tables of the
Relational Model. In turn, a relationship is an association established among
two or more entities. A relationship can be thought of as a verb and includes
the roles of each participating entities with multiplicity constraints, and
their cardinality. For instance, a relationship can be of one-to-one (1:1),
one-to-many (1:M) or many-to-many (M:N). In an ER diagram, entities are
usually represented as rectangles, attributes as circles connected to entities
or relationships through a line, and relationships as diamonds connected to
the intervening entities through a line.
The Enhanced ER Model [21] provided additional concepts to represent
more complex requirements, such as generalization, specialization,
aggregation and composition. Other popular variants of ER diagram
notations are Crow’s foot, Bachman, Barker’s, IDEF1X and UML Profile
for Data Modeling [22] .

Decision Support Databases


The evolution of relational databases to decision support databases,
hereinafter indistinctly referred to as “Data Warehouses” (DWs), occurred
with the need of storing operational but also historical data, and the need
of analyzing that data in complex dashboards and reports. Even though a

DW seems to be a relational database, it is different in the sense that DWs


are more suitable for supporting query and analysis operations (fast reads)
instead of transaction processing (fast reads and writes) operations. DWs
contain historical data that come from transactional data, but they also might
include other data sources [23] . DWs are mainly used for OLAP (online
analytical processing) operations. OLAP is the approach used to provide report data from the DW through multi-dimensional queries, and it requires the creation of a multi-dimensional database [24].
Usually, DWs include a framework that allows extracting data from
multiple data sources and transform it before loading to the repository,
which is known as ETL (Extract Transform Load) framework [23] .
Data modeling in DW consists in defining fact tables with several
dimension tables, suggesting star or snowflake schema data models [23] . A
star schema has a central fact table linked with dimension tables. Usually, a
fact table has a large number of attributes (in many cases in a denormalized
way), with many foreign keys that are the primary keys to the dimension
tables. The dimension tables represent characteristics that describe the fact
table. When star schemas become too complex to be queried efficiently they
are transformed into multi-dimensional arrays of data called OLAP cubes
(for more information on how this transformation is performed the reader
can consult the following references [24] [25] ).
A star schema is transformed to a cube by putting the fact table on the
front face that we are facing and the dimensions on the other faces of the cube
[24] . For this reason, cubes can be equivalent to star schemas in content,
but they are accessed with more platform-specific languages than SQL that
have more analytic capabilities (e.g. MDX or XMLA). A cube with three
dimensions is conceptually easier to visualize and understand, but the OLAP
cube model supports more than three dimensions, and is called a hypercube.
Figure 2 shows two examples of star schemas regarding the case study
AMS. The star schema on the left represents the data model for the Student’s
fact, while the data model on the right represents the Professor’s fact. Both
of them have a central fact table that contains specific attributes of the entity
in analysis and also foreign keys to the dimension tables. For example, a
Student has a place of origin (DIM_PLACEOFORIGIN) that is described
by a city and associated to a country (DIM_COUNTRY) that has a name and
an ISO code. On the other hand, Figure 3 shows a cube model with three

dimensions for the Student. These dimensions are represented by sides of
the cube (Student, Country and Date). This cube is useful to execute queries
such as: the students by country enrolled for the first time in a given year.
A challenge that DWs face is the growth of data, since it affects the
number of dimensions and levels in either the star schema or the cube
hierarchies. The increasing number of dimensions over time makes the
management of such systems often impracticable; this problem becomes
even more serious when dealing with Big Data scenarios, where data is
continuously being generated [23] .

Figure 2: Example of two star schema models for the Academic Management
System.

Figure 3: Example of a cube model for the Academic Management System.

Big Data Technologies


The volume of data has been exponentially increasing over the last years,
namely due to the simultaneous growth of the number of sources (e.g. users,
systems or sensors) that are continuously producing data. These data sources
produce huge amounts of data with variable representations that make their
management by the traditional RDBMSs and DWs often impracticable.
Therefore, there is a need to devise new data models and technologies that
can handle such Big Data.
NoSQL (Not Only SQL) [26] is one of the most popular approaches to
deal with this problem. It consists in a group of non-relational DBMSs that
consequently do not represent databases using tables and usually do not use
SQL for data manipulation. NoSQL systems allow managing and storing
large-scale denormalized datasets, and are designed to scale horizontally.
They achieve that by compromising consistency in favor of availability and
partition-tolerance, according to Brewer’s CAP theorem [27] . Therefore,
NoSQL systems are “eventually consistent”, i.e. assume that writes on the
data are eventually propagated over time, but there are limited guarantees
that different users will read the same value at the same time. NoSQL
provides BASE guarantees (Basically Available, Soft state and Eventually
consistent) instead of the traditional ACID guarantees, in order to greatly
improve performance and scalability [28] .

NoSQL databases can be classified into four categories [29]: (1) Key-value stores, (2) Document-oriented databases, (3) Wide-column stores, and (4) Graph databases.
Key-value Stores. A Key-Value store represents data as a collection
(known as dictionary or map) of key- value pairs. Every key consists in a
unique alphanumeric identifier that works like an index, which is used to
access a corresponding value. Values can be simple text strings or more
complex structures like arrays. The Key-value model can be extended to an
ordered model whose keys are stored in lexicographical order. The fact of
being a simple data model makes Key-value stores ideally suited to retrieve
information in a very fast, available and scalable way. For instance, Amazon
makes extensive use of a Key-value store system, named Dynamo, to manage
the products in its shopping cart [30] . Amazon’s Dynamo and Voldemort,
which is used by LinkedIn, are two examples of systems that apply this data
model with success. An example of a key-value store for both students and
professors of the Academic Management System is shown in Figure 4.
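As a minimal illustration of this model, the following Python sketch (keys and fields are assumptions, not taken from Figure 4) represents students and professors as key-value pairs in which records are reachable only through their keys:

# Assumed keys and serialized values for the AMS entities
students = {
    "student:1001": '{"name": "Ana", "country": "Portugal"}',
    "student:1002": '{"name": "Luis", "country": "Spain"}',
}
professors = {
    "professor:2001": '{"name": "Maria", "department": "CS"}',
}

# Lookups go through the key only; the values themselves are not queried
print(students["student:1001"])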
Document-oriented Databases. Document-oriented databases (or
document stores) were originally created to store traditional documents, like
a notepad text file or Microsoft Word document. However, their concept
of document goes beyond that, and a document can be any kind of domain
object [26] . Documents contain encoded data in a standard format like
XML, YAML, JSON or BSON (Binary JSON) and are univocally identified
in the database by a unique key. Documents contain semi-structured data
represented as name-value pairs, which can vary according to the row
and can nest other documents. Unlike key-value stores, these systems
support secondary indexes and allow fully searching either by keys or
values. Document databases are well suited for storing and managing huge
collections of textual documents (e.g. text files or email messages), as well
as semi-structured or denormalized data that would require an extensive
use of “nulls” in an RDBMS [30] . MongoDB and CouchDB are two of the
most popular Document-oriented database systems. Figure 5 illustrates two
collections of documents for both students and professors of the Academic
Management System.
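The following Python sketch (field names are illustrative assumptions) shows the same idea with in-memory documents: semi-structured records identified by a key, whose schema can vary and whose values can be searched as well as the keys:

# Assumed documents for a students collection; the schema may vary per document
students = [
    {"_id": "s1", "name": "Ana",
     "enrollments": [{"program": "CS", "year": 2015}]},
    {"_id": "s2", "name": "Luis", "country": "Spain"},
]

# Unlike a key-value store, the values can be searched as well as the keys
cs_students = [d for d in students
               if any(e.get("program") == "CS" for e in d.get("enrollments", []))]
print(cs_students)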

Figure 4: Example of a key-value store for the Academic Management System.

Figure 5: Example of a document-oriented database for the Academic Management System.
Wide-column Stores. Wide-column stores (also known as column-
family stores, extensible record stores or column-oriented databases)
represent and manage data as sections of columns rather than rows (like in
RDBMS). Each section is composed of key-value pairs, where the keys are
rows and the values are sets of columns, known as column families. Each
row is identified by a primary key and can have column families different
of the other rows. Each column family also acts as a primary key of the set

of columns it contains. In turn, each column of a column family consists of


a name-value pair. Column families can even be grouped in super column
families [29] . This data model was highly inspired by Google’s BigTable
[31] . Wide-column stores are suited for scenarios like: (1) Distributed data
storage; (2) Large-scale and batch-oriented data processing, using the famous
MapReduce method for tasks like sorting, parsing, querying or conversion; and (3) Exploratory and predictive analytics. Cassandra and Hadoop HBase
are two popular frameworks of such data management systems [29] . Figure
6 depicts an example of a wide-column store for the entity “person” of the
Academic Management System.
Graph Databases. Graph databases represent data as a network of nodes
(representing the domain entities) that are connected by edges (representing
the relationships among them) and are characterized by properties expressed
as key-value pairs. Graph databases are quite useful when the focus is
on exploring the relationships between data, such as traversing social
networks, detecting patterns or inferring recommendations. Due to their visual
representation, they are more user-friendly than the aforementioned types
of NoSQL databases. Neo4j and Allegro Graph are two examples of such
systems.
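A minimal sketch of this model (node and edge names are assumptions) can be given with plain Python structures, where traversing a relationship is simply a matter of following edges:

# Assumed nodes (with properties) and typed edges for the AMS entities
nodes = {
    "p1": {"label": "Professor", "name": "Maria"},
    "s1": {"label": "Student", "name": "Ana"},
}
edges = [("p1", "supervises", "s1")]

# Traversing a relationship means following the edges that leave a node
supervised = [nodes[dst]["name"] for src, rel, dst in edges
              if src == "p1" and rel == "supervises"]
print(supervised)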

DATA ANALYTICS
This section presents and discusses the types of operations that can be
performed over the data models described in the previous section and also
establishes comparisons between them. A complementary discussion is
provided in Section 4.

Operational Databases
Systems using operational databases are designed to handle a high number
of transactions that usually perform changes to the operational data, i.e. the
data an organization needs to assure its everyday normal operation. These
systems are called Online Transaction Processing (OLTP) systems and they
are the reason why RDBMSs are so essential nowadays. RDBMSs have
increasingly been optimized to perform well in OLTP systems, namely
providing reliable and efficient data processing [16] .
The set of operations supported by RDBMSs is derived from the
relational algebra and calculus underlying the Relational Model [15] . As
mentioned before, SQL is the standard language to perform these operations.
SQL can be divided in two parts involving different types of operations:

Data Definition Language (SQL-DDL) and Data Manipulation Language


(SQL-DML).
SQL-DDL allows performing the creation (CREATE), modification (ALTER)
and deletion (DROP) of the various database objects.

Figure 6: Example of a wide-column store for the Academic Management System.
First it allows managing schemas, which are named collections of all
the database objects that are related to one another. Then inside a schema,
it is possible to manage tables specifying their columns and types, primary
keys, foreign keys and constraints. It is also possible to manage views,
domains and indexes. An index is a structure that speeds up the process of
accessing to one or more columns of a given table, possibly improving the
performance of queries [15] [16] .
For example, considering the Academic Management System, a system
manager could create a table for storing information of a student by executing
the following SQL-DDL command:
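The original command is not reproduced here; a hedged sketch of such a statement, with assumed column names for the Student entity and issued through Python's sqlite3 module, could look like the following:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        student_id   INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        country      TEXT,
        enroll_year  INTEGER
    )
""")
conn.commit()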

On the other hand, SQL-DML is the language that enables to manipulate


database objects and particularly to extract valuable information from the
database. The most commonly used and complex operation is the SELECT
operation, which allows users to query data from the various tables of a
database. It is a powerful operation because it is capable of performing in a
single query the equivalent of the relational algebra’s selection, projection
and join operations. The SELECT operation returns as output a table with
the results. With the SELECT operation it is simultaneously possible to: define
which tables the user wants to query (through the FROM clause), which
rows satisfy a particular condition (through the WHERE clause), which
columns should appear in the result (through the SELECT clause), order the
result (in ascending or descending order) by one or more columns (through
the ORDER BY clause), group rows with the same column values (through
the GROUP BY clause) and filter those groups based on some condition
(through the HAVING clause). The SELECT operation also allows using
aggregation functions, which perform arithmetic computation or aggregation
of data (e.g. counting or summing the values of one or more columns).
Many times there is the need to combine columns of more than one table
in the result. To do that, the user can use the JOIN operation in the query. This
operation performs a subset of the Cartesian product between the involved
tables, i.e. returns the row pairs where the matching columns in each table
have the same value. The most common queries that use joins involve tables
that have one-to-many relationships. If the user wants to include in the result
the rows that did not satisfy the join condition, then he can use the outer
joins operations (left, right and full outer join). Besides specifying queries,
DML allows modifying the data stored in a database. Namely, it allows
adding new rows to a table (through the INSERT statement), modifying
the content of a given table’s rows (through the UPDATE statement) and
deleting rows from a table (through the DELETE statement) [16] . SQL-
DML also allows combining the results of two or more queries into a single

result table by applying the Union, Intersect and Except operations, based
on the Set Theory [15] .
For example, considering the Academic Management System, a system
manager could get a list of all students who are from G8 countries by entering
the following SQL-DML query:
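The original query is likewise not reproduced here; a hedged sketch of an equivalent SELECT with a JOIN and an IN filter, using assumed table and column names on an in-memory sqlite3 database, could look like the following:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (country_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT,
                          country_id INTEGER REFERENCES country(country_id));
    INSERT INTO country VALUES (1, 'France'), (2, 'Portugal');
    INSERT INTO student VALUES (10, 'Ana', 2), (11, 'Luc', 1);
""")

G8 = ("Canada", "France", "Germany", "Italy", "Japan", "Russia",
      "United Kingdom", "United States")
placeholders = ",".join("?" * len(G8))
rows = conn.execute(f"""
    SELECT s.name, c.name
    FROM student AS s JOIN country AS c ON c.country_id = s.country_id
    WHERE c.name IN ({placeholders})
""", G8).fetchall()
print(rows)   # [('Luc', 'France')]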

Decision Support Databases


The most common data model used in DW is the OLAP cube, which
offers a set of operations to analyze the cube model [23] . Since data is
conceptualized as a cube with hierarchical dimensions, its operations have
familiar names when manipulating a cube, such as slice, dice, drill and pivot.
Figure 7 depicts these operations considering the Student’s facts of the AMS
case study (see Figure 2).
The slice operation begins by selecting one of the dimensions (or faces)
of the cube. This dimension is the one we want to consult and it is followed
by “slicing” the cube to a specific depth of interest. The slice operation
leaves us with a more restricted selection of the cube, namely the dimension
we wanted (front face) and the layer of that dimension (the sliced section).
In the example of Figure 7 (top-left), the cube was sliced to consider only
data of the year 2004.
Dice is the operation that allows restricting the front face of the cube
by reducing its size to a smaller targeted domain. This means that the user
produces a smaller “front face” than the one he had at the start. Figure 7 (top-
right) shows that the set of students has decreased after the dice operation.
Drill is the operation that allows to navigate by specifying different
levels of the dimensions, ranging from the most detailed ones (drill down) to
the most summarized ones (drill up). Figure 7 (bottom-left) shows the drill
down so the user can see the cities from where the students of the country
Portugal come from.
The pivot operation allows changing the dimension that is being faced
(change the current front face) to one that is adjacent to it by rotating the
cube. By doing this, the user obtains another perspective of the data, which
requires the queries to have a different structure but can be more beneficial

for specific queries. For instance, he can slice and dice the cube away to get
the results he needed, but sometimes with a pivot most of those operations
can be avoided by perceiving a common structure on future queries and
pivoting the cube in the correct fashion [23] [24] . Figure 7 (bottom-right)
shows a pivot operation where years are arranged vertically and countries
horizontally.
The usual operations issued over the OLAP cube are about just querying
historical events stored in it. So, a common dimension is a dimension
associated with time.
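As a rough illustration of these operations, the following pandas sketch (toy data and assumed column names, not an OLAP engine) mimics slicing on a year, dicing to a single country, and pivoting countries against years:

import pandas as pd

# Toy fact table with assumed columns: one row per first-time enrollment
facts = pd.DataFrame({
    "student": ["Ana", "Luis", "Ana", "Marc"],
    "country": ["Portugal", "Spain", "Portugal", "France"],
    "year":    [2004, 2004, 2005, 2005],
})

sliced = facts[facts["year"] == 2004]              # slice: one layer of the time dimension
diced  = sliced[sliced["country"] == "Portugal"]   # dice: restrict to a smaller domain
pivot  = facts.pivot_table(index="year", columns="country",
                           values="student", aggfunc="count", fill_value=0)
print(pivot)                                       # pivot: countries against years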

Figure 7: Representation of cube operations for the Academic Management System: slice (top-left), dice (top-right), drill up/down (bottom-left) and pivot (bottom-right).
The most popular language for manipulating OLAP cubes is MDX
(Multidimensional Expressions) [32] , which is a query language for
OLAP databases that supports all the operations mentioned above. MDX
is exclusively used to analyze and read data since it was not designed with
SQL-DML in mind. The star schema and the OLAP cube are designed a priori
with a specific purpose in mind and cannot accept queries that differ much from the ones they were designed to respond to. The benefit of this is that queries are much simpler and faster, and by using a cube it is even
quicker to detect patterns, find trends and navigate around the data while
“slicing and dicing” with it [23] [25] .
Again considering the Academic Management System example, the
following query represents an MDX select statement. The SELECT clause
sets the query axes as the name and the gender of the Student dimension
and the year 2015 of the Date dimension. The FROM clause indicates the
data source, here being the Students cube, and the WHERE clause defines

the slicer axis as the “Computer Science” value of the Academic Program
dimension. This query returns the students (by names and gender) that have
enrolled in Computer Science in the year 2015.

Big Data Technologies


Big Data Analytics consists in the process of discovering and extracting
potentially useful information hidden in huge amounts of data (e.g. discover
unknown patterns and correlations). Big Data Analytics can be separated
in the following categories: (1) Batch-oriented processing; (2) Stream
processing; (3) OLTP; and (4) Interactive ad-hoc queries and analysis.
Batch-oriented processing is a paradigm where a large volume of data
is firstly stored and only then analyzed, as opposed to Stream processing.
This paradigm is very common to perform large-scale recurring tasks in
parallel like parsing, sorting or counting. The most popular batch-oriented
processing model is MapReduce [5] , and more specifically its open-source
implementation in Hadoop1. MapReduce is based on the divide and conquer
(D&C) paradigm to break down complex Big Data problems into small
sub-problems and process them in parallel. MapReduce, as its name hints,
comprises two major functions: Map and Reduce. First, data is divided into
small chunks and distributed over a network of nodes. Then, the Map function,
which performs operations like filtering or sorting, is applied simultaneously
to each chunk of data generating intermediate results. After that, those
intermediate results are aggregated through the Reduce function in order to
compute the final result. Figure 8 illustrates an example of the application of
MapReduce in order to calculate the number of students enrolled in a given
academic program by year. This model schedules computation resources
close to data location, which avoids the communication overhead of data
transmission. It is simple and widely applied in bioinformatics, web mining
and machine learning. Also related to Hadoop’s environment, Pig2 and

Hive3 are two frameworks used to express tasks for Big Data sets analysis
in MapReduce programs. Pig is suitable for data flow tasks and can produce
sequences of MapReduce programs, whereas Hive is more suitable for data
summarization, queries and analysis. Both of them use their own SQL-like
languages, Pig Latin and Hive QL, respectively [33] . These languages use
both CRUD and ETL operations.
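The following single-process Python sketch (toy records; in Hadoop the chunks would be distributed over several nodes) mimics the Map and Reduce steps of Figure 8 by counting enrollments per academic program and year:

from collections import defaultdict

# Toy enrollment records: (academic program, year)
records = [("CS", 2014), ("CS", 2014), ("Math", 2014), ("CS", 2015)]

# Map: emit an intermediate (key, 1) pair for every record
intermediate = [((program, year), 1) for program, year in records]

# Shuffle/Reduce: aggregate the counts per (program, year) key
counts = defaultdict(int)
for key, value in intermediate:
    counts[key] += value
print(dict(counts))   # {('CS', 2014): 2, ('Math', 2014): 1, ('CS', 2015): 1}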
Streaming processing is a paradigm where data is continuously arriving in a stream, in real time, and is analyzed as soon as possible in order to derive approximate results. It relies on the assumption that the potential value of data depends on its freshness. Due to its volume, only a portion
of the stream is stored in memory [33] . Streaming processing paradigm is
used in online applications that need real-time precision (e.g. dashboards of
production lines in a factory, calculation of costs depending on usage and
available resources). It is supported by Data Stream Management Systems
(DSMS) that allow performing SQL-like queries (e.g. select, join, group,
count) within a given window of data. This window establishes the period
of time (based on time) or number of events (based on length) [34] . Storm
and S4 are two examples of such systems.
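As a toy illustration of a length-based window, the following Python sketch keeps only the most recent events in memory and recomputes an approximate count as each new event arrives:

from collections import deque

window = deque(maxlen=3)   # window "based on length": keep only the last 3 events
for event in ["login", "error", "login", "login", "error"]:
    window.append(event)
    print("errors in current window:", sum(1 for e in window if e == "error"))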

Figure 8: Example of MapReduce applied to the Academic Management System.
OLTP, as we have seen before, is mainly used in the traditional RDBMS.
However, these systems cannot assure an acceptable performance when the
volume of data and requests is huge, like in Facebook or Twitter. Therefore,
it was necessary to adopt NoSQL databases that allow achieving very high

performances in systems with such large loads. Systems like Cassandra4,


HBase5 or MongoDB6 are effective solutions currently used. All of them
provide their own query languages with equivalent CRUD operations to
the ones provided by SQL. For example, in Cassandra it is possible to create Column Families using CQL, in HBase it is possible to delete a column using Java, and in MongoDB to insert a document into a collection using JavaScript.
Below there is a query in JavaScript for a MongoDB database equivalent to
the SQL-DML query presented previously.
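The original JavaScript shell query is not reproduced here; a hedged Python analogue using pymongo (database, collection and field names are assumptions, and a running MongoDB instance is required) could look like the following:

from pymongo import MongoClient

G8 = ["Canada", "France", "Germany", "Italy", "Japan", "Russia",
      "United Kingdom", "United States"]

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB server
students = client["ams"]["students"]                # assumed database and collection
for doc in students.find({"country": {"$in": G8}}, {"name": 1, "country": 1}):
    print(doc)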

At last, Interactive ad-hoc queries and analysis consists in a paradigm


that allows querying different large-scale data sources and query interfaces
with a very low latency. This type of system argues that queries should not need more than a few seconds to execute even at Big Data scale, so
that users are able to react to changes if needed. The most popular of these
systems is Drill7. Drill works as a query layer that transforms a query written
in a human-readable syntax (e.g. SQL) into a logical plan (query written in
a platform-independent way). Then, the logical plan is transformed into a
physical plan (query written in a platform-specific way) that is executed in
the desired data sources (e.g. Cassandra, HBase or MongoDB) [35] .

DISCUSSION
In this section we compare and discuss the approaches presented in the
previous sections in terms of the two perspectives that guide this survey:
Data Modeling and Data Analytics. Each perspective defines a set of features
used to compare Operational Databases, DWs and Big Data approaches
among themselves.
Regarding the Data Modeling Perspective, Table 2 considers the
following features of analysis: (1) the data model; (2) the abstraction
level in which the data model resides, according to the abstraction levels
(Conceptual, Logical and Physical) of the database design process; (3) the
concepts or constructs that compose the data model;

Table 2: Comparison of the approaches from the Data Modeling Perspective

(4) the concrete languages used to produce the data models and that apply
the previous concepts; (5) the modeling tools that allow specifying diagrams
using those languages and; (6) the database tools that support the data model.
Table 2 presents the values of each feature for each approach. It is possible to
verify that the majority of the data models are at a logical and physical level,
with the exception of the ER model and the OLAP cube model, which are
more abstract and defined at conceptual and logical levels. It is also possible
to verify that Big Data has more data models than the other approaches, what
can explain the work and proposals that have been conducted over the last
years, as well as the absence of a de facto data model. In terms of concepts,
again Big Data-related data models have a greater variety of concepts than
the other approaches, ranging from key-value pairs or documents to nodes
and edges. Concerning concrete languages, it is concluded that every data

model presented in this survey is supported by a SQL-DDL-like language.


However, we found that only the operational databases and DWs have
concrete languages to express their data models in a graphical way, like
Chen’s notation for ER model, UML Data Profile for Relational model or
CWM [36] for multidimensional DW models. Also, related to that point,
there are no modeling tools to express Big Data models. Thus, defining
such a modeling language and respective supporting tool for Big Data
models constitutes an interesting research direction that fills this gap. At last,
all approaches have database tools that support the development based on
their data models, with the exception of the ER model that is not directly
used by DBMSs.
On the other hand, in terms of the Data Analytics Perspective, Table
3 considers six features of analysis: (1) the class of application domains,
which characterizes the approach suitability; (2) the common operations
used in the approach, which can be reads and/or writes; (3) the operations
types most typically used in the approach; (4) the concrete languages used to
specify those operations; (5) the abstraction level of these concrete languages
(Conceptual, Logical and Physical); and (6) the technology support of these
languages and operations.
Table 3 shows that Big Data is used in more classes of application domains
than the operational databases and DWs, which are used for OLTP and
OLAP domains, respectively. It is also possible to observe that operational
databases are commonly used for reads and writes of small operations (using
transactions), because they need to handle fresh and critical data in a daily
basis. On the other hand, DWs are mostly suited for read operations, since
they perform analysis and data mining mostly with historical data. Big Data
performs both reads and writes, but in a different way and at a different scale
from the other approaches. Big Data applications are built to perform a huge
amount of reads, and if a huge amount of writes is needed, like for OLTP, they
sacrifice consistency (using “eventually consistency”) in order to achieve
great availability and horizontal scalability. Operational databases support
their data manipulation operations (e.g. select, insert or delete) using SQL-
DML, which has slight variations according to the technology used. DWs also
use SQL-DML through the select statement, because their operations (e.g.
slice, dice or drill down/up) are mostly reads. DWs also use SQL-based
languages, like MDX and XMLA (XML for Analysis) [37] , for specifying
their operations. On the other hand, regarding Big Data technologies, there is
a great variety of languages to manipulate data according to the different classes of
application domains. All of these languages provide equivalent operations

to the ones offered by SQL-DML and add new constructs for supporting
both ETL, data stream processing (e.g. create stream, window) [34] and
MapReduce operations. It is important to note that concrete languages used
in the different approaches reside at logical and physical levels, because
they are directly used by the supporting software tools.

RELATED WORK
As mentioned in Section 1, the main goal of this paper is to present and
discuss the concepts surrounding data modeling and data analytics, and
their evolution for three representative approaches: operational databases,
decision support databases and Big Data technologies.

Table 3: Comparison of the approaches from the Data Analytics perspective

In our survey we have researched related works that also explore and
compare these approaches from the data modeling or data analytics point
of view.
J.H. ter Bekke provides a comparative study between the Relational,
Semantic, ER and Binary data models based on an examination session
results [38] . In that session participants had to create a model of a case
study, similar to the Academic Management System used in this paper. The
purpose was to discover relationships between the modeling approach in

use and the resulting quality. Therefore, this study just addresses the data
modeling topic, and more specifically only considers data models associated
to the database design process.
Several works focus on highlighting the differences between operational
databases and data warehouses. For example, R. Hou provides an analysis
between operational databases and data warehouses distinguishing them
according to their related theory and technologies, and also establishing
common points where combining both systems can bring benefits [39] . C.
Thomsen and T.B. Pedersen compare open source ETL tools, OLAP clients
and servers, and DBMSs, in order to build a Business Intelligence (BI)
solution [40] .
P. Vassiliadis and T. Sellis conducted a survey that focuses only on OLAP
databases and compares various proposals for the logical models behind
them. They group the various proposals in just two categories: commercial
tools and academic efforts, which in turn are subcategorized in relational
model extensions and cube- oriented approaches [41] . However, unlike our
survey they do not cover the subject of Big Data technologies.
Several papers discuss the state of the art of the types of data stores,
technologies and data analytics used in Big Data scenarios [29] [30] [33]
[42] , however they do not compare them with other approaches. Recently, P.
Chandarana and M. Vijayalakshmi focus on Big Data analytics frameworks
and provide a comparative study according to their suitability [35] .
Summarizing, none of the aforementioned works provides such a broad analysis as the one in this paper; namely, as far as we know, we did not find any paper that simultaneously compares operational databases, decision support databases and Big Data technologies. Instead, they focus on describing one or two of these approaches more thoroughly.

CONCLUSIONS
In recent years, the term Big Data has appeared to classify the huge datasets
that are continuously being produced from various sources and that are
represented in a variety of structures. Handling this kind of data represents
new challenges, because the traditional RDBMSs and DWs reveal serious
limitations in terms of performance and scalability when dealing with such
a volume and variety of data. Therefore, it is necessary to reinvent the ways in
which data is represented and analyzed, in order to be able to extract value
from it.

This paper presents a survey focused on both these two perspectives:


data modeling and data analytics, which are reviewed in terms of the three
representative approaches nowadays: operational databases, decision
support databases and Big Data technologies. First, concerning data
modeling, this paper discusses the most common data models, namely:
relational model and ER model for operational databases; star schema
model and OLAP cube model for decision support databases; and key-value
store, document-oriented database, wide-column store and graph database
for Big Data-based technologies. Second, regarding data analytics, this
paper discusses the common operations used for each approach. Namely, it
observes that operational databases are more suitable for OLTP applications,
decision support databases are more suited for OLAP applications, and Big
Data technologies are more appropriate for scenarios like batch-oriented
processing, stream processing, OLTP and interactive ad-hoc queries and
analysis.
Third, it compares these approaches in terms of the two perspectives and
based on some features of analysis. From the data modeling perspective,
there are considered features like the data model, its abstraction level, its
concepts, the concrete languages used to describe it, as well as the modeling
and database tools that support it. On the other hand, from the data analytics
perspective, there are taken into account features like the class of application
domains, the most common operations and the concrete languages used to
specify those operations. From this analysis, it is possible to verify that there
are several data models for Big Data, but none of them is represented by any
modeling language, neither supported by a respective modeling tool. This
issue constitutes an open research area that can improve the development
process of Big Data targeted applications, namely applying a Model-Driven
Engineering approach [12] -[14] . Finally, this paper also presents some
related work on the data modeling and data analytics areas.
As future work, we consider that this survey may be extended to
capture additional aspects and comparison features that are not included in
our analysis. It will be also interesting to survey concrete scenarios where
Big Data technologies prove to be an asset [43] . Furthermore, this survey
constitutes a starting point for our ongoing research goals in the context
of the Data Storm and MDD Lingo initiatives. Specifically, we intend to
extend existing domain-specific modeling languages, like XIS [44] and
XIS-Mobile [45] [46] , and their MDE-based framework to support both
the data modeling and data analytics of data-intensive applications, such as
those researched in the scope of the Data Storm initiative [47] - [50] .

ACKNOWLEDGEMENTS
This work was partially supported by national funds through FCT―Fundação
para a Ciência e a Tecnologia, under the projects POSC/EIA/57642/2004,
CMUP-EPB/TIC/0053/2013, UID/CEC/50021/2013 and Data Storm
Research Line of Excellency funding (EXCL/EEI-ESS/0257/2012).

NOTES
1. https://hadoop.apache.org
2. https://pig.apache.org
3. https://hive.apache.org
4. http://cassandra.apache.org
5. https://hbase.apache.org
6. https://www.mongodb.org
7. https://drill.apache.org

REFERENCES
1. Mayer-Schonberger, V. and Cukier, K. (2014) Big Data: A Revolution
That Will Transform How We Live, Work, and Think. Houghton
Mifflin Harcourt, New York.
2. Noyes, D. (2015) The Top 20 Valuable Facebook Statistics.https://
zephoria.com/top-15-valuable-facebook-statistics
3. Shvachko, K., Hairong Kuang, K., Radia, S. and Chansler, R. (2010)
The Hadoop Distributed File System. 26th Symposium on Mass
Storage Systems and Technologies (MSST), Incline Village, 3-7 May
2010, 1-10.http://dx.doi.org/10.1109/msst.2010.5496972
4. White, T. (2012) Hadoop: The Definitive Guide. 3rd Edition, O’Reilly
Media, Inc., Sebastopol.
5. Dean, J. and Ghemawat, S. (2008) MapReduce: Simplified Data
Processing on Large Clusters. Communications, 51, 107-113.http://dx.
doi.org/10.1145/1327452.1327492
6. Hurwitz, J., Nugent, A., Halper, F. and Kaufman, M. (2013) Big Data
for Dummies. John Wiley & Sons, Hoboken.
7. Beyer, M.A. and Laney, D. (2012) The Importance of “Big Data”: A
Definition. Gartner. https://www.gartner.com/doc/2057415
8. Duncan, A.D. (2014) Focus on the “Three Vs” of Big Data Analytics:
Variability, Veracity and Value. Gartner.https://www.gartner.com/
doc/2921417/focus-vs-big-data-analytics
9. Agrawal, D., Das, S. and El Abbadi, A. (2011) Big Data and Cloud
Computing: Current State and Future Opportunities. Proceedings
of the 14th International Conference on Extending Database
Technology, Uppsala, 21-24 March, 530-533.http://dx.doi.
org/10.1145/1951365.1951432
10. McAfee, A. and Brynjolfsson, E. (2012) Big Data: The Management
Revolution. Harvard Business Review.
11. DataStorm Project Website.http://dmir.inesc-id.pt/project/DataStorm.
12. Stahl, T., Voelter, M. and Czarnecki, K. (2006) Model-Driven Software
Development: Technology, Engineering, Management. John Wiley &
Sons, Inc., New York.
13. Schmidt, D.C. (2006) Guest Editor’s Introduction: Model-Driven
Engineering. IEEE Computer, 39, 25-31.http://dx.doi.org/10.1109/
MC.2006.58
14. Silva, A.R. (2015) Model-Driven Engineering: A Survey Supported
by the Unified Conceptual Model. Computer Languages, Systems &
Structures, 43, 139-155.
15. Ramakrishnan, R. and Gehrke, J. (2012) Database Management
Systems. 3rd Edition, McGraw-Hill, Inc., New York.
16. Connolly, T.M. and Begg, C.E. (2005) Database Systems: A Practical
Approach to Design, Implementation, and Management. 4th Edition,
Pearson Education, Harlow.
17. Codd, E.F. (1970) A Relational Model of Data for Large Shared Data
Banks. Communications of the ACM, 13, 377-387.http://dx.doi.
org/10.1145/362384.362685
18. Bachman, C.W. (1969) Data Structure Diagrams. ACM SIGMIS
Database, 1, 4-10.http://dx.doi.org/10.1145/1017466.1017467
19. Chamberlin, D.D. and Boyce, R.F. (1974) SEQUEL: A Structured
English Query Language. In: Proceedings of the 1974 ACM SIGFIDET
(Now SIGMOD) Workshop on Data Description, Access and Control
(SIGFIDET’ 74), ACM Press, Ann Harbor, 249-264.
20. Chen, P.P.S. (1976) The Entity-Relationship Model—Toward a Unified
View of Data. ACM Transactions on Database Systems, 1, 9-36.http://
dx.doi.org/10.1145/320434.320440
21. Tanaka, A.K., Navathe, S.B., Chakravarthy, S. and Karlapalem, K.
(1991) ER-R, an Enhanced ER Model with Situation-Action Rules to
Capture Application Semantics. Proceedings of the 10th International
Conference on Entity-Relationship Approach, San Mateo, 23-25
October 1991, 59-75.
22. Merson, P. (2009) Data Model as an Architectural View. Technical Note
CMU/SEI-2009-TN-024, Software Engineering Institute, Carnegie
Mellon.
23. Kimball, R. and Ross, M. (2013) The Data Warehouse Toolkit: The
Complete Guide to Dimensional Modeling. 3rd Edition, John Wiley &
Sons, Inc., Indianapolis.
24. Zhang, D., Zhai, C., Han, J., Srivastava, A. and Oza, N. (2009) Topic
Modeling for OLAP on Multidimensional Text Databases: Topic Cube
and Its Applications. Statistical Analysis and Data Mininig, 2, 378-
395.http://dx.doi.org/10.1002/sam.10059
25. Gray, J., et al. (1997) Data Cube: A Relational Aggregation
Operator Generalizing Group-By, Cross-Tab, and Sub-Totals.
Data Mining and Knowledge Discovery, 1, 29-53.http://dx.doi.
org/10.1023/A:1009726021843
26. Cattell, R. (2011) Scalable SQL and NoSQL Data Stores. ACM SIGMOD
Record, 39, 12-27.http://dx.doi.org/10.1145/1978915.1978919
27. Gilbert, S. and Lynch, N. (2002) Brewer’s Conjecture and the
Feasibility of Consistent, Available, Partition-Tolerant Web Services.
ACM SIGACT News, 33, 51-59.
28. Vogels, W. (2009) Eventually Consistent. Communications of the
ACM, 52, 40-44.http://dx.doi.org/10.1145/1435417.1435432
29. Grolinger, K., Higashino, W.A., Tiwari, A. and Capretz, M.A.M. (2013)
Data Management in Cloud Environments: NoSQL and NewSQL
Data Stores. Journal of Cloud Computing: Advances, Systems and
Applications, 2, 22.http://dx.doi.org/10.1186/2192-113x-2-22
30. Moniruzzaman, A.B.M. and Hossain, S.A. (2013) NoSQL Database:
New Era of Databases for Big data Analytics-Classification,
Characteristics and Comparison. International Journal of Database
Theory and Application, 6, 1-14.
31. Chang, F., et al. (2006) Bigtable: A Distributed Storage System for
Structured Data. Proceedings of the 7th Symposium on Operating
Systems Design and Implementation (OSDI’ 06), Seattle, 6-8
November 2006, 205-218.
32. Spofford, G., Harinath, S., Webb, C. and Civardi, F. (2005) MDX
Solutions: With Microsoft SQL Server Analysis Services 2005 and
Hyperion Essbase. John Wiley & Sons, Inc., Indianapolis.
33. Hu, H., Wen, Y., Chua, T.S. and Li, X. (2014) Toward Scalable Systems
for Big Data Analytics: A Technology Tutorial. IEEE Access, 2, 652-
687.http://dx.doi.org/10.1109/ACCESS.2014.2332453
34. Golab, L. and Ozsu, M.T. (2003) Issues in Data Stream Management.ACM
SIGMOD Record, 32, 5-14.http://dx.doi.org/10.1145/776985.776986
35. Chandarana, P. and Vijayalakshmi, M. (2014) Big Data Analytics
Frameworks. Proceedings of the International Conference on Circuits,
Systems, Communication and Information Technology Applications
(CSCITA), Mumbai, 4-5 April 2014, 430-434.http://dx.doi.
org/10.1109/cscita.2014.6839299
36. Poole, J., Chang, D., Tolbert, D. and Mellor, D. (2002) Common
Warehouse Metamodel. John Wiley & Sons, Inc., New York.
37. XML for Analysis (XMLA) Specification.https://msdn.microsoft.com/
en-us/library/ms977626.aspx.
38. ter Bekke, J.H. (1997) Comparative Study of Four Data Modeling
Approaches. Proceedings of the 2nd EMMSAD Workshop, Barcelona,
16-17 June 1997, 1-12.
39. Hou, R. (2011) Analysis and Research on the Difference between Data
Warehouse and Database. Proceedings of the International Conference
on Computer Science and Network Technology (ICCSNT), Harbin,
24-26 December 2011, 2636-2639.
40. Thomsen, C. and Pedersen, T.B. (2005) A Survey of Open Source
Tools for Business Intelligence. Proceedings of the 7th International
Conference on Data Warehousing and Knowledge Discovery
(DaWaK’05), Copenhagen, 22-26 August 2005, 74-84.http://dx.doi.
org/10.1007/11546849_8
41. Vassiliadis, P. and Sellis, T. (1999) A Survey of Logical Models for
OLAP Databases. ACM SIGMOD Record, 28, 64-69.http://dx.doi.
org/10.1145/344816.344869
42. Chen, M., Mao, S. and Liu, Y. (2014) Big Data: A Survey. Mobile
Networks and Applications, 19, 171-209.http://dx.doi.org/10.1007/978-
3-319-06245-7
43. Chen, H., Chiang, R.H.L. and Storey, V.C. (2012)
Business Intelligence and Analytics: From Big Data to Big Impact.
MIS Quarterly, 36, 1165-1188.
44. Silva, A.R., Saraiva, J., Silva, R. and Martins, C. (2007) XIS-UML
Profile for Extreme Modeling Interactive Systems. Proceedings of
the 4th International Workshop on Model-Based Methodologies for
Pervasive and Embedded Software (MOMPES’07), Braga, 31-31
March 2007, 55-66.http://dx.doi.org/10.1109/MOMPES.2007.19
45. Ribeiro, A. and Silva, A.R. (2014) XIS-Mobile: A DSL for Mobile
Applications. Proceedings of the 29th Symposium on Applied
Computing (SAC 2014), Gyeongju, 24-28 March 2014, 1316-1323.
http://dx.doi.org/10.1145/2554850.2554926
46. Ribeiro, A. and Silva, A.R. (2014) Evaluation of XIS-Mobile, a
Domain Specific Language for Mobile Application Development.
Journal of Software Engineering and Applications, 7, 906-919.http://
dx.doi.org/10.4236/jsea.2014.711081
47. Silva, M.J., Rijo, P. and Francisco, A. (2014). Evaluating the
Impact of Anonymization on Large Interaction Network Datasets.
In: Proceedings of the 1st International Workshop on Privacy and
Security of Big Data, ACM Press, New York, 3-10.http://dx.doi.
org/10.1145/2663715.2669610
48. Anjos, D., Carreira, P. and Francisco, A.P. (2014) Real-Time Integration
of Building Energy Data. Proceedings of the IEEE International
Congress on Big Data, Anchorage, 27 June-2 July 2014, 250-257.
http://dx.doi.org/10.1109/BigData.Congress.2014.44
49. Machado, C.M., Rebholz-Schuhmann, D., Freitas, A.T. and Couto,
F.M. (2015) The Semantic Web in Translational Medicine: Current
Applications and Future Directions. Briefings in Bioinformatics, 16,
89-103.http://dx.doi.org/10.1093/bib/bbt079
50. Henriques, R. and Madeira, S.C. (2015) Towards Robust Performance
Guarantees for Models Learned from High-Dimensional Data. In:
Hassanien, A.E., Azar, A.T., Snasael, V., Kacprzyk, J. and Abawajy,
J.H., Eds., Big Data in Complex Systems, Springer, Berlin, 71-104.
http://dx.doi.org/10.1007/978-3-319-11056-1_3
Chapter 4

Big Data Analytics for Business Intelligence in Accounting and Audit

Mui Kim Chu, Kevin Ow Yong

Singapore Institute of Technology, 10 Dover Drive, Singapore

ABSTRACT
Big data analytics represents a promising area for the accounting and
audit professions. We examine how machine learning applications, data
analytics and data visualization software are changing the way auditors and
accountants work with their clients. We find that audit firms are keen to use
machine learning software tools to read contracts, analyze journal entries,
and assist in fraud detection. In data analytics, predictive analytical tools are
utilized by both accountants and auditors to make projections and estimates,
and to enhance business intelligence (BI). In addition, data visualization
tools are able to complement predictive analytics to help users uncover
trends in the business process. Overall, we anticipate that the technological
advances in these various fields will accelerate in the coming years. Thus,
it is imperative that accountants and auditors embrace these technological
advancements and harness these tools to their advantage.

Citation: Chu, M. and Yong, K. (2021), “Big Data Analytics for Business Intelligence
in Accounting and Audit”. Open Journal of Social Sciences, 9, 42-52. doi: 10.4236/
jss.2021.99004.
Copyright: © 2021 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0

Keywords: Data Analytics, Machine Learning, Data Visualization, Audit Analytics

INTRODUCTION
Big data analytics has transformed the world that we live in. Due to
technological advances, big data analytics enables new forms of business
value and enterprise risk that will have an impact on the rules, standards and
practices for the finance and accounting professions. The accounting and
audit professionals are important players in harnessing the power of big data
analytics, and they are poised to become even more vital to stakeholders in
supporting data and insight-driven enterprises.
Data analytics can enable auditors to focus on exception reporting more
efficiently by identifying outliers in risky areas of the audit process (IAASB,
2018). The advent of inexpensive computational power and storage, as well
as the progressive computerization of organizational systems, is creating a
new environment in which accountants and auditors must adapt to harness
the power of big data analytics. In other applications, data analytics can help
auditors to improve the risk assessment process, substantive procedures and
tests of controls (Lim et al., 2020). These software tools have the potential to
provide further evidence to assist with audit judgements and provide greater
insights for audit clients.
In machine learning applications, the expectation is that the algorithm
will learn from the data provided, in a manner that is similar to how a human
being learns from data. A classic application of machine learning tools is
pattern recognition. Facial recognition machine learning software has been
developed such that a machine-learning algorithm can look at pictures of
men and women and be able to identify those features that are male driven
from those that are female driven. Initially, the algorithm might misclassify
some male faces as female faces. It is thus important for the programmer
to write an algorithm that can be trained using test data to look for specific
patterns in male and female faces.
Because machine learning requires large data sets in order to train the
learning algorithms, the availability of a vast quantity of high-quality data
will expedite the process by allowing the programmer to refine the machine
learning algorithms to be able to identify pictures that contain a male
face as opposed to a female face. Gradually, the algorithm will be able to
classify some general characteristics of a man (e.g., spotting a beard, certain
differences in hair styles, broad faces) from those that belong to a woman
(e.g., more feminine characteristics).
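To make this training-and-refinement loop concrete, the short sketch below uses scikit-learn on synthetic feature vectors; the library choice, the feature values, and the labels are illustrative assumptions rather than part of the chapter, but the pattern of fitting a classifier, checking its accuracy, and retraining on additional labeled data mirrors the process described above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical pre-extracted feature vectors with labels 0/1 for the two
# classes; a real system would derive these features from face images.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("initial accuracy:", model.score(X_test, y_test))

# "Refinement": add newly labeled examples and retrain, which is how the
# misclassified cases described in the text are gradually corrected.
X_new = rng.normal(size=(500, 16))
y_new = (X_new[:, 0] + 0.5 * X_new[:, 1] > 0).astype(int)
model.fit(np.vstack([X_train, X_new]), np.concatenate([y_train, y_new]))
print("accuracy after adding data:", model.score(X_test, y_test))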
Similarly, it is envisaged that many routine accounting processes will be
handled by machine learning algorithms or robotics automation processing
(RPA) tools in the near future. For example, it is possible that machine
learning algorithms can receive an invoice, match it to a purchase order,
determine the expense account to charge and the amount to be paid, and
place it in a pool of payments for a human employee to review the documents
and release them for payment to the respective vendors.
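A minimal sketch of the invoice-to-purchase-order matching step is shown below; it assumes pandas and hypothetical column names (po_number, amount, expense_account), and simply queues exact matches for payment while routing everything else to a human reviewer, in the spirit of the workflow just described.

import pandas as pd

# Hypothetical purchase orders and incoming invoices.
purchase_orders = pd.DataFrame({
    "po_number": ["PO-001", "PO-002"],
    "amount": [1200.00, 540.50],
    "expense_account": ["6100-Supplies", "6200-Services"],
})
invoices = pd.DataFrame({
    "invoice_id": ["INV-9", "INV-10", "INV-11"],
    "po_number": ["PO-001", "PO-002", "PO-999"],
    "amount": [1200.00, 560.00, 300.00],
})

matched = invoices.merge(purchase_orders, on="po_number", how="left",
                         suffixes=("_inv", "_po"))

# Auto-queue exact matches for payment; everything else needs human review.
matched["status"] = "needs_review"
exact = matched["amount_po"].notna() & (matched["amount_inv"] == matched["amount_po"])
matched.loc[exact, "status"] = "queued_for_payment"
print(matched[["invoice_id", "po_number", "status"]])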
Likewise, in auditing a client, a well designed machine learning
algorithm could make it easier to detect potential fraudulent transactions in
a company’s financial statements by training the machine learning algorithm
to successfully identify transactions that have characteristics associated with
fraudulent activities from bona fide transactions. The evolution of machine
learning is thus expected to have a dramatic impact on business, and the
accounting profession will need to adapt to better understand how to
utilize such technologies in modifying their ways of
working when auditing financial statements of their audit clients (Haq,
Abatemarco, & Hoops, 2020).
Predictive analytics is a subset of data analytics. Predictive analytics can
be viewed as helping the accountant or auditor in understanding the future
and provides foresight by identifying patterns in historical data. One of the
most common applications of predictive analytics in the field of accounting
is the computation of a credit score to indicate the likelihood of timely future
credit payments. This predictive analytics tool can be used to predict an
accounts receivable balance at a certain date and to estimate a collection
period for each customer.
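As a rough illustration of this idea, the sketch below fits a simple regression with scikit-learn on a handful of hypothetical per-customer features (credit score, invoice amount, average days late last year) to estimate each customer's collection period; the feature set and the numbers are invented for illustration, not taken from the chapter.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: [credit_score, invoice_amount, avg_days_late_last_year]
X = np.array([
    [720, 5000, 2], [650, 12000, 15], [580, 3000, 30],
    [700, 8000, 5], [610, 15000, 25], [690, 4000, 8],
])
y = np.array([32, 48, 65, 35, 60, 40])  # observed days taken to collect

model = LinearRegression().fit(X, y)

# Estimated collection period for two new customers; from these estimates an
# expected receivables balance at a given date could be projected.
new_customers = np.array([[640, 9000, 18], [710, 2500, 1]])
print("estimated collection period (days):", model.predict(new_customers).round(1))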
Data visualization tools are becoming increasingly popular because of
the way these tools help users obtain better insights, draw conclusions and
handle large datasets (Skapoullis, 2018). For example, auditors have begun
to use visualizations as a tool to look at multiple accounts over multiple
years to detect misstatements. If an auditor is attempting to examine a
company’s accounts payable (AP) balances over the last ten years compared
to the industry average, a data visualization tool like PowerBI or Tableau
can quickly produce a graph that compares two measures against one
dimension. The measures are the quantitative data, which are the company’s
AP balances versus the industry averages. The dimension is a qualitative
categorical variable. What distinguishes these data visualization tools from a
simple Excel graph is that this information (“sheet”) can be easily formatted
and combined with other important information (“other sheets”) to create a
dashboard where numerous sheets are compiled to provide an overall view
that shows the auditor a cohesive audit examination of misstatement risk
or anomalies in the company’s AP balances. As real-time data is streamed
to update the dashboard, auditors could also examine the most current
transactions that affect AP balances, enabling the auditor to perform a
continuous audit. A real-time quality dashboard that provides real-time alerts
also enables collaboration among the audit team on a continuous basis,
coupled with real-time supervisory review. Analytical
procedures and tests of transactions can be performed more continually, and the
auditor can investigate unusual fluctuations more promptly. The continuous
review can also help to even out the workload of the audit team as the audit
team members are kept abreast of the client’s business environment and
financial performance throughout the financial year.
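The two-measures-against-one-dimension comparison described above can be sketched in a few lines of code; the example below deliberately substitutes pandas and matplotlib for Power BI or Tableau, and uses invented AP balances and industry averages over ten years purely for illustration.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({
    "year": range(2012, 2022),  # the single dimension
    "company_ap": [1.2, 1.3, 1.4, 1.8, 1.7, 2.4, 2.6, 3.4, 3.1, 3.9],      # measure 1, $m
    "industry_avg_ap": [1.1, 1.2, 1.3, 1.4, 1.5, 1.7, 1.8, 2.0, 2.1, 2.3], # measure 2, $m
})

# One chart comparing the two measures against the year dimension; in a BI tool
# this "sheet" would be combined with others into a dashboard.
ax = data.plot(x="year", y=["company_ap", "industry_avg_ap"], marker="o")
ax.set_ylabel("Accounts payable ($ millions)")
ax.set_title("Company AP vs. industry average")
plt.tight_layout()
plt.show()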
The next section discusses machine learning applications to aid the
audit process. Section 3 describes predictive analytics and how accountants
and auditors use these tools to generate actionable insights for companies.
Section 4 discusses data visualization and its role in the accounting and audit
profession. Section 5 concludes.

MACHINE LEARNING

Machine Learning in Audit Processes


Machine learning is a subset of artificial intelligence that automates
analytical model building. Machine learning uses these models to perform
data analysis in order to understand patterns and make predictions. The
machines are programmed to use an iterative approach to learn from the
analyzed data, making the learning an automated and continuous process.
As the machine is exposed to greater amount of data, more robust patterns
are recognized. In turn, this iterative process helps to refine the data analysis
process. Machine learning and traditional statistical analysis are similar in
many aspects. However, while statistical analysis is based on probability
theory and probability distributions, machine learning is designed to find
that optimal combination of mathematical equations that best predict an
outcome. Thus, machine learning is well suited for a broad range of problems
that involve classification, linear regression, and cluster analysis.
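To give one concrete instance of the cluster analysis mentioned above, the sketch below groups hypothetical transactions by amount and posting hour using k-means from scikit-learn; the data and the choice of two clusters are illustrative assumptions only.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical transactions: [amount, hour_of_day_posted]
transactions = np.array([
    [120.0, 10], [95.0, 11], [110.0, 9], [130.0, 14],
    [4900.0, 23], [5100.0, 22], [4700.0, 1], [105.0, 13],
])

# Scale the features so the amount does not dominate, then cluster into two groups.
scaled = StandardScaler().fit_transform(transactions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

for tx, label in zip(transactions, labels):
    print(f"amount={tx[0]:>7.2f}  hour={int(tx[1]):>2}  cluster={label}")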
The predictive reliability of machine learning applications is dependent
on the quality of the historical data that has been fed to the machine. New
and unforeseen events may create invalid results if they are left unidentified
or inappropriately weighted. As a result, human biases can influence the
use of machine learning. Such biases can affect which data sets are chosen
for training the AI application, the methods chosen for the process, and the
interpretation of the output. Finally, although machine learning technology
has great potential, its models are still currently limited by many factors,
including data storage and retrieval, processing power, algorithmic modeling
assumptions, and human errors and judgment.
Machine learning technology for auditing is a very promising
area (Dickey et al., 2019). Several of the Big 4 audit firms have machine
learning systems under development, and smaller audit firms are beginning
to benefit from improving viability of this technology. It is expected that
auditing standards will adapt to take into account the use of machine learning
in the audit process. Regulators and standard setters will also need to consider
how they can incorporate the impact of this technology in their regulatory
and decision making process. Likewise, educational programs will continue
to evolve to this new paradigm. We foresee that more accounting programs
with data analytics and machine learning specializations will become the
norm rather than the exception.
Although there are certain limitations to the current capability of
machine learning, it excels at performing repetitive tasks. Because an audit
process requires a vast amount of data and has a significant number of task-
related components, machine learning has the potential to increase both
the speed and quality of audits. By harnessing machine-based performance
of redundant tasks, it will free up more time for the auditors to undertake
review and analytical work.

Current Audit Use Cases


Audit firms are already testing and exploring the power of machine learning
in audits. One example is Deloitte’s use of Argus, a machine learning tool
that “learns” from every human interaction and leverages advanced machine
learning techniques and natural language processing to automatically
identify and extract key accounting information from any type of electronic
document such as leases, derivatives contracts, and sales contracts. Argus is
programmed with algorithms that allow it to identify key contract terms, as
well as trends and outliers. It is highly possible for a well-designed machine
to not just read a lease contract, identify key terms, determine whether it
is a capital or operating lease, but also to interpret nonstandard leases with
significant judgments (e.g., those with unusual asset retirement obligations).
This would allow auditors to review and assess larger samples—even up to
100% of the documents, spend more time on judgemental areas and provide
greater insights to audit clients, thus improving both the speed and quality
of the audit process.
Another example of machine learning technology currently used by
PricewaterhouseCoopers is Halo. Halo analyzes journal entries and can
identify potentially problematic areas, such as entries with keywords of a
questionable nature, entries from unauthorized sources, or an unusually high
number of journal entry postings just under authorized limits. Similar to
Argus, Halo allows auditors to test 100% of the journal entries; by focusing
only on the outliers with the highest risk, both the speed and quality of the
testing procedures are significantly improved.
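The kind of journal-entry screening described for Halo can be approximated with simple filters; the sketch below uses pandas with hypothetical field names and thresholds, flagging entries whose descriptions contain questionable keywords or whose amounts fall just under an authorization limit. It is only an illustration of the logic, not a description of how Halo itself is implemented.

import pandas as pd

entries = pd.DataFrame({
    "je_id": [1, 2, 3, 4],
    "description": ["adjust per request", "plug to balance", "monthly accrual", "reclass"],
    "amount": [4990.00, 12000.00, 2500.00, 4999.99],
    "posted_by": ["user_a", "user_b", "user_a", "user_c"],
})

APPROVAL_LIMIT = 5000.00          # hypothetical authorization threshold
KEYWORDS = ["plug", "adjust", "override"]

keyword_hit = entries["description"].str.contains("|".join(KEYWORDS), case=False)
just_under_limit = (entries["amount"] >= 0.97 * APPROVAL_LIMIT) & \
                   (entries["amount"] < APPROVAL_LIMIT)

# The full population is tested; only the flagged outliers go to the auditor.
flagged = entries[keyword_hit | just_under_limit]
print(flagged[["je_id", "description", "amount"]])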

Potential Machine Learning Applications


Audit firms and academics are studying additional ways that machine
learning can be used in financial statement audits, particularly in the risk
assessment process. For example, machine learning technologies such
as speech recognition could be used to examine and diagnose executive
fraud interviews. The software can be used to identify situations when
interviewees give questionable answers, such as “sort of” or “maybe,”
that suggest potential deceptive behavior. Significant delays in responses,
which might also indicate deliberate concealment of information, can also
be picked up by such speech recognition technology. Facial recognition
technologies can be applied toward fraud interviews as well. An AI software
that uses facial recognition can help to identify facial patterns that suggest
excess nervousness or deceit during entrant interviews. The assistance of
speech and facial recognition technology in fraud interviews could certainly
complement auditors and notify them when higher-risk responses warrant
further investigation.
A study assessed risk with machine learning by using a
deep neural network (DNN) model developed and tested for a drive-off scenario
involving an Oil & Gas drilling rig. The results of the study show a reasonable
level of accuracy for DNN predictions and a partial suitability to overcome
risk assessment challenges. Perhaps such a deep learning approach can be
extended to auditing by training the model on past indicators of inherent
risk, for the purpose of assessing risk of material misstatements. Data from
various exogenous sources, such as forum posts, comments, conversations
from social media, press release, news, management discussion notes, can
be used to supplement traditional financial attributes to train the model to
virtually assess the inherent risk levels (Paltrinieri et al., 2019).
The use of machine learning for risk assessment can also be applied
to assessment of going concern risk. By studying the traits of companies
that have gone under financial distress, a Probability of Default (PD) model
can be developed, with the aim of quantifying going concern risk on a timelier
basis. The predictive model requires an indicator of financial distress and a
set of indicators that leverage on environmental and financial performance
scanning to produce a PD that is dynamically updated according to firm
performance (Martens et al., 2008).
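A minimal sketch of such a PD model is shown below, assuming scikit-learn and three hypothetical distress indicators (leverage, liquidity, margin trend); the indicator set and the example values are invented for illustration, whereas a real model would draw on the environmental and financial performance scanning described above.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows are companies; columns are [debt_to_assets, current_ratio, margin_trend].
X = np.array([
    [0.85, 0.7, -0.10], [0.40, 2.1, 0.05], [0.90, 0.5, -0.20],
    [0.30, 2.5, 0.08], [0.75, 0.9, -0.05], [0.55, 1.4, 0.01],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = subsequently experienced financial distress

pd_model = LogisticRegression().fit(X, y)

# Probability of default for a new client; re-fitting as fresh indicators arrive
# gives the dynamically updated PD the text refers to.
new_client = np.array([[0.70, 1.0, -0.03]])
print("estimated PD:", round(float(pd_model.predict_proba(new_client)[0, 1]), 3))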
The impact on businesses and the accounting profession will
undoubtedly be significant in the near future. The major public accounting
firms are focused on providing their customers with the expertise needed
to deploy machine learning algorithms in businesses to accelerate
and improve business decisions while lowering costs. In May 2018,
PricewaterhouseCoopers announced a joint venture with eBravia, a contract
analytics software company, to develop machine learning algorithms for
contract analysis. Those algorithms could be used to review documents
related to lease accounting and revenue recognition standards as well as
other business activities, such as mergers and acquisitions, financings, and
divestitures. In the area of advisory services, Deloitte has advised retailers
on how they can enhance customer experience by using machine learning to
target product and services based on past buying patterns. While the major
public accounting firms may have the financial resources to invest in machine
learning, small public accounting firms can leverage on these technological
solutions and use pre-built machine learning algorithms to develop expertise
through their own implementations at a smaller scale.

DATA ANALYTICS

Predictive Analysis in Accounting Processes


Traditionally, accounting has focused more on fact-based, historical
reporting. While this perspective helps executives in analyzing historical
results so they can adjust their strategic and operational plans going forward,
it does not necessarily help them better predict and more aggressively plan
for the future.
Finding the right solution to enable a detailed analysis of financial data
is critical in the transition from looking at the historical financial data to
find predictors that enable forward-looking business intelligence (BI). A BI
solution leverages on patterns in your data. Looking at consolidated data
in an aggregate manner rather than in a piecemeal ad-hoc process from
separate information systems provides an opportunity to uncover hidden
trends and is a useful functionality for predictive analytics. For example, in
customer relationship management (CRM) systems, improved forecasting
is important in better planning for capacity peaks and troughs that directly
impact the customer experience, response time, and transaction volumes.
Many accountants are already using data analytics in their daily work.
They compute sums, averages, and percent changes to report sales results,
customer credit risk, cost per customer, and availability of inventory.
Accountants also are generally familiar with diagnostic analytics because
they perform variance analyses and use analytic dashboards to explain
historical results.
The various attempts to predict financial performance by leveraging
nonfinancial performance measures that might be good predictors of financial
performance are expected to gain much traction in the
coming years. This presents a great opportunity for accountants to play a
much more valuable role to management. Hence, accountants should further
harness the power of data analytics to effectively perform their roles.
Predictive analytics and prescriptive analytics are important because they
provide actionable insights for companies. Accountants need to increase their
competence in these areas to provide value to their organizations. Predictive
analytics integrates data from various sources (such as enterprise resource
planning, point-of-sale, and customer relationship management systems) to
predict future outcomes based on statistical relationships found in historical
data using regression-based modeling. One of the most common applications
of predictive analytics is the computation of a credit score to indicate the
likelihood of timely future credit payments. Prescriptive analytics utilizes
a combination of sophisticated optimization techniques (self-optimizing
algorithms) to make recommendations on the most favorable courses of
action to be taken.
The analytics skills that an accountant needs will differ depending on
whether the accounting professional will produce or consume information.
Analytics production includes sourcing relevant data and performing
analyses, which is more suitable for junior-level accountants. Analytics
consumption is using the insights gained from analytics in decision-making
and is more relevant for senior-level roles. It is not expected that accountants
need to retool to become data scientists or computer engineers to harness
analytics tools. Nevertheless, it is most important that the audit and accounting
professions become more proficient consumers of analytics to both enhance
their current audit practice with available technologies as well as to support
their client base in undertaking data analytics activities (Tschakert et al.,
2016).

Data Analytics Applications in Audit Processes


Audit Data Analytics (ADAs) help auditors discover and analyze patterns,
identify anomalies and extract other useful information from audit data
through analysis, modeling and visualization. Auditors can use ADAs to
perform a variety of procedures to gather audit evidence, to help with the
extraction of data and facilitate the use of audit data analytics, and to
illustrate where audit data analytics can be used in a typical audit
program (McQuilken, 2019).
Governance, risk and control, and compliance monitoring systems
commonly used by larger companies include systems developed by Oracle,
SAP and RSA Archer. Oracle and SAP have application-side business
intelligence systems centred on business warehouses. Lavastorm, Alteryx
and Microsoft’s SQL server provide advanced tools for specialists such as
business analysts and, increasingly, for non-specialists. All these platforms
are currently the preserve of large systems integrators, larger and mid-tier
firm consultancies and specialist data analysts. It seems likely though, that
over time these systems will move in-house or be provided as managed
services. It also seems likely that companies such as CaseWare and Validis,
which currently provide data analytics services to larger and mid-tier firms,
will continue to enable those firms to offer data analytics services to their
own clients.
Some businesses already analyze their own data in a similar manner
to auditors. As these business analyses become deeper, wider, and more
sophisticated, with a focus on risk and performance, it seems likely that they
will align at least in part with the risks assessed by external auditors.
Data analytics is rooted in software originally developed in the early
2000s for data mining in the banking and retail sectors, and for design
and modelling in financial services and engineering. What is phenomenal
about this process is the volumes of data that can be handled efficiently
on an industrial scale, and the speed of calculations being performed in
a fraction of a second. The type of tasks such software can perform, and
the connections it can make, dwarf what was previously possible. These
technological improvements have facilitated the advances that we have seen
in data analytics software (Davenport, 2016).

Current Audit Use Cases


By using data analytics procedures, accountants and auditors can produce
high-quality, statistical forecasts that help them understand and identify
risks relating to the frequency and value of accounting transactions. Some
of these procedures are simple, others involve complex models. Auditors
using these models will exercise professional judgement to determine
mathematical and statistical patterns, helping them identify exceptions for
extended testing (Zabeti, 2019).
Auditors commonly use data analytics procedures to examine:
• Receivables and payables ageing (a minimal ageing sketch follows this list);
• Analysis of gross margins and sales, highlighting items with
negative margins;
• Analysis of capital expenditure versus repairs and maintenance;
• Matching of orders and purchases;
• Testing of journal entries.
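As a minimal illustration of the first item above, the sketch below builds a receivables ageing from hypothetical invoice data with pandas, bucketing open invoices by days outstanding at a chosen cut-off date; the bucket boundaries and the data are illustrative assumptions.

import pandas as pd

invoices = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "C"],
    "invoice_date": pd.to_datetime(
        ["2021-11-15", "2021-12-20", "2021-10-01", "2021-12-28", "2021-09-10"]),
    "amount": [500.0, 750.0, 1200.0, 300.0, 900.0],
})

cutoff = pd.Timestamp("2021-12-31")
invoices["days_outstanding"] = (cutoff - invoices["invoice_date"]).dt.days

# Standard ageing buckets; labels and boundaries are illustrative.
invoices["bucket"] = pd.cut(invoices["days_outstanding"],
                            bins=[-1, 30, 60, 90, 10_000],
                            labels=["0-30", "31-60", "61-90", "90+"])

ageing = (invoices.groupby(["customer", "bucket"], observed=False)["amount"]
                  .sum().unstack(fill_value=0))
print(ageing)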
Although data analytics techniques may not entirely substitute the
traditional audit procedures and techniques, they can be powerful enablers
which allow auditors to perform procedures and analysis which were not
traditionally possible.
For example, a three-way match process is one of the most basic
procedures in audit. Traditionally, auditors perform this procedure by way of
sample testing as it is typically not realistic nor expected for auditors to vouch
all transaction documents. Data analytics techniques now provide auditors
the ability to analyze all the transactions which have been recorded. Hence,
auditors can potentially filter and identify a specific class of transactions
with unmatched items. Data analytics tools can also allow auditors to trace
revenue transactions to debtors and the subsequent cash received and also
analyze payments made after period end. This technique can relate the
subsequent payments with the delivery dates extracted from the underlying
delivery documents to ascertain if the payments relate to goods delivered
before the period end or after the period end and also determine the amount
of unrecorded liability.
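A full-population three-way match can be expressed directly in code; the sketch below, assuming pandas and hypothetical document tables for purchase orders, goods receipts, and invoices, joins the three populations and isolates exceptions for auditor follow-up instead of relying on a sample.

import pandas as pd

orders = pd.DataFrame({"po": ["P1", "P2", "P3"], "po_amount": [100, 250, 400]})
receipts = pd.DataFrame({"po": ["P1", "P2"], "received_amount": [100, 250]})
invoices = pd.DataFrame({"po": ["P1", "P2", "P3"], "invoiced_amount": [100, 260, 400]})

matched = (orders.merge(receipts, on="po", how="left")
                 .merge(invoices, on="po", how="left"))

# An exception is any order with no goods receipt or with amount mismatches.
matched["exception"] = (
    matched["received_amount"].isna()
    | (matched["po_amount"] != matched["received_amount"])
    | (matched["po_amount"] != matched["invoiced_amount"])
)

# Only the exceptions, not a sample, are followed up by the auditor.
print(matched[matched["exception"]])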

DATA VISUALIZATION

Data Visualization in Accounting and Audit Processes


The auditing and accounting professions have allocated a large amount
of resources to understanding the impact of different data visualization
techniques in decision making and analytical procedures. As technology
evolves, the size and volume of data continuously grow, and new ways
to present information emerge, it is vital for accounting and auditing
research to examine newer data visualization techniques (Alawadhi, 2015).
The main objective of data visualization is to help users obtain better
insights, draw better conclusions and eventually create hypotheses. This is
achieved by integrating the user’s perceptual abilities to the data analysis
process, and applying their flexibility, creativity, and general knowledge to
the large data sets available in today’s systems. Data visualization involves
several main advantages. It presents data in a concise manner. It also allows
for faster data exploration in large data sets. Finally, data visualization tools
are intuitive and do not require an understanding of complex mathematical
or statistical algorithms.
New software is constantly being developed to help users work with
the ever-increasing volume of data produced by businesses. More and
more accounting firms and private businesses are using new BI tools such
as Tableau, Power BI and QlikSense (Eaton & Baader, 2018). Auditors
have begun to use visualizations as a tool to look at multiple accounts over
multiple years to detect misstatements. These tools can be used in risk
analysis, transaction and controls testing, analytical procedures, in support
of judgements and to provide insights. Many data analytics routines can now
easily be performed by auditors with little or no management involvement.
The ability to perform these analyses independently is important. Many
routines can be performed at a very detailed level. The higher-level routines
can be used for risk analysis to find a problem, while the more detailed
analysis can be used to provide audit evidence and/or insights.
Another promising feature of data visualization tools relates to an
audit engagement communication. With these tools, information can be
summarized and presented in a way that captures attention and is sufficient. A reader
of the report will get the required information with a simple glance of a visual
presentation. It is possible that an opinion will be much more powerful if it is
accompanied by a visualization of facts rather than statements describing
the factors to support the opinion. Introducing visualization techniques can
make reports easier to read and understand while focusing on the main
figures of what an auditor is trying to report. While analyzing data is the
crux of an external audit, it is critical that auditors know how to work with
data. Doing so ensures they will better understand their client and plan a
quality audit.
As the pace of innovation continues to increase, data visualization may
become a necessary part of the job for many accountants and auditors.
Accountants and auditors need to use vast amounts of data to not only report
on the past but also provide timely assurance and insights about the business’
future. There is a need to employ dynamic analytics or visualization tools
to increase the impact of their opinions and recommendations. Thus, it is
imperative that the accounting profession adopts and implements dynamic
reporting and visualization techniques that deal with the big-data problem
and produce results that enhance the ability to make an impact and gain
influence.

CONCLUSION
The use of automation, big data and other technological advances such as
machine learning will continue to grow in accounting and audit, producing
important business intelligence tools that provide historical, current and
predictive views of business operations in interactive data visualizations.
Business intelligence systems allow accounting professionals to make better
decisions by analyzing very large volumes of data from all lines of business,
resulting in increased productivity and accuracy and better insights to make
more informed decisions. The built-in, customizable dashboards allow for
real-time reporting and analysis, where exceptions, trends and opportunities
can be identified and transactional data drilled down for greater detail.
Analytics, artificial intelligence, and direct linkages to clients’
transaction systems can allow audits to be a continuous rather than an
annual process, and material misstatements and financial irregularities can
be detected in real time as they occur, providing near real-time assurance.
Audit team members could spend less time performing repetitive low-level
tasks in verifying transactional data and be involved in high-value tasks
by focusing their efforts on the interpretation of the results produced by
machines. With adequate understanding of the wider business and economic
environment in which the client entity operates, from changes in technology
or competition, auditors are more able to assess the reasonableness of the
assumptions made by management, instead of just focusing on mechanical
details. Such improvements will enhance the application of professional
skepticism in framing auditor’s judgments when performing risk assessment
procedures and consequently, design an audit strategy and approach that
will be responsive to the assessed risks of material misstatement.
As audits become substantially more automated in the future, auditors
could also provide valuable insights to the clients, such as how the clients’
performances fare in comparison with similar companies on key metrics and
benchmarks, providing value-added services in addition to audit service.
Eventually, it will be the investing public who will benefit from higher
quality, more insightful audits powered by machine learning and big data
analysis across clients and industries.

ACKNOWLEDGEMENTS
Mui Kim Chu is senior lecturer at Singapore Institute of Technology. Kevin
Ow Yong is associate professor at Singapore Institute of Technology. We
wish to thank Khin Yuya Thet for her research assistance. All errors are our
own.

REFERENCES
1. Alawadhi, A. (2015). The Application of Data Visualization in
Auditing. Rutgers, The State University of New Jersey
2. Davenport, T. H. (2016). The Power of Advanced Audit Analytics
Everywhere Analytics. Deloitte Development LLC. https://www2.
deloitte.com/content/dam/Deloitte/us/Documents/deloitte-analytics/
us-da-advanced-audit-analytics.pdf
3. Dickey, G., Blanke, S., & Seaton, L. (2019). Machine Learning in
Auditing: Current and Future Applications. The CPA Journal, 89, 16-
21.
4. Eaton, T., & Baader, M. (2018). Data Visualization Software: An
Introduction to Tableau for CPAs. The CPA Journal, 88, 50-53.
5. Haq, I., Abatemarco, M., & Hoops, J. (2020). The Development of
Machine Learning and its Implications for Public Accounting. The
CPA Journal, 90, 6-9.
6. IAASB (2018). Exploring the Growing Use of Technology in the Audit,
with a Focus on Data Analytics. International Auditing and Assurance
Standards Board.
7. Lim, J. M., Lam, T., & Wang, Z. (2020). Using Data Analytics in a
Financial Statement Audit. IS Chartered Accountant Journal.
8. Martens, D., Bruynseels, L., Baesens, B., Willekens, M., & Vanthienen,
J. (2008). Predicting Going Concern Opinion with Data Mining.
Decision Support Systems, 45, 765-777. https://doi.org/10.1016/j.
dss.2008.01.003
9. McQuilken, D. (2019). 5 Steps to Get Started with Audit Data Analytics.
AICPA. https://blog.aicpa.org/2019/05/5-steps-to-get-started-with-
audit-data-analytics.html#sthash.NSlZVigi.dpbs
10. Paltrinieri, N., Comfort, L., & Reniers, G. (2019). Learning about Risk:
Machine Learning for Risk Assessment. Safety Science, 118, 475-486.
https://doi.org/10.1016/j.ssci.2019.06.001
11. Skapoullis, C. (2018). The Need for Data Visualisation. ICAEW.
https://www.icaew.com/technical/business-and-management/strategy-
risk-and-innovation/risk-management/internal-audit-resource-centre/
the-need-for-data-visualisation
12. Tschakert, N., Kokina, J., Kozlowski, S., & Vasarhelyi, M. (2016). The
Next Frontier in Data Analytics. Journal of Accountancy, 222, 58.
13. Zabeti, S. (2019). How Audit Data Analytics Is Changing Audit. Accru.
Chapter 5

Big Data Analytics in Immunology: A Knowledge-Based Approach

Guang Lan Zhang1, Jing Sun2, Lou Chitkushev1, and Vladimir Brusic1

1 Department of Computer Science, Metropolitan College, Boston University, Boston, MA 02215, USA
2 Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA

ABSTRACT
With the vast amount of immunological data available, immunology
research is entering the big data era. These data vary in granularity, quality,
and complexity and are stored in various formats, including publications,
technical reports, and databases. The challenge is to make the transition
from data to actionable knowledge and wisdom and bridge the knowledge
gap and application gap. We report a knowledge-based approach based on
a framework called KB-builder that facilitates data mining by enabling
fast development and deployment of web-accessible immunological data
knowledge warehouses. Immunological knowledge discovery relies heavily
on both the availability of accurate, up-to-date, and well-organized data
and the proper analytics tools. We propose the use of knowledge-based
approaches by developing knowledgebases combining well-annotated
data with specialized analytical tools and integrating them into analytical
workflow. A set of well-defined workflow types with rich summarization
and visualization capacity facilitates the transformation from data to critical
information and knowledge. By using KB-builder, we enabled streamlining
of normally time-consuming processes of database development. The
knowledgebases built using KB-builder will speed up rational vaccine
design by providing accurate and well-annotated data coupled with tailored
computational analysis tools and workflow.

Citation: Guang Lan Zhang, Jing Sun, Lou Chitkushev, Vladimir Brusic, “Big Data Analytics in Immunology: A Knowledge-Based Approach”, BioMed Research International, vol. 2014, Article ID 437987, 9 pages, 2014. https://doi.org/10.1155/2014/437987.
Copyright: © 2014 by Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION
Data represent the lowest level of abstraction and do not have meaning
by themselves. Information is data that has been processed so that it
gives answers to simple questions, such as “what,” “where,” and “when.”
Knowledge represents the application of data and information at a higher
level of abstraction, a combination of rules, relationships, ideas, and
experiences, and gives answers to “how” or “why” questions. Wisdom
is achieved when the acquired knowledge is applied to offer solutions to
practical problems. The data, information, knowledge, and wisdom (DIKW)
hierarchy summarizes the relationships between these levels, with data at
its base and wisdom at its apex and each level of the hierarchy being an
essential precursor to the levels above (Figure 1(a)) [1, 2]. The acquisition
cost is lowest for data acquisition and highest for knowledge and wisdom
acquisition (Figure 1(b)).
Figure 1: The DIKW hierarchy. (a) The relative quantities of data, information,
knowledge, and wisdom. (b) The relative acquisition cost of the different layers.
(c) The gap between data and knowledge and (d) the gap between knowledge
and wisdom.
In immunology, for example, a newly sequenced molecular sequence
without functional annotation is a data point, information is gained by
annotating the sequence to answer questions such as which viral strain
it originates from, knowledge may be obtained by identifying immune
epitopes in the viral sequence, and the design of a peptide-based vaccine
using the epitopes represents the wisdom level. Overwhelmed by the vast
amount of immunological data, we are confronted with several challenges in
making the transition from data to actionable knowledge and wisdom and in
bridging the knowledge and application gaps. These include asking the
“right questions,” handling unstructured data, data quality control (garbage
in, garbage out), integrating data from various sources in various formats,
and developing specialized analytics tools with the capacity to handle large
volume of data.
The human immune system is a complex system comprising the innate
immune system and the adaptive immune system. There are two branches of
adaptive immunity, humoral immunity effected by the antibodies and cell-
mediated immunity effected by the T cells of the immune system. In humoral
immunity, B cells produce antibodies for neutralization of extracellular
pathogens and their antigens that prevent the spread of infection. The
activation of B cells and their differentiation into antibody-secreting plasma
cells is triggered by antigens and usually requires helper T cells [3]. B cells
identify antigens through B-cell receptors, which recognize discrete sites on
the surface of target antigens called B-cell epitopes [4].
Cellular immunity involves the activation of phagocytes, antigen-specific
cytotoxic T-lymphocytes (CTLs), and the release of various cytokines in
response to pathogens and their antigens. T cells identify foreign antigens
through their T-cell receptors (TCRs), which interact with a peptide antigen
in complex with a major histocompatibility complex (MHC) molecule in
conjunction with CD4 or CD8 coreceptors [5, 6]. Peptides that induce immune
responses, when presented by MHC on the cell surface for recognition by
T cells, are called T-cell epitopes. CD8+ T cells control infection through
direct cytolysis of infected cells and through production of soluble antiviral
mediators. This function is mediated by linear peptide epitopes presented by
MHC class I molecules. CD4+ T cells recognize epitopes presented by MHC
class II molecules on the surface of infected cells and secrete lymphokines
that stimulate B cells and cytotoxic T cells. The Immune Epitope Database
(IEDB) [7] hosts nearly 20,000 T-cell epitopes as of Feb. 2014.
The recognition of a given antigenic peptide by an individual immune
system depends on the ability of this peptide to bind one or more of the host’s
human leukocyte antigens (HLA-human MHC). The binding of antigenic
peptides to HLA molecules is the most selective step in identifying T-cell
epitopes. There is a great diversity of HLA genes with more than 10,000
known variants characterized as of Feb. 2014 [8]. To manage this diversity,
the classification of HLA into supertypes was proposed to describe those
HLA variants that have small differences in their peptide-binding grooves
and share similar peptide-binding specificities [9, 10]. Peptides that can
bind multiple HLA variants are termed “promiscuous peptides.” They are
suitable for the design of epitope-based vaccines because they can interact
with multiple HLA within human populations.
The concept of reverse vaccinology supports identification of vaccine
targets by large-scale bioinformatics screening of entire pathogenic genomes
followed by experimental validation [11]. Using bioinformatics analysis to
select a small set of key wet-lab experiments for vaccine design is becoming
a norm. The complexity of identification of broadly protective vaccine
targets arises from two principal sources, the diversity of pathogens and
the diversity of human immune system. The design of broadly protective
peptide-based vaccines involves the identification and selection of vaccine
targets composed of conserved T-cell and B-cell epitopes that are broadly
cross-reactive to viral subtypes and protective of a large host population
(Figure 2).

Figure 2: The process of rational vaccine discovery using knowledge-based
systems. The design of broadly protective peptide-based vaccines involves
identification and selection of vaccine targets composed of conserved T-cell
and B-cell epitopes that are broadly cross-reactive to pathogen subtypes and
protective of a large host population.
Fuelled by the breakthroughs in genomics and proteomics and advances in
instrumentation, sample processing, and immunological assays, immunology
research is entering the big data era. These data vary in granularity, quality,
and complexity and are stored in various formats, including publications,
technical reports, and databases. Next generation sequencing technologies
are shifting the paradigm of genomics and allowing researchers to perform
genome-wide studies [12]. It was estimated that the amount of publically
available genomic data will grow from petabytes (1015) to exabytes (1018)
[13]. Mass spectrometry (MS) is the method for detection and quantitation
of proteins. The technical advancements in proteomics support exponential
growth of the numbers of characterized protein sequences. It is estimated
that more than 2 million protein variants make the posttranslated human
proteome in any human individual [14]. Capitalizing on the recent advances
in immune profiling methods, the Human Immunology Project Consortium
(HIPC) is creating large data sets on human subjects undergoing influenza
vaccination or who are infected with pathogens including influenza virus,
West Nile virus, herpes zoster, pneumococcus, and the malaria parasite [15].
Systems biology aims to study the interactions between relevant molecular
components and their changes over time and enable the development of
predictive models. The advent of technological breakthroughs in the fields of
genomics, proteomics, and other “omics” is catalyzing advances in systems
immunology, a new field under the umbrella of system biology [16]. The
synergy between systems immunology and vaccinology enables rational
vaccine design [17].
Big data describes the environment where massive data sources combine
both structured and unstructured data so that the analysis cannot be performed
using traditional database and analytical methods. Increasingly, data sources
from literature and online sources are combined with the traditional types
of data [18] for summarization of complex information, extraction of
knowledge, decision support, and predictive analytics. With the increase of
the data sources, both the knowledge and application gaps (Figures 1(c) and
1(d)) keep widening and the corresponding volumes of data and information
are rapidly increasing. We describe a knowledge-based approach that helps
reduce the knowledge and application gaps for applications in immunology
and vaccinology.

MATERIALS AND METHODS


In the big data era, knowledge-based systems (KBSs) are emerging as
knowledge discovery platforms. A KBS is an intelligent system that employs
a computationally tractable knowledgebase or repository in order to reason
upon data in a targeted domain and reproduce expert performance relative
to such reasoning operations [19]. The goal of a KBS is to increase the
reproducibility, scalability, and accessibility of complex reasoning tasks
[20]. Some of the web-accessible immunological databases, such as Cancer
Immunity Peptide Database that hosts four static data tables containing four
types of tumor antigens with defined T-cell epitopes, focus on cataloging the
data and information and pay little attention to the integration of analysis
tools [21, 22]. Most recent web-accessible immunological databases,
such as Immune Epitope Database (IEDB) that catalogs experimentally
characterized B-cell and T-cell epitopes and data on MHC binding and MHC
ligand elution experiments, started to integrate some data analysis tools [7,
23]. To bridge the knowledge gap between immunological information and
knowledge, we need KBSs that tightly integrate data with analysis tools to
enable comprehensive screening of immune epitopes from a comprehensive
landscape of a given disease (such as influenza, flaviviruses, or cancer), the
analysis of crossreactivity and crossprotection following immunization or
vaccination, and prediction of neutralizing immune responses. We developed
a framework called KB-builder to facilitate data mining by enabling fast
development and deployment of web-accessible immunological data
knowledge warehouses. The framework consists of seven major functional
modules (Figure 3), each facilitating a specific aspect of the knowledgebase
construction process. The KB-builder framework is generic and can be
applied to a variety of immunological sequence datasets. Its aim is to enable
the development of a web-accessible knowledgebase and its corresponding
analytics pipeline within a short period of time (typically within 1-2 weeks),
given a set of annotated genetic or protein sequences.

Figure 3: The structure of KB-builder.


The design of a broadly protective peptide-based vaccine against
viral pathogens involves the identification and selection of vaccine targets
composed of conserved T-cell and B-cell epitopes that are broadly cross-
reactive to a wide range of viral subtypes and are protective in a large majority
of host population (Figure 2). The KB-builder facilitates a systematic
discovery of vaccine targets by enabling fast development of specialized
bioinformatics KBS that tightly integrate the content (accurate, up-to-date,
and well-organized antigen data) with tailored analysis tools.
The input to KB-builder is data scattered across primary databases and
scientific literature (Figure 3). Module 1 (data collection and processing
module) performs automated data extraction and initial transformations.
The raw antigen data (viral or tumor) consisting of protein or nucleotide
sequences, or both, and their related information are collected from various
sources. The collected data are then reformatted and organized into a
unified XML format. Module 2 (data cleaning, enrichment, and annotation
module) deals with data incompleteness, inconsistency, and ambiguities
due to the lack of submission standards in the online primary databases.
The semiautomated data cleaning is performed by domain experts to ensure
data quality, completeness, and redundancy reduction. Semiautomated data
enrichment and annotation are performed by the domain experts further
enhancing data quality. The semiautomation involves automated comparison
of new entries to the entries already processed within the KB and comparison
of terms that are entered into locally implemented dictionaries. Terms that
match the existing record annotations and dictionary terms are automatically
processed. New terms and new annotations are inspected by a curator and
if in error they are corrected, or if they represent novel annotations or
terms they are added to the knowledgebase and to the local dictionaries.
Module 3 (the import module) performs automatic import of the XML file
into the central repository. Module 4 (the basic analysis toolset) facilitates
fast integration of common analytical tools with the online antigen KB.
All our knowledgebases have the basic keyword search tools for locating
antigens and T-cell epitopes or HLA ligands. The advanced keyword search
tool was included in FLAVIdB, FLUKB, and HPVdB, where users further
restrict the search by selecting virus species, viral subtype, pathology, host
organism, viral strain type, and several other filters. Other analytical tools
include sequence similarity search enabled by basic local alignment search
tool (BLAST) [24] and color-coded multiple sequence alignment (MSA)
tool [25] on user-defined sequence sets as shown in Figure 4. Module 5 (the
specialized analysis toolset) facilitates fast integration of specialized analysis
tools designed according to the specific purpose of the knowledgebase and
the structural and functional properties of the source of the sequences. To
facilitate efficient antigenicity analysis, in every knowledgebase and within
each antigen entry, we embedded a tool that performs on-the-fly binding
prediction to 15 frequent HLA class I and class II alleles. In TANTIGEN,
an interactive visualization tool, mutation map, has been implemented to
provide a global view of all mutations reported in a tumor antigen. Figure
5 shows a screenshot of mutation map of tumor antigen epidermal growth
factor receptor (EGFR) in TANTIGEN. In TANTIGEN and HPVdB, a
T-cell epitope visualization tool has been implemented to display epitopes
in all isoforms of a tumor antigen or sequences of a HPV genotype. The
B-cell visualization tool in FLAVIdB and FLUKB displays neutralizing
B-cell epitope positions on viral protein three-dimensional (3D) structures
[26, 27]. To analyze viral sequence variability, given an MSA of a set of sequences, a tool was developed to calculate Shannon entropy at each alignment position. To identify conserved T-cell epitopes that cover the majority of the viral population, we developed and integrated a block entropy analysis tool in FLAVIdB and FLUKB to analyze peptide conservation and variability. We developed a novel sequence logo tool, BlockLogo, optimized for visualization of continuous and discontinuous motifs and fragments [28, 29]. When paired with the HLA binding prediction tool, BlockLogo is a useful tool for rapid assessment of the immunological potential of selected regions in an MSA, such as alignments of viral sequences or tumor antigens.
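As a concrete illustration of the variability analysis just described, the short Python sketch below computes Shannon entropy for each column of a multiple sequence alignment. It assumes the alignment is simply a list of equal-length strings; the toy sequences are invented, and this is not the actual FLAVIdB/FLUKB implementation.

import math
from collections import Counter

def column_entropy(column):
    # Shannon entropy H = -sum(p * log2(p)) over residue frequencies in one column
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def msa_entropy(alignment):
    # alignment: list of aligned, equal-length sequences (gaps kept as '-')
    length = len(alignment[0])
    return [column_entropy([seq[i] for seq in alignment]) for i in range(length)]

# toy alignment of three short fragments (invented)
msa = ["MKTIIALSY", "MKTIIALSY", "MKAIIALSF"]
print([round(h, 3) for h in msa_entropy(msa)])

Columns with a single residue return an entropy of 0, and more variable columns return higher values, which is the quantity plotted per alignment position by the variability tools.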

Figure 4: A screenshot of the result page generated by the color-coded MSA tool implemented in the FLAVIdB. The residues are color-coded by frequency: white (100%), cyan (second most frequent), yellow (third most frequent residues), gray (fourth most frequent residues), green (fifth most frequent residues), purple (sixth most frequent residues), and blue (everything less frequent than the sixth most frequent residues).

Figure 5: A screenshot of mutation map of tumor antigen epidermal growth factor receptor (EGFR) in TANTIGEN. The numbers are the amino acid positions in the antigen sequence and the top amino acid sequence is the reference sequence of EGFR. The highlighted amino acids in the reference sequence are positions where point mutations took place. Clicking on the amino acids below the point mutation positions links to the mutated sequence data table.
A workflow is an automated process that takes a request from the user,
performs complex analysis by combining data and tools preselected for
common questions, and produces a comprehensive report [30]. Module 6
(workflow for integrated analysis to answer meaningful questions) automates
the consecutive execution of multiple analysis steps, which researchers
usually would have to perform manually, to answer complex sequential
questions. Two workflow types, the summary workflow and the query
analyzer workflow, were implemented in FLAVIdB. Three workflow types,
the vaccine target workflow, the crossneutralization estimation workflow,
and B-cell epitope mapper workflow, were implemented in FLUKB. Module
7 (semiautomated update and maintenance of the databases) employs a
semiautomated approach to maintain and update the databases.
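To make the flow through Modules 1-3 more concrete, the following minimal Python sketch shows one way collected antigen records could be reorganized into a unified XML document before import into a central repository. The element names, accessions, and records are hypothetical placeholders and are not the published KB-builder schema.

import xml.etree.ElementTree as ET

def records_to_xml(records):
    # Wrap heterogeneous source records into one unified XML document.
    # Element and attribute names here are illustrative only.
    root = ET.Element("antigen_collection")
    for rec in records:
        antigen = ET.SubElement(root, "antigen", accession=rec["accession"])
        ET.SubElement(antigen, "organism").text = rec.get("organism", "unknown")
        ET.SubElement(antigen, "sequence").text = rec["sequence"]
    return ET.tostring(root, encoding="unicode")

# toy records standing in for entries collected from primary databases
records = [
    {"accession": "EX000001", "organism": "Human papillomavirus 16", "sequence": "MHQKRTAMFQ"},
    {"accession": "EX000002", "organism": "Dengue virus 2", "sequence": "MRCIGISNRD"},
]
print(records_to_xml(records))

In the real pipeline, the unified XML produced at this stage is what the cleaning, enrichment, and import modules operate on before the data reach the analysis toolsets.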

RESULTS AND DISCUSSION


Using the KB-builder, we built several immunovaccinology knowledgebases
including TANTIGEN: Tumor T-cell Antigen Database (http://cvc.dfci.
harvard.edu/tadb/), FLAVIdB: Flavivirus Antigen Database [31], HPVdB:
Human Papillomavirus T-cell Antigen Database [32], FLUKB: Flu Virus
Antigen Database (http://research4.dfci.harvard.edu/cvc/flukb/), Epstein-
Barr Virus T-cell Antigen Database (http://research4.dfci.harvard.edu/
cvc/ebv/), and Merkel Cell Polyomavirus Antigen Database (http://cvc.
dfci.harvard.edu/mcv/). These knowledgebases combine virus and tumor
antigenic data, specialized analysis tools, and workflow for automated
complex analyses focusing on applications in immunology and vaccinology.
The Human Papillomavirus T-cell Antigen Database (HPVdB) contains
2781 curated antigen entries of antigenic proteins derived from 18 genotypes
of high-risk HPV and 18 genotypes of low-risk HPV. It also catalogs 191
verified T-cell epitopes and 45 verified HLA ligands. The functions of the
data mining tools integrated in HPVdB include antigen and epitope/ligand
search, sequence comparison using BLAST search, multiple alignments of
antigens, classification of HPV types based on cancer risk, T-cell epitope
prediction, T-cell epitope/HLA ligand visualization, T-cell epitope/HLA
ligand conservation analysis, and sequence variability analysis.
HPV regulatory proteins E6 and E7 are often studied for immune-
based therapies as they are constitutively expressed in HPV-associated
cancer cells. First, the prediction of A*0201 binding peptides (both 9-mers
and 10-mers) of HPV16 E6 and E7 proteins was performed computationally.
Based on the prediction results, 21 peptides were synthesized and ten of
them were identified as binders using an A*0201 binding assay. The ten
A*0201-binding peptides were further tested for immune recognition in
peripheral blood mononuclear cells isolated from six A*0201-positive
healthy donors using interferon γ (IFN γ) ELISpot assay. Two peptides, E711–19 and E629–38, elicited spot-forming-unit numbers 4-5-fold over background
in one donor. Finally, mass spectrometry was used to validate that peptide
E711–19 is naturally presented on HPV16-transformed, A*0201-positive
cells. Using the peptide conservation analysis tool embedded in HPVdB,
we answered the question of how many HPV strains contain this epitope.
The epitope E711–19 is conserved in 16 of 17 (94.12% conserved) HPV16
E7 complete sequences (Figure 6). A single substitution mutation L15V in
HPV001854 (UniProt ID: C0KXQ5) resulted in the immune escape. Among
the 35 HPV16 cervical cancer samples we analyzed, only a single sample
contained the HPV001854 sequence variant. The conserved HPV T-cell
epitopes displayed by HPV transformed tumors such as E711–19 may be the
basis of a therapeutic T-cell based cancer vaccine. This example shows the
combination of bioinformatics analysis and experimental validation leading
to identification of suitable vaccine targets [33, 34].
Figure 6: A screenshot of the conservation analysis result page of T-cell epitope E711–19 in HPVdB.
Flaviviruses, such as dengue and West Nile viruses, are NIAID Category
A and B Priority Pathogens. We developed FLAVIdB that contains 12,858
entries of flavivirus antigen sequences, 184 verified T-cell epitopes, 201
verified B-cell epitopes, and 4 representative molecular structures of the
dengue virus envelope protein [31]. The data mining system integrated in
FLAVIdB includes tools for antigen and epitope/ligand search, sequence
comparison using BLAST search, multiple alignments of antigens, variability
and conservation analysis, T-cell epitope prediction, and characterization of
neutralizing components of B-cell epitopes. A workflow is an automated
process that takes a request from the user, performs complex analysis by
combining data and tools preselected for common questions, and produces a
comprehensive report to answer a specific research question. Two predefined
analysis workflow types, summary workflow and query analyzer workflow,
were implemented in FLAVIdB [31].
Broad coverage of the pathogen population is particularly important
when designing T-cell epitope vaccines against viral pathogens. Using
FLAVIdB we applied the block entropy analysis method to the proteomes
of the four serotypes of dengue virus (DENV) and found 1,551 blocks of
9-mer peptides, which cover 99% of available sequences with five or fewer
unique peptides [35]. Many of the blocks are located consecutively in the
proteins, so connecting these blocks resulted in 78 conserved regions which
can be covered with 457 subunit peptides. Of the 1551 blocks of 9-mer
peptides, 110 blocks consisted of peptides all predicted to bind to MHC with
similar affinity and the same HLA restriction. In total, we identified a pool
of 333 peptides as T-cell epitope candidates. This set could form the basis
for a broadly neutralizing dengue virus vaccine. The results of block entropy
analysis of dengue subtypes 1–4 from FLAVIdB are shown in Figure 7.

Figure 7: Block entropy analysis of envelope proteins of dengue subtypes 1–4 in the FLAVIdB. (a) A screenshot of the input page of block entropy analysis in the FLAVIdB. (b) The number of blocks needed to cover 99% of the sequence variation. The x-axis is the starting positions of blocks and the y-axis is the number of blocks required. The blocks with a gap fraction above 10% are not plotted.
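A minimal Python sketch of the block entropy idea follows: for each 9-mer window of an alignment it counts how many distinct peptides, taken from most to least frequent, are needed to cover 99% of the sequences. The toy alignment and the simplified gap handling are assumptions for illustration, not the FLAVIdB implementation.

from collections import Counter

def peptides_needed(alignment, start, k=9, coverage=0.99):
    # Count distinct k-mers (most frequent first) needed to cover the requested
    # fraction of sequences at one alignment start position.
    blocks = Counter(seq[start:start + k] for seq in alignment)
    total, covered, needed = len(alignment), 0, 0
    for _, count in blocks.most_common():
        needed += 1
        covered += count
        if covered / total >= coverage:
            break
    return needed

# toy alignment (invented); the real analysis runs over thousands of aligned sequences
msa = ["MKTIIALSYIFCLVFAD", "MKTIIALSYIFCLVFAD", "MKAILVVLLYTFATANA"]
print([peptides_needed(msa, s) for s in range(len(msa[0]) - 8)])

Positions where a small number of peptides already covers the required fraction of sequences correspond to the conserved blocks that become T-cell epitope vaccine candidates.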
Influenza virus is a NIAID Category C Priority Pathogen. We developed
the FLUKB that currently contains 302,272 influenza viral protein sequence
entries from 62,016 unique strains (57,274 type A, 4,470 type B, 180 type
C, and 92 unknown types) of influenza virus. It also catalogued 349 unique
T-cell epitopes, 708 unique MHC binding peptides, and 17 neutralizing
antibodies against hemagglutinin (HA) proteins along with their 3D
structures. The detailed information on the neutralizing antibodies such
as isolation information, experimentally validated neutralizing/escape
influenza strains, and B-cell epitopes on the 3D structures, is also provided.
Approximately 10% of B-cell epitopes are linear peptides, while 90% are
formed from discontinuous amino acids that create surface patches resulting
from 3D folding of proteins [36]. Characterization of an increasing number
of broadly neutralizing antibodies specific for pathogen surface proteins, the
growing number of known 3D structures of antigen-neutralizing antibody
complexes, and the rapid growth of the number of viral variant sequences
demand systematic bioinformatics analyses of B-cell epitopes and cross-
reactivity of neutralizing antibodies. We developed a generic method for the
assessment of neutralizing properties of monoclonal antibodies. Previously,
dengue virus was used to demonstrate a generalized method [27]. This
methodology has direct relevance to the characterization and the design of
broadly neutralizing vaccines.
Using the FLUKB, we employed the analytical methods to estimate cross-
reactivity of neutralizing antibodies (nAbs) against surface glycoprotein
HA of influenza virus strains, both newly emerging and existing ones
[26]. We developed a novel way of describing discontinuous motifs as
virtual peptides to represent B-cell epitopes and to estimate potential cross-
reactivity and neutralizing coverage of these epitopes. Strains labelled
as potentially cross-reactive are those that share 100% identity of B-cell
epitopes with experimentally verified neutralized strains. Two workflow
types were implemented in the FLUKB for cross-neutralization analysis:
cross-neutralization estimation workflow and B-cell epitope mapper
workflow.
The cross-neutralization estimation workflow estimates the cross-
neutralization coverage of a validated neutralizing antibody using all full-
length sequences of HA hosted in the FLUKB, or using full-length HA
sequences of a user-defined subset by restricting year ranges, subtypes, or
geographical locations. Firstly, a MSA is generated using the full-length
HA sequences. The resulting MSA provides a consistent alignment position
numbering scheme for the downstream analyses. Secondly, for each nAb, the
HA sequence from its 3D structure and from the experimentally validated
strains is used to search for a strain with the highest similarity in FLUKB
using BLAST. Thirdly, a B-cell epitope is identified from the validated
antigen-antibody structures based on the calculation of accessible surface
area and atom distance. Fourthly, using the MSA and the alignment position
numbering, the residue position of the B-cell epitope is mapped onto the HA
sequences of validated strains to get B-cell epitope motifs. Discontinuous
motifs are extracted from all the HA sequences in the MSA and compared
to the B-cell epitope motif. According to the comparison results, they are
classified to be either neutralizing if identical to a neutralizing discontinuous
motif, escape if identical to an escape discontinuous motif, or not validated if
no identical match was found. The cross-neutralization coverage estimation
of neutralizing antibody F10 on all HA sequences from FLUKB is shown
in Figure 8.
Figure 8: (a) Sequence logo of neutralizing epitopes by neutralizing antibody F10 on influenza virus HA protein. (b) BlockLogo of the discontinuous residues in F10 neutralizing epitope. (c) The structure of influenza A HA protein with neutralizing antibody F10 (PDB ID: 3FKU) and the conformational epitope shown in pink. (d) Discontinuous epitope on HA protein recognized by F10.
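The motif-comparison step of these workflows can be sketched in a few lines of Python: given the alignment positions of a known B-cell epitope, extract the discontinuous motif from an aligned HA sequence and label the strain by exact match against validated motif sets. The positions, motifs, and sequence below are invented placeholders, not data from FLUKB.

def discontinuous_motif(aligned_seq, positions):
    # Concatenate the residues found at the epitope's alignment positions (0-based).
    return "".join(aligned_seq[p] for p in positions)

def classify_strain(aligned_seq, positions, neutralizing_motifs, escape_motifs):
    motif = discontinuous_motif(aligned_seq, positions)
    if motif in neutralizing_motifs:
        return "neutralizing"
    if motif in escape_motifs:
        return "escape"
    return "not validated"

# invented epitope positions and motif sets for illustration
positions = [18, 38, 41, 45, 290, 292]
neutralizing_motifs = {"HVTKGE"}
escape_motifs = {"HVTKGQ"}
new_strain = "A" * 300  # stand-in for an aligned HA sequence
print(classify_strain(new_strain, positions, neutralizing_motifs, escape_motifs))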
For a newly emerged strain, the B-cell epitope mapper workflow
performs in silico prediction of its cross-neutralization based on existing
nAbs and provides preliminary results for the design of downstream
validation experiments. Firstly, a discontinuous peptide is extracted from its
HA sequence according to positions on each known B-cell epitope. Secondly,
sequence similarity comparison is conducted between the discontinuous
motifs and all known B-cell epitopes from experimentally validated strains.
The motifs identical to the known neutralized or escape B-cell epitope motifs
are proposed as neutralized or escape strains, respectively.
The cross-neutralization estimation workflow provides an overview of
cross-neutralization of existing neutralizing antibodies, while B-cell epitope
mapper workflow gives an estimation of possible neutralizing effect of new
viral strains using known neutralizing antibodies. This knowledge-based
approach improves our understanding of antibody/antigen interactions,
facilitates mapping of the known universe of target antigens, allows the
prediction of cross-reactivity, and speeds up the design of broadly protective
influenza vaccines.

CONCLUSIONS
Big data analytics applies advanced analytic methods to data sets that
are very large and complex and that include diverse data types. These
advanced analytics methods include predictive analytics, data mining, text
mining, integrated statistics, visualization, and summarization tools. The
data sets used in our case studies are complex and the analytics is achieved
through the definition of workflow. Data explosion in our case studies is
fueled by the combinatorial complexity of the domain and the disparate
data types. The cost of analysis and computation increases exponentially
as we combine various types of data to answer research questions. We use
the in silico identification of influenza T-cell epitopes restricted by HLA
class I variants as an example. There are 300,000 influenza sequences to
be analyzed for T-cell epitopes using MHC binding prediction tools based
on artificial neural networks or support vector machines [37–40]. Based on
the DNA typing for the entire US donor registry, there are 733 HLA-A, 921
HLA-B, and 429 HLA-C variants, a total of 2083 HLA variants, observed in
US population [41]. These alleles combine into more than 45,000 haplotypes
(combinations of HLA-A, -B, and -C) [41]. Each of these haplotypes has
different frequencies and distributions across different populations. The
in silico analysis of MHC class I restricted T-cell epitopes includes MHC
binding prediction of all overlapping peptides that are 9–11 amino acids
long. This task alone involves a systematic analysis of 300,000 sequences
that are on average 300 amino acids long. Therefore, the total number of
in silico predictions is approximately 300,000 × 300 × 3 × 2083 (number
of sequences times the average length of each sequence times 3 times the
number of observed HLA variants), or a total of 5.6 × 10¹¹ calculations.
Predictive models do not exist for all HLA alleles, so some analysis needs
to be performed by analysis of similarity of HLA molecules and grouping
them in clusters that share binding properties. For B-cell epitope analysis,
the situation is similar, except that the methods involve the analysis of 3D
structures of antibodies and the analysis of nearly 100,000 sequences of HA
and neuraminidase (NA) and their cross-comparison for each neutralizing
antibody. A rich set of visualization tools is needed to report population data
and distributions across populations. For vaccine studies, these data need to
be analyzed together with epidemiological data including transmissibility
and severity of influenza viruses [42]. These functional properties can be
assigned to each influenza strain and the analysis can be performed for their
epidemic and pandemic potential. These numbers indicate that the analytics
methods involve a large amount of calculations that cannot be performed
using brute force approaches.
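The back-of-the-envelope estimate above can be reproduced directly; the figures are simply the ones quoted in the text.

sequences = 300_000      # influenza protein sequences
avg_length = 300         # average sequence length (amino acids)
peptide_lengths = 3      # 9-, 10-, and 11-mer windows
hla_variants = 2083      # observed HLA-A, -B and -C variants
total = sequences * avg_length * peptide_lengths * hla_variants
print(f"{total:.2e}")    # 5.62e+11 binding predictions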
Immunological knowledge discovery relies heavily on both the
availability of accurate, up-to-date, and well-organized data and the proper
analytics tools. We propose the use of knowledge-based approaches by
developing knowledgebases combining well-annotated data with specialized
analytical tools and integrating them into analytical workflow. A set of
well-defined workflow types with rich summarization and visualization
capacity facilitates the transformation from data to critical information and
knowledge. By using KB-builder, we enabled streamlining of the normally time-consuming process of database development. The knowledgebases
built using KB-builder will speed up rational vaccine design by providing
accurate and well-annotated data coupled with tailored computational
analysis tools and workflow.
REFERENCES
1. J. Rowley, “The wisdom hierarchy: representations of the DIKW
hierarchy,” Journal of Information Science, vol. 33, no. 2, pp. 163–
180, 2007.
2. R. Ackoff, “From data to wisdom,” Journal of Applied Systems
Analysis, vol. 16, no. 1, pp. 3–9, 1989.
3. C. Janeway, Immunobiology: The Immune System in Health and
Disease, Garland Science, New York, NY, USA, 6th edition, 2005.
4. M. H. V. van Regenmortel, “What is a B-cell epitope?” Methods in
Molecular Biology, vol. 524, pp. 3–20, 2009.
5. S. C. Meuer, S. F. Schlossman, and E. L. Reinherz, “Clonal analysis
of human cytotoxic T lymphocytes: T4+ and T8+ effector T cells
recognize products of different major histocompatibility complex
regions,” Proceedings of the National Academy of Sciences of the
United States of America, vol. 79, no. 14 I, pp. 4395–4399, 1982.
6. J. H. Wang and E. L. Reinherz, “Structural basis of T cell recognition
of peptides bound to MHC molecules,” Molecular Immunology, vol.
38, no. 14, pp. 1039–1049, 2002.
7. R. Vita, L. Zarebski, J. A. Greenbaum et al., “The immune epitope
database 2.0,” Nucleic Acids Research, vol. 38, supplement 1, pp.
D854–D862, 2009.
8. J. Robinson, J. A. Halliwell, H. McWilliam, R. Lopez, P. Parham, and
S. G. E. Marsh, “The IMGT/HLA database,” Nucleic Acids Research,
vol. 41, no. 1, pp. D1222–D1227, 2013.
9. A. Sette and J. Sidney, “Nine major HLA class I supertypes account
for the vast preponderance of HLA-A and -B polymorphism,”
Immunogenetics, vol. 50, no. 3-4, pp. 201–212, 1999.
10. O. Lund, M. Nielsen, C. Kesmir et al., “Definition of supertypes for HLA
molecules using clustering of specificity matrices,” Immunogenetics,
vol. 55, no. 12, pp. 797–810, 2004.
11. R. Rappuoli, “Reverse vaccinology,” Current Opinion in Microbiology,
vol. 3, no. 5, pp. 445–450, 2000.
12. D. C. Koboldt, K. M. Steinberg, D. E. Larson, R. K. Wilson, and E. R.
Mardis, “The next-generation sequencing revolution and its impact on
genomics,” Cell, vol. 155, no. 1, pp. 27–38, 2013.
13. D. R. Zerbino, B. Paten, and D. Haussler, “Integrating genomes,” Science, vol. 336, no. 6078, pp. 179–182, 2012.
14. M. Uhlen and F. Ponten, “Antibody-based proteomics for human tissue
profiling,” Molecular and Cellular Proteomics, vol. 4, no. 4, pp. 384–
393, 2005.
15. V. Brusic, R. Gottardo, S. H. Kleinstein, and M. M. Davis,
“Computational resources for high-dimensional immune analysis from
the human immunology project consortium,” Nature Biotechnology,
vol. 32, no. 2, pp. 146–148, 2014.
16. A. Aderem, “Editorial overview: system immunology,” Seminars in
Immunology, vol. 25, no. 3, pp. 191–192, 2013.
17. S. Li, H. I. Nakaya, D. A. Kazmin, J. Z. Oh, and B. Pulendran, “Systems
biological approaches to measure and understand vaccine immunity in
humans,” Seminars in Immunology, vol. 25, no. 3, pp. 209–218, 2013.
18. L. Olsen, U. J. Kudahl, O. Winther, and V. Brusic, “Literature classification
for semi-automated updating of biological knowledgebases,” BMC
Genomics, vol. 14, supplement 5, article S14, 2013.
19. P. R. O. Payne, “Chapter 1: biomedical knowledge integration,” PLoS
Computational Biology, vol. 8, no. 12, Article ID e1002826, 2012.
20. S.-H. Liao, P.-H. Chu, and P.-Y. Hsiao, “Data mining techniques and
applications—a decade review from 2000 to 2011,” Expert Systems
with Applications, vol. 39, no. 12, pp. 11303–11311, 2012.
21. N. Vigneron, V. Stroobant, B. J. van den Eynde, and P. van der Bruggen,
“Database of T cell-defined human tumor antigens: the 2013 update,”
Cancer Immunity, vol. 13, article 15, 2013.
22. B. J. van den Eynde and P. van der Bruggen, “T cell defined tumor
antigens,” Current Opinion in Immunology, vol. 9, no. 5, pp. 684–693,
1997.
23. B. Peters, J. Sidney, P. Bourne et al., “The design and implementation of
the immune epitope database and analysis resource,” Immunogenetics,
vol. 57, no. 5, pp. 326–336, 2005.
24. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman,
“Basic local alignment search tool,” Journal of Molecular Biology,
vol. 215, no. 3, pp. 403–410, 1990.
25. K. Katoh, K. Misawa, K. Kuma, and T. Miyata, “MAFFT: a novel
method for rapid multiple sequence alignment based on fast Fourier
transform,” Nucleic Acids Research, vol. 30, no. 14, pp. 3059–3066,
2002.
26. J. Sun, U. J. Kudahl, C. Simon, Z. Cao, E. L. Reinherz, and V.
Brusic, “Large-scale analysis of B-cell epitopes on influenza virus
hemagglutinin—implications for cross-reactivity of neutralizing
antibodies,” Frontiers in Immunology, vol. 5, article 38, 2014.
27. J. Sun, G. L. Zhang, L. R. Olsen, E. L. Reinherz, and V. Brusic,
“Landscape of neutralizing assessment of monoclonal antibodies
against dengue virus,” in Proceedings of the International Conference
on Bioinformatics, Computational Biology and Biomedical Informatics
(BCB ‘13), p. 836, Washington, DC, USA, 2013.
28. G. E. Crooks, G. Hon, J. Chandonia, and S. E. Brenner, “WebLogo: a
sequence logo generator,” Genome Research, vol. 14, no. 6, pp. 1188–
1190, 2004.
29. L. R. Olsen, U. J. Kudahl, C. Simon et al., “BlockLogo: visualization of
peptide and sequence motif conservation,” Journal of Immunological
Methods, vol. 400-401, pp. 37–44, 2013.
30. J. Söllner, A. Heinzel, G. Summer et al., “Concept and application of
a computational vaccinology workflow,” Immunome Research, vol. 6,
supplement 2, article S7, 2010.
31. L. R. Olsen, G. L. Zhang, E. L. Reinherz, and V. Brusic, “FLAVIdB: a
data mining system for knowledge discovery in flaviviruses with direct
applications in immunology and vaccinology,” Immunome Research,
vol. 7, no. 3, pp. 1–9, 2011.
32. G. L. Zhang, A. B. Riemer, D. B. Keskin, L. Chitkushev, E. L. Reinherz,
and V. Brusic, “HPVdb: a data mining system for knowledge discovery
in human papillomavirus with applications in T cell immunology and
vaccinology,” Database, vol. 2014, Article ID bau031, 2014.
33. A. B. Riemer, D. B. Keskin, G. Zhang et al., “A conserved E7-derived
cytotoxic T lymphocyte epitope expressed on human papillomavirus
16-transformed HLA-A2+ epithelial cancers,” Journal of Biological
Chemistry, vol. 285, no. 38, pp. 29608–29622, 2010.
34. D. B. Keskin, B. Reinhold, S. Lee et al., “Direct identification of
an HPV-16 tumor antigen from cervical cancer biopsy specimens,”
Frontiers in Immunology, vol. 2, article 75, 2011.
35. L. R. Olsen, G. L. Zhang, D. B. Keskin, E. L. Reinherz, and V. Brusic,
“Conservation analysis of dengue virus T-cell epitope-based vaccine candidates using peptide block entropy,” Frontiers in Immunology,
vol. 2, article 69, 2011.
36. J. Huang and W. Honda, “CED: a conformational epitope database,”
BMC Immunology, vol. 7, article 7, 2006.
37. E. Karosiene, M. Rasmussen, T. Blicher, O. Lund, S. Buus, and M.
Nielsen, “NetMHCIIpan-3.0, a common pan-specific MHC class II
prediction method including all three human MHC class II isotypes,
HLA-DR, HLA-DP and HLA-DQ,” Immunogenetics, vol. 65, no. 10,
pp. 711–724, 2013.
38. I. Hoof, B. Peters, J. Sidney et al., “NetMHCpan, a method for MHC
class I binding prediction beyond humans,” Immunogenetics, vol. 61,
no. 1, pp. 1–13, 2009.
39. G. L. Zhang, I. Bozic, C. K. Kwoh, J. T. August, and V. Brusic,
“Prediction of supertype-specific HLA class I binding peptides using
support vector machines,” Journal of Immunological Methods, vol.
320, no. 1-2, pp. 143–154, 2007.
40. G. L. Zhang, A. M. Khan, K. N. Srinivasan, J. T. August, and V.
Brusic, “Neural models for predicting viral vaccine targets,” Journal
of Bioinformatics and Computational Biology, vol. 3, no. 5, pp. 1207–
1225, 2005.
41. L. Gragert, A. Madbouly, J. Freeman, and M. Maiers, “Six-locus high
resolution HLA haplotype frequencies derived from mixed-resolution
DNA typing for the entire US donor registry,” Human Immunology,
vol. 74, no. 10, pp. 1313–1320, 2013.
42. C. Reed, M. Biggerstaff, L. Finelli et al., “Novel framework for
assessing epidemiologic effects of influenza epidemics and pandemics,”
Emerging Infectious Diseases, vol. 19, no. 1, pp. 85–91, 2013.
SECTION 2: BIG DATA METHODS
Chapter 6

Integrated Real-Time Big Data Stream Sentiment Analysis Service

Sun Sunnie Chung, Danielle Aring


Department of Electrical Engineering and Computer Science, Cleveland State Univer-
sity, Cleveland, USA

ABSTRACT
Opinion (sentiment) analysis on big data streams from the constantly
generated text streams on social media networks to hundreds of millions
of online consumer reviews provides many organizations in every field
with opportunities to discover valuable intelligence from the massive user
generated text streams. However, the traditional content analysis frameworks
are inefficient to handle the unprecedentedly big volume of unstructured
text streams and the complexity of text analysis tasks for the real time
opinion analysis on the big data streams. In this paper, we propose a parallel
real time sentiment analysis system: Social Media Data Stream Sentiment
Analysis Service (SMDSSAS) that performs multiple phases of sentiment
analysis of social media text streams effectively in real time with two fully
analytic opinion mining models to combat the scale of text data streams

Citation: Chung, S. and Aring, D. (2018), “Integrated Real-Time Big Data Stream Sen-
timent Analysis Service”. Journal of Data Analysis and Information Processing, 6, 46-
66. doi: 10.4236/jdaip.2018.62004.
Copyright: © 2018 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0
and the complexity of sentiment analysis processing on unstructured text
streams. We propose two aspect based opinion mining models: Deterministic
and Probabilistic sentiment models for a real time sentiment analysis on
the user given topic related data streams. Experiments on the social media
Twitter stream traffic captured during the pre-election weeks of the 2016
Presidential election for real-time analysis of public opinions toward two
presidential candidates showed that the proposed system was able to predict
correctly Donald Trump as the winner of the 2016 Presidential election. The
cross validation results showed that the proposed sentiment models with
the real-time streaming components in our proposed framework delivered
effectively the analysis of the opinions on two presidential candidates
with average 81% accuracy for the Deterministic model and 80% for the
Probabilistic model, which are 1% - 22% improvements from the results of
the existing literature.

Keywords: Sentiment Analysis, Real-Time Text Analysis, Opinion Analysis, Big Data Analytics

INTRODUCTION
In the era of the web based social media, user-generated contents in “any”
form of user created content including: blogs, wikis, forums, posts, chats,
tweets, or podcasts have become the norm of media to express people’s
opinion. The amounts of data generated by individuals, businesses,
government, and research agents have undergone exponential growth. Social
networking giants such as Facebook and Twitter had 1.86 and 0.7 billion
active users as of Feb. 2018. The user-generated texts are valuable resources
to discover useful intelligence to help people in any field to make critical
decisions. Twitter has become an important platform of user generated text
streams where people express their opinions and views on new events,
new products or news. Such new events or news from announcing political
parties and candidates for elections to a popular new product release are
often followed almost instantly by a burst in Twitter volume, providing a
unique opportunity to measure the relationship between expressed public
sentiment and the new events or the new products.
Sentiment analysis can help explore how these events affect public
opinion or how public opinion affects future sales of these new products.
While traditional content analysis takes days or weeks to complete, opinion
analysis of such streaming of large amounts of user-generated text has
commanded research and development of a new generation of analytics
methods and tools to process them in real-time or near-real time effectively.
Big data is often defined with the three characteristics: volume, velocity
and variety [1] [2], because it consists of constantly generated, massive data sets with large, varied and complex structures that are often unstructured (e.g., tweet text). These three characteristics of big data imply
difficulties of storing, analyzing and visualizing for further processes and
results with traditional data analysis systems. Common problems of big
data analytics are, firstly, that traditional data analysis systems cannot reliably handle the volume of data at an acceptable processing rate. Secondly, big
data processing commonly requires complex data processing in multi phases
of data cleaning, preprocessing, and transformation since data is available
in many different formats, either semi-structured or unstructured. Lastly,
big data is constantly generated at high speed, so none of the traditional data preprocessing architectures is suitable for processing it efficiently in real time or near real time.
Two common approaches to process big data are batch-mode big data
analytics and streaming-based big data analytics. Batch processing is an
efficient way to process high volumes of data where a group of transactions
is collected over time [3] . Frameworks that are based on a parallel and
distributed system architecture such as Apache Hadoop with MapReduce
currently dominate batch mode big data analytics. This type of big data
processing addresses the volume and variety components of big data analytics
but not velocity. In contrast, stream processing is a model that computes
a small window of recent data at one time [3] . This makes computation
real time or near-real time. In order to meet the demands of the real-time
constraints, the stream-processing model must be able to calculate statistical
analytics on the fly, since streaming data like user generated content in the
form of repeated online user interactions is continuously arriving at high
speed [3] .
This notable “high velocity” on arrival characteristic of the big data
stream means that corresponding big data analytics should be able to
process the stream in a single pass under strict constraints of time and space.
Most of the existing works that leverage the distributed parallel systems
to analyze big social media data in real-time or near real-time perform
mostly statistical analysis in real time with pre-computed data warehouse
aggregations [4] [5] or a simple frequency-based sentiment analysis model [6]. More sophisticated sentiment analyses on the streaming data are mostly the
MapReduce based batch mode analytics. While it is common to find batch
mode data processing works for the sophisticated sentiment analysis on
social media data, there are only a few works that propose the systems that
perform complex real time sentiment analysis on big data streams [7] [8] [9]
and little work shows that such proposed systems have been implemented and tested with real time data streams.
Sentiment Analysis otherwise known as opinion mining commonly
refers to the use of natural language processing (NLP) and text analysis
techniques to extract, and quantify subjective information in a text span [10]
. NLP is a critical component in extracting useful viewpoints from streaming
data [10] . Supervised classifiers are then utilized to predict from labeled
training sets. The polarity (positive or negative opinion) of a sentence is
measured with scoring algorithms to measure a polarity level of the opinion
in a sentence. The most established NLP method to capture the essential
meaning of a document is the bag of words (or bag of n-gram) representations
[11] . Latent Dirichlet Allocation (LDA) [12] is another widely adopted
representation. However, both representations have limitations to capture
the semantic relatedness (context) between words in a sentence and suffer
from the problems such as polysemy and synonymy [13] .
A recent paradigm in NLP, unsupervised text embedding methods, such
as Skip-gram [14] [15] and Paragraph Vector [16] [17] to use a distributed
representation for words [14] [15] and documents [16] [17] are shown to be
effective and scalable to capture the semantic and syntactic relationships,
such as polysemy and synonymy, between words and documents. The
essential idea of these approaches comes from the distributional hypothesis
that a word is represented by its neighboring (context) words in that you
shall know a word by the company it keeps [18] . Le and Mikolov [16] [17]
show that their method, Paragraph Vectors, can be used in classifying movie
reviews or clustering web pages. We employed the pre-trained network with
the paragraph vector model [19] for our system for preprocessing to identify
n-grams and synonymy in our data sets.
An advanced sentiment analysis beyond polarity is the aspect based
opinion mining that looks at other factors (aspects) to determine sentiment
polarity such as “feelings of happiness, sadness, or anger”. An example of
the aspect oriented opinion mining is classifying movie reviews based on
a thumbs up or downs as seen in the 2004 paper and many other papers by
Pang and Lee [10] [20] . Another technique is the lexical approach to opinion
mining developed famously by Taboada et al. in their SO-CAL calculator
[21] . The system calculated semantic orientation, i.e. subjectivity, of a
word in the text by capturing the strength and potency to which a word
was oriented either positively or negatively towards a given topic, using
advanced techniques like amplifiers and polarity shift calculations.
The single most important information need in a sentiment analysis is to identify opinions and perspectives on a particular topic, otherwise known as topic-based opinion mining [22]. Topic-based opinion
mining seeks to extract personal viewpoints and emotions surrounding
social or political events by semantically orienting user-generated content
that has been correlated by topic word(s) [22] .
Despite the success of these sophisticated sentiment analysis methods,
little is known about whether they may be scalable to apply in the multi-
phased opinion analysis process to a huge text stream of user generated
expressions in real time. In this paper, we examined whether a stream-
processing big data social media sentiment analysis service can offer
scalability in processing these multi-phased top of the art sentiment analysis
methods, while offering efficient near-real time data processing of enormous
data volume. This paper also explores the methodologies of opinion analysis
of social network data. To summarize, we make the following contributions:
• We propose a fully integrated, real time text analysis framework
that performs complex multi-phase sentiment analysis on massive
text streams: Social Media Data Stream Sentiment Analysis
Service (SMDSSAS).
• We propose two sentiment models that are combined models of
topic, lexicon and aspect based sentiment analysis that can be
applied to a real-time big data stream in cooperation with the
most recent natural language processing (NLP) techniques:
• Deterministic Topic Model that accurately measures user
sentiments in the subjectivity and the context of user provided
topic word(s).
• Probabilistic Topic Model that effectively identifies polarity of
sentiments per topic correlated messages over the entire data
streams.
• We fully experimented on the popular social media Twitter
message streams captured during the pre-election weeks of the
2016 Presidential Election to test the accuracy of our two proposed
sentiment models and the performance of our proposed system
SMDSSAS for the real time sentiment analysis. The results show
that our framework can be a good alternative for an efficient and
scalable tool to extract, transform, score and analyze opinions for
the user generated big social media text streams in real time.

RELATED WORKS
Many existing works in the related literature concentrate on topic-based
opining mining models. In topic-based opinion mining, sentiment is
estimated from the messages related to a chosen topic of interest such that
topic and sentiment are jointly inferred [22] . There are many works on
the topic based sentiment analysis where the models are tested on a batch
method as listed in the reference Section. While there are many works in
the topic based models for batch processing systems, there are few works
in the literature on topic-based models for real time sentiment analysis on
streaming data. Real-time topic sentiment analysis is imperative to meet
the strict time and space constraints to efficiently process streaming data
[6] . Wang et al. in the paper [6] developed a system for Real-Time Twitter
Sentiment Analysis of the 2012 Presidential Election Cycle using the Twitter
firehose with a statistical sentiment model and a Naive Bayes classifier on
unigram features. A full suite of analytics were developed for monitoring the
shift in sentiment utilizing expert curated rules and keywords in order to gain
an accurate picture of the online political landscape in real time. However,
these works in the existing literature lacked the complexity of sentiment
analysis processes. Their sentiment analysis model for their system is based
on simple aggregations for statistical summary with a minimum primitive
language preprocessing technique.
More recent research [23] [24] has proposed big data stream processing architectures. The first work, in 2015 [23], proposed a multi-layered Storm-based approach for the application of sentiment analysis on big data
streams in real time and the second work in 2016 [24] proposed a big data
analytics framework (ASMF) to analyze consumer sentiments embedded in
hundreds of millions of online product reviews. Both approaches leverage
probabilistic language models by either mimicking “document relevance”:
with probability of the document generating a user provided query term
found within the sentiment lexicon [23] or by adapting a classical language
modeling framework to enhance the prediction of consumer sentiments
[24]. However, the major limitation of these works is that neither of the proposed frameworks has been implemented and tested in an empirical setting or in real time.
ARCHITECTURE OF BIG DATA STREAM ANALYTICS FRAMEWORK
In this Section, we describe the architecture of our proposed big data
analytics framework that is illustrated in Figure 1. Our sentiment analysis
service, namely Social Media Data Stream Sentiment Analysis Service
(SMDSSAS) consists of six layers―Data Storage/Extraction Layer, Data
Stream Layer, Data Preprocessing and Transformation Layer, Feature
Extraction Layer, Prediction Layer, and Presentation Layer. For these
layers, we employed well-proven methods and tools for real time parallel
distributed data processing. For the real time data analytics component,
SMDSSAS leverages the Apache Spark [1] [7] and a NoSQL Hive [25] big
data ecosystem, which allows us to develop a streamlined pipeline with natural language processing techniques for fully integrated, complex multiphase sentiment analysis that stores, processes and analyzes user generated content from the Twitter streaming API.

Figure 1: Architecture of social media data stream sentiment analysis service (SMDSSAS).
The first layer is Data Storage/Extraction Layer for extraction of user tweet fields from the Twitter Stream that are to be converted to topic filtered
DStreams through Spark in the next Data Stream Layer. A DStream, the basic abstraction in Spark Streaming, is an in-memory, continuous sequence of Resilient Distributed Datasets (RDDs of the same type) that represents a continuous stream of data. The extracted
Tweet messages are archived in Hive’s data warehouse store via Cloudera’s
interactive web based analytics tool Hue and direct streaming into HDFS.
The second layer, the Data Stream Layer, processes the extracted live streaming of user-generated raw text of Twitter messages into Spark contexts and DStreams. This layer is bidirectional with both the Data Storage/Extraction Layer and the next layer, the Data Preprocessing and Transformation Layer.
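A minimal PySpark Streaming sketch of this layer is shown below. It substitutes a local socket of JSON-encoded tweets for the Twitter streaming connector used by the service, and the topic keywords, host, and port are assumptions made for illustration only.

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one to receive the stream, one to process it.
sc = SparkContext("local[2]", appName="SMDSSAS-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Stand-in source: one JSON-encoded tweet per line on a local socket.
lines = ssc.socketTextStream("localhost", 9999)

TOPICS = {"trump", "clinton"}  # hypothetical user-provided topic words

def topic_related(line):
    try:
        text = json.loads(line).get("text", "").lower()
    except ValueError:
        return False
    return any(topic in text for topic in TOPICS)

# Topic-filtered DStream handed on to the preprocessing/transformation layer.
topic_stream = lines.filter(topic_related)
topic_stream.pprint()

ssc.start()
ssc.awaitTermination()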
The third layer, the Data Preprocessing and Transformation Layer, is in charge of building relationships in the English natural language and cleaning raw Twitter message text with functions that remove both control characters sensitive to the Hive data warehouse scanner and non-alphanumeric characters. We employ natural language processing techniques in the Data Preprocessing layer with the pre-trained network of the paragraph vector model [16] [17]. This layer can also employ the Stanford Dependency
Parser [26] and Named Entity Recognizer [27] to build an additional
pipelining of Dependency, Tokenizer, Sentence Splitting, POS tagging
and Semantic tagging to build more sophisticated syntax relationships in
the Data Preprocessing stage. The transformation component of this layer preprocesses, in real time, the streaming text from JSON into CSV-formatted
Twitter statuses for Hive table inserts with Hive DDL. The layer is also in
charge of removing ambiguity of a word that is determined with pre-defined
word corpuses for the sentiment scoring process later.
The fourth layer, the Feature Extraction Layer, comprises a topic-based feature extraction function for our Deterministic and Probabilistic sentiment models. The topic-based feature extraction method employs the Opinion Finder Subjectivity Lexicon [28] for identification and extraction of sentiment based on the related topics of the user Twitter messages.
The fifth layer of our framework, the Prediction Layer, uses our two topic- and lexicon-based sentiment models, Deterministic and Probabilistic, for sentiment analysis. The accuracy of each model was measured using
the supervised classifier Multinomial Naive Bayes to test the capability of
each model for correctly identifying and correlating users’ sentiments on the
topics related data streams with a given topic (event).
Our sixth and final layer is the Presentation Layer, which consists of a web-based user interface.

SENTIMENT MODEL
Extracting useful viewpoints (aspects) in context and subjectivity from
streaming data is a critical task for sentiment analysis. Classical approaches
of sentiment analysis have their own limitations in identifying accurate
contexts, for instance, for the lexicon-based methods; common sentiment
lexicons may not be able to detect the context-sensitive nature of opinion
expressions. For example, while the term “small” may have a negative
polarity in a mobile phone review that refers to a “small” screen size,
the same term could have a positive polarity such as “a small and handy
notebook” in consumer reviews about computers. In fact, the token “small”
is defined as a negative opinion word in the well-known sentiment lexicon
list Opinion-Finder [28] .
The sentiment models developed for SMDSSAS are based on the aspect
model [29]. The aspect-based opinion mining techniques identify and extract personal opinions and emotions surrounding social or political events by capturing semantically oriented content, in subjectivity and context, that is correlated by aspects, i.e., topic words. The design of our
sentiment model was based on the assumption that positive and negative
opinions could be estimated per a context of a given topic [22] . Therefore,
in generating data for model training and testing, we employed a topic-based
approach to perform sentiment annotation and quantification on related user
tweets.
The aspect model is a core of probabilistic latent semantic analysis
in the probabilistic language model for general co-occurrence data which
associates a class (topic) variable t∈T={t1,t2,⋯,tk} with each occurrence of a
word w∈W={w1,w2,⋯,wm} in a document d∈D={d1,d2,⋯,dn} . The Aspect
model is a joint probability model that can be defined as selecting a document d with probability P(d), picking a latent class (topic) t with probability P(t|d), and generating a word (token) w with probability P(w|t).
As a result one obtains an observed pair (d,w), while the latent class
variable t is discarded. Translating this process into a joint probability
model results in the expression as follow:

P(d,w) = P(d) P(w|d) (1)
where
P(w|d) = ∑t∈T P(w|t) P(t|d) (2)
Essentially, to derive (2) one has to sum over the possible choices of t
that could have generated the observation.
The aspect model is based on two independence assumptions: First,
any pair (d,w) is assumed to occur independently; this essentially corresponds to the bag-of-words (or bag of n-grams) approach. Secondly, the conditional independence assumption is made that, conditioned on the latent class t, words w occur independently of the specific document identity di. Given that the number of class states is smaller than the number of documents (K ≪ N), t acts as a bottleneck variable in predicting w
conditioned on d.
Following the likelihood principle, P(d), P(t|d), and P(w|t) can be
determined by maximization of the log-likelihood function
L = ∑d∈D ∑w∈W n(d,w) log P(d,w) (3)
where n(d,w) denotes the term frequency, i.e., the number of times w occurred
in d. An equivalent symmetric version of the model can be obtained by
inverting the conditional probability P(t|d) with the Bayes’ theorem, which
results in
P(d,w) = ∑t∈T P(t) P(d|t) P(w|t) (4)
In the Information Retrieval context, this Aspect model is used to
estimate the probability that a document d is related to a query q [2] .
Such a probabilistic inference is used to derive a weighted vector in Vector
Space Model (VSM) where a document d contains a user given query q [2]
where q is a phrase or a sentence that is a set of classes (topic words) as
d∩q=T={t1,t2,⋯,tk}.

wt,d = tf.idft,d = tft,d × idft (5)
where tf.idft,d is defined as the term weight wt,d of a topic word t, with tft,d being the term frequency with which topic word t occurs in document d, and idft being the inverted document frequency defined below with dft the number of documents that contain t. N is the total number of documents.
idft = log(N / dft) (6)
Then d and q are represented with the weighted vectors for the common
terms. score(q,d) can be derived using the cosine similarity function to
capture the concept of document “relevance” of d with respect to q in the context of the topic words in q. Then the cosine similarity function is defined as the score function with the length-normalized weighted vectors of q and d as follows:

score(q,d) = (∑t∈T wt,q · wt,d) / (√(∑t∈T wt,q²) · √(∑t∈T wt,d²)) (7)
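For illustration, a compact Python version of this tf-idf weighted cosine scoring might look as follows; the weighting scheme (raw term frequency with natural-log idf) and the toy corpus are assumptions for the sketch, not the exact formulation used by the authors.

import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs, vocab):
    # raw term frequency times log inverse document frequency, per vocabulary term
    tf = Counter(tokens)
    return [tf[t] * math.log(n_docs / df[t]) for t in vocab]

def score(q_vec, d_vec):
    # cosine similarity of the two weighted vectors
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in d_vec))
    return dot / norm if norm else 0.0

docs = [["trump", "rally", "crowd"], ["clinton", "rally", "speech"], ["debate", "trump", "clinton"]]
df = Counter(t for d in docs for t in set(d))
vocab = sorted(df)
q_vec = tfidf_vector(["trump", "rally"], df, len(docs), vocab)
print([round(score(q_vec, tfidf_vector(d, df, len(docs), vocab)), 3) for d in docs])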

Context Identification
We derive a topic set T(q) by generating a set of all the related topic words
from a user given query (topics) q={t1,t2,⋯,tk} where q is a set of tokens. For
each token ti in q, we derive the related topic words to add to the topic set
T(q) based on the related language semantics R(ti) as follow.

T(q) = ⋃ti∈q R(ti), R(ti) = {ti.* | *.ti} ∪ {ti_tj} ∪ label_synonym(ti) (8)
where ti, tj ∈ T. ti.*|*.ti denotes any word concatenated with ti, ti_tj denotes a bi-gram of ti and tj, and label_synonym(ti) is the set of the labeled synonyms of ti in the dictionary identified in WordNet [23]. For context identification, we can choose to employ the pre-trained network with the paragraph vector model [16] [17] in our system for preprocessing. The paragraph
vector model is more robust in identifying synonyms of a new word that is
not in the dictionary.
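A simplified sketch of this topic-set expansion, using NLTK's WordNet interface for the labeled-synonym component, is given below. It requires the WordNet corpus to be downloaded once, and the corpus tokens and the naive handling of concatenations and bi-grams are stand-ins for the full procedure, not the authors' implementation.

from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

def expand_topic(token, corpus_tokens):
    related = {token}
    # label_synonym(ti): labeled synonyms from WordNet
    for syn in wn.synsets(token):
        related.update(lemma.lower() for lemma in syn.lemma_names())
    # ti.* | *.ti: corpus words that contain the token (hashtags, handles, compounds)
    related.update(w for w in corpus_tokens if token in w and w != token)
    # ti_tj: bi-grams observed around the token in the corpus
    related.update(f"{a}_{b}" for a, b in zip(corpus_tokens, corpus_tokens[1:]) if token in (a, b))
    return related

tweets = "hillary clinton email scandal #clinton2016 clinton campaign".split()
print(expand_topic("clinton", tweets))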

Measure of Subjectivity in Sentiment: CMSE and CSOM


The design of our experiments for each model was intended to capture social media Twitter data streams surrounding a special social or political event, so we targeted data streams captured during two special events―the
2016 US Presidential election and the 2017 Inauguration. The real time user
tweet streams were collected from Oct. 2016 to Jan. 2017. The time frames
chosen are a pre-election time of the October 23rd week and the pre-election
week of November 5th, as well as a pre-inauguration time the first week of
January 2017.
We define the document-level polarity sentiment(di) with a simple
polarity function that simply counts the number of positive and negative
words in a document (a twitter message) to determine an initial sentiment
measure sentiment(di) and the sentiment label sentimenti for each document
di as follow:
sentiment(di) = FreqPos(di) − FreqNeg(di) = ∑k=1..m (Pos(wk) + Neg(wk)) (9)
where di is a document (message) in a tweet stream D of a given topic set T
with 1 ≤ i < n and di={w1,⋯,wm} , m is the number of words in di. Pos(wk)
= 1 if wk is a positive word and Neg(wk) = −1 if wk is a negative word.
sentiment(di) is the difference between the frequency of the positive words
denoted as FreqPos(di) and the frequency of negative words denoted as
FreqNeg(di) in di indicating an initial opinion polarity measure with −m ≤
sentiment(di) ≤ m and a sentiment label of di sentimenti = 1 for positive if
sentiment(di) ≥ 1, 0 if sentiment(di) = 0 for neutral, and −1 for negative if
sentiment(di) ≤ −1.
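The polarity count above translates directly into Python; the tiny positive and negative word sets below are stand-ins for the sentiment lexicon actually used.

POSITIVE = {"great", "win", "strong"}    # stand-ins for the lexicon's positive words
NEGATIVE = {"crooked", "lose", "weak"}   # stand-ins for the lexicon's negative words

def sentiment_and_label(tokens):
    # sentiment(di) = FreqPos(di) - FreqNeg(di); the label is its sign
    score = sum(1 for w in tokens if w in POSITIVE) - sum(1 for w in tokens if w in NEGATIVE)
    label = 1 if score >= 1 else (-1 if score <= -1 else 0)
    return score, label

print(sentiment_and_label("great rally strong crowd".split()))  # (2, 1)
print(sentiment_and_label("weak campaign will lose".split()))   # (-2, -1)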
Then, we define w(di) a weight for a sentiment orientation for di
to measure a subjectivity of sentiment orientation of a document, then a
weighted sentiment measure for di senti_score(di) is defined with w(di) and
sentimenti the sentiment label of di as a score of sentiment of di as follow:

w(di) = sentiment(di) / m (10)

senti_score(di) = sentimenti + α · w(di) (11)
where −1 ≤ w(di) ≤ 1, and α is a control parameter for learning. When α =
0, senti_score(di) = sentimenti. senti_score(di) gives more weight toward a
short message with strong sentiment orientation. w(di) = 0 for neutral documents.
Class Max Sentiment Extraction (CMSE): To test the performance of our
models and to predict the outcomes of events such as the 2016 Presidential
election from the extracted user opinions embedded in tweet streams, we
quantify the level of the sentiment in the data set with Class Max Sentiment
Extraction (CMSE) to generate statistically relevant absolute sentiment
values to measure an overall sentiment orientation of a data set for a given
topic set for each sentiment polarity class to compare among different data
sets. To quantify a sentiment of a data set D of a given topic set T, we define
CMSE(D(T)) as follow.
For a given Topic set T, for each di∈D(T) where di contains at least one
of the topic words of interest in T in a given tweet stream D, CMSE(D(T))
returns a weighted sum of senti_score(di) of the data set D on T as follow:

CMSEpos(D(T)) = ∑di∈D(T), sentimenti=1 senti_score(di) (12)

CMSEneg(D(T)) = ∑di∈D(T), sentimenti=−1 senti_score(di) (13)

CMSEneu(D(T)) = ∑di∈D(T), sentimenti=0 senti_score(di) (14)
where 1 ≤ i < n and D(T)={d1,⋯,dn} , n is the number of documents in D(T).
CMSE is to measure the maximum sentiment orientation values of each
polarity class for a given topic correlated data set D(T). It is a sum of the
weighted document sentiment scores for each sentiment class―positively
labeled di, negatively labeled di, and neutrally labeled di respectively in a
given data set D(T) for a user given topic word set T. CMSE is the same as
an aggregated count of sentimenti when α = 0.
CMSE is an indicator of how strongly positive or negative the sentiment
is in a data set for a given topic word set T where D(T) is a set of documents
(messages) in a tweet stream where each document di∈D(T) 1≤ i ≤ n, contains
at least one of the topic words tj∈T={t1,⋯,tk} with 1 ≤ j ≤ k and T is a set
of all the related topic words derived from a user given query q as a seed to
generate T. Tj, which is a subset of T, is a set of topic words that is derived
from a given topic tj∈T . D(Tj), a subset of D(T), is a set of documents where
each document di contains at least one of the related topic words in a topic
set Tj. Every topic word set is derived by the Context Identifier described
in Section 4.1. With the Donald Trump and Hillary Clinton example, three
topic-correlated data sets are denoted as below.
D(Tj) is a set of documents with a topic word set Tj derived from {Donald
Trump|Hillary Clinton}.
D(TRj) is a set of documents, a subset of D(Tj), with a topic word set TRj
derived from {Donald Trump}.
D(HCj) is a set of documents, a subset of D(Tj), with a topic word set
HCj derived from{Hillary Clinton}.
where m, are the number of document di in D(TRj) and D(HCj)
respectively. For example, CMSEpos(D(TRj)) is the maximum positive
opinion measure in the tweet set that are talking about the candidate Donald
Trump.
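As a small illustration, the class-wise CMSE aggregation over a topic-correlated set can be sketched as follows, assuming each tweet has already been given a senti_score and a sentiment label; the scores below are invented.

def cmse(scored_docs):
    # scored_docs: (senti_score, sentiment label) pairs for one topic-correlated set D(Tj)
    totals = {1: 0.0, 0: 0.0, -1: 0.0}
    for score, label in scored_docs:
        totals[label] += score
    return totals  # positive, neutral and negative class totals

trump_set = [(1.4, 1), (2.0, 1), (-1.2, -1), (0.0, 0)]  # invented scores for D(TRj)
print(cmse(trump_set))  # {1: 3.4, 0: 0.0, -1: -1.2}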
CSOM (Class Sentiment Orientation Measure): CSOM measures a
relative ratio of the level of the positive and negative sentiment orientation
for a given topic correlated data set over the entire dataset of interest.
For CSOM, we define two relative opinion measures: Semantic
Orientation (SMO) and Sentiment Orientation (STO) to quantify a polarity
for a given data set correlated with a topic set Tj. SMO indicates a relative
polarity ratio between two polarity classes within a given topic data set. STO
indicates a ratio of the polarity of a given topic set over an entire data set.
With our Trump and Hillary example from the 2016 Presidential Election
event, the positive SMO for the data set D(TRj) with the topic word “Donald
Trump” and the negative SMO for the Hillary Clinton topic set D(HCj) can
be derived for each polarity class respectively as below. For example, the
positive SMO for a topic set D(TRj) for Donald Trump and the negative
SMO for a topic set D(HCj) for Hillary Clinton are defined as follow:

SMOpos(TRj) = CMSEpos(D(TRj)) / (CMSEpos(D(TRj)) + |CMSEneg(D(TRj))|) (15)

SMOneg(HCj) = |CMSEneg(D(HCj))| / (CMSEpos(D(HCj)) + |CMSEneg(D(HCj))|) (16)
When α = 0, senti_score(di) = sentimenti, so CMSE and SMO are generated from counts of sentimenti in the data set. Then, the positive Sentiment Orientation (STO) for the topic set D(TRj) for Donald Trump and the negative STO for the topic set D(HCj) for Hillary Clinton are defined as follows:

(17)

(18)
where Weight(TRj) and Weight(HCj) are the weights of the topics over the entire data set, defined as follows. Therefore, STO(TRj) indicates the weighted polarity of the topic TRj over the entire data set D(Tj), where D(Tj)=D(TRj)∪D(HCj).

(19)

(20)
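A minimal sketch of how the SMO and STO measures could be combined is shown below. The specific formulas (the positive share within a topic set, and a topic weight taken as that topic's share of all documents) paraphrase the prose description above and are assumptions, not transcriptions of Equations (15)-(20).

def pos_smo(cmse_pos, cmse_neg):
    # Relative ratio of positive vs. negative orientation within one topic set.
    total = abs(cmse_pos) + abs(cmse_neg)
    return abs(cmse_pos) / total if total else 0.0

def topic_weight(n_topic_docs, n_all_docs):
    # Assumed topic weight: the topic's share of the entire data set D(T).
    return n_topic_docs / n_all_docs if n_all_docs else 0.0

def pos_sto(cmse_pos, cmse_neg, n_topic_docs, n_all_docs):
    # Sentiment Orientation: topic-level polarity weighted over the whole data set.
    return topic_weight(n_topic_docs, n_all_docs) * pos_smo(cmse_pos, cmse_neg)

# Hypothetical numbers for the Trump topic subset within D(T)
print(pos_sto(cmse_pos=1.25, cmse_neg=-0.3, n_topic_docs=53009, n_all_docs=61519))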

Deterministic Topic Model


The Deterministic Topic Model considers the context of the words in the texts and the subjectivity of the sentiment of the words given that context. Under the presumption that topic and sentiment can be jointly inferred, the Deterministic Topic Model measures the polarity strength of sentiment in the context of the user-provided topic word(s). It considers the subjectivity of each word (token) in di in D(Tj). Likelihoods were estimated as relative frequencies with the weighted subjectivity of a word. Using the OpinionFinder lexicon [28], the tweets were categorized and labeled by subjectivity and polarity. The six weight levels below define subjectivity: each token was assigned to one of the six strength groups and weighted on a subjectivity strength scale ranging from −2 to +2, where −2 denotes the strongest subjective negative word and +2 the strongest subjective positive word.
subjScale(wt) is defined as Subjectivity Strength Scale for each token wt
in di. The weight of each group is assigned as below for the 6 subjectivity
strength sets. Any token that does not belong to any of the 6 subjectivity
strength sets is set to 0.
strSubjPosW:= {set of strong positive subjective words}: +2
wkSubjPosW:= {set of weak positive subjective words}: +1
strSubjNeuW:= {set of strong neutral subjective words}: 0.5
wkSubjNeuW:= {set of weak neutral subjective words}: −0.5
wkSubjNegW:= {set of weak negative subjective words}: −1
strSubjNegW:= {set of strong negative subjective words}: −2
None:= none of the above: 0

(21)
where m is the number of tokens in di, so that −2m ≤ SentimentSubj(di) ≤ 2m. Note that the subjScale(wt) of a neutral word is not 0: we treat a strong neutral opinion as a weak positive and a weak neutral as a weak negative by assigning very small positive or negative weights. The sentiment of each di is then defined by the sum of the frequency of each subjectivity group weighted by its subjScale.

(22)

(23)
Then CMSESubj(D(T)) is the sum of the subjectivity-weighted opinion polarities for a given topic set D(T) with di∈D(T). It can be defined with senti_score_subj(di) as follows.

(24)

(25)

(26)
Then, we define our deterministic model ρε(Pos_Tj) as a length-normalized sum of the subjectivity-weighted senti_score_subj(di) for a given topic Tj with di∈D(Tj) as follows:

(27)
where D(T) is the set of documents (messages) in a tweet stream in which each document di∈D(T), 1 ≤ i ≤ n, contains one of the topic words tj∈T={t1,⋯,tk}, 1 ≤ j ≤ k. T is the set of all related topic words derived from the user-given topics, and Tj is the set of all topic words derived from a given query q, as defined in Section 4.1. D(Tj), a subset of D(T), is the set of documents in which each document di contains one of the related topic words in the topic set Tj, and n is the number of documents di in D(Tj).
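To make the subjectivity weighting concrete, the sketch below stubs out the six OpinionFinder strength groups with tiny example word sets and computes a length-normalized, subjectivity-weighted score for a topic-correlated document set. The stub lexicon, the tokenized example documents, and the exact normalization (dividing by the number of documents) are assumptions consistent with the prose, not reconstructions of Equations (21)-(27).

# Hypothetical stand-ins for the OpinionFinder subjectivity lexicon groups.
STRONG_POS = {"great", "excellent"}
WEAK_POS = {"good", "nice"}
STRONG_NEU = {"maybe"}
WEAK_NEU = {"perhaps"}
WEAK_NEG = {"bad"}
STRONG_NEG = {"terrible", "awful"}

def subj_scale(token):
    # Subjectivity strength scale subjScale(w_t) in [-2, +2].
    if token in STRONG_POS: return 2.0
    if token in WEAK_POS: return 1.0
    if token in STRONG_NEU: return 0.5   # strong neutral treated as weak positive
    if token in WEAK_NEU: return -0.5    # weak neutral treated as weak negative
    if token in WEAK_NEG: return -1.0
    if token in STRONG_NEG: return -2.0
    return 0.0                           # none of the six groups

def sentiment_subj(tokens):
    # Sum of subjectivity-weighted token scores; bounded by +/- 2 * len(tokens).
    return sum(subj_scale(t) for t in tokens)

def deterministic_score(topic_docs):
    # Assumed form: length-normalized sum of subjectivity-weighted document scores.
    if not topic_docs:
        return 0.0
    return sum(sentiment_subj(d) for d in topic_docs) / len(topic_docs)

docs = [["great", "rally"], ["terrible", "debate", "maybe"]]
print(deterministic_score(docs))  # 0.25 for these toy documents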

Probabilistic Topic Model


The probabilistic model adopts the SMO and STO measures of CSOM, combined with subjectivity, to derive a modified log-likelihood of the ratio of the subjectivity-weighted PosSMO and NegSMO over a given topic set D(T) and a subset D(Tj).

Our probabilistic model ρ, with a given topic set D(T) and D(Tj), measures the probability of the sentiment polarity of a given topic set D(Tj), where D(Tj) is a subset of D(T). For example, the probability of a positive opinion for Trump in D(T), denoted as P(Pos_TR), is defined as follows:

(28)
ϵ is a smoothing factor [30], and we consider a strong neutral subjectivity as a weak positivity here. Then, we define our probabilistic model ρ(Pos_TR) as

(29)
where NegativeInfo(TR) is essentially a subjectivity-weighted NegSMO(TRj), defined as follows.

(30)
Our probabilistic model applies a penalty equal to the weight of the negative opinion in the correlated topic set D(TR) when measuring the positive opinion of a topic over the given entire data set D(T).

(31)
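The probabilistic scoring idea can be sketched as a smoothed positive-opinion probability penalized by the topic's relative negative information. The smoothing constant, the logarithmic form, and the additive penalty below are assumptions in the spirit of the description, not reconstructions of Equations (28)-(31).

import math

def positive_probability(pos_mass, neg_mass, epsilon=1e-6):
    # Smoothed share of positive opinion mass within the topic-correlated set.
    return (pos_mass + epsilon) / (pos_mass + neg_mass + 2 * epsilon)

def probabilistic_score(pos_mass, neg_mass, total_mass, epsilon=1e-6):
    # Hypothetical rho(Pos_T): log-probability of positive opinion,
    # penalized by the topic's relative negative information.
    p_pos = positive_probability(pos_mass, neg_mass, epsilon)
    negative_info = neg_mass / total_mass if total_mass else 0.0
    return math.log(p_pos + epsilon) - negative_info

# Example with hypothetical subjectivity-weighted opinion masses
print(probabilistic_score(pos_mass=310.0, neg_mass=120.0, total_mass=900.0))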

Multinomial Naive Bayes


The fifth layer of our framework, the Prediction Layer, applies the Deterministic and Probabilistic sentiment models discussed in Section 4 to our predictive classifiers for event outcome prediction in a real-time environment. The predictive performance of each model was measured using a supervised predictive analytics model: Multinomial Naive Bayes.
Naive Bayes is a supervised probabilistic learning method that is popular for text categorization problems, i.e., judging whether documents belong to one category or another, because it is based on the assumption that each word occurrence in a document is independent, as in the “bag of words” model. Naive Bayes constructs a classifier that assigns class labels, drawn from a finite set, to problem instances represented as vectors of feature values [31]. We utilized the Multinomial model for text classification, which is based on a “bag of words” representation of a document [32]. Multinomial Naive Bayes models the distribution of words in a document as a multinomial: a document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other. For classification, we assume a fixed number of classes Ck∈{C1,C2,⋯,Cm}, each with a fixed set of multinomial parameters. The parameter vector for a class Ck is (Ck1,⋯,Ckn), where n is the size of the vocabulary and ∑iCki=1.

(32)
In the multinomial event model, a document is an ordered sequence of word events generated by a multinomial (p1,⋯,pn), where pi is the probability that event i occurs and xi is the feature-vector entry counting the number of times event i was observed in an instance [32]. Each document di is drawn from a multinomial distribution of words with as many independent trials as the length of di, yielding a “bag of words” representation of the documents [32]. Thus, the probability of a document given its class is modeled by one such multinomial for each of the k classes [32].
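For reference, a minimal bag-of-words text classification pipeline with a multinomial Naive Bayes classifier might look as follows. scikit-learn is used here only as a stand-in, since the paper does not state which implementation was used, and the tiny labeled tweets are fabricated placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled tweets (placeholders, not the paper's data set)
train_texts = [
    "great rally tonight", "terrible debate performance",
    "excellent speech", "awful policy proposal",
]
train_labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words counts feed a multinomial likelihood per class,
# with Laplace smoothing (alpha) to avoid zero word probabilities.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_texts, train_labels)

print(model.predict(["great speech"]))  # expected: ['pos']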

EXPERIMENTS
We applied the sentiment models discussed in Sections 4.2 and 4.3 to the real-time Twitter stream for the following events: the 2016 US Presidential election and the 2017 Inauguration. User opinion surrounding the political candidates and the corresponding election policies was identified, extracted, and measured in an effort to demonstrate SMDSSAS’s accurate critical decision-making capabilities.
A total of 74,310 topic-correlated tweets were collected from randomly chosen, continuous 30-second intervals in an Apache Spark DStream accessing the Twitter Streaming API during the pre-election month of October 2016, the pre-election week of November 2016, and the pre-inauguration week of January 2017. The context detector generated the topic word sets for the following topics: Hillary Clinton, Donald Trump, and political policies. The number of topic-correlated tweets for candidate Donald Trump was ~53,009, while the number for candidate Hillary Clinton was ~8510, far smaller than Trump’s.
Tweets were preprocessed with a custom cleaning function to remove all non-English characters, including the Twitter at “@” and hashtag “#” signs, image/website URLs, punctuation (“[. , ! “ ‘]”), digits ([0-9]), and non-alphanumeric characters ($ % & ^ * () + ~), and were stored in a NoSQL Hive database. Each topic-correlated tweet was labeled for sentiment using the OpinionFinder subjectivity word lexicon and the subjScale(wt) defined in Section 4.3, associating a numeric value with each word based on its polarity and subjectivity strength.
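A cleaning function along these lines might look as follows; the regular expressions approximate the rules described above (stripping URLs, @ mentions, hashtags, punctuation, digits, and other non-alphanumeric symbols) and are not the authors' code.

import re

def clean_tweet(text):
    # Approximate the described preprocessing of a raw tweet.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # image/website URLs
    text = re.sub(r"[@#]\w+", " ", text)                # @ mentions and # hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # punctuation, digits, other symbols
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("RT @user: Great rally!! #Election2016 https://t.co/xyz"))
# -> "rt great rally"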

Predicting the Outcome of 2016 Presidential Election in Pre-Election Weeks
Figure 2 shows the results of the sentiment orientation analysis for the two presidential candidates over several months of pre-election 2016 tweet traffic. We noted that the lowest positive polarity measure (0.11) for Donald Trump occurred during pre-election October, but it soared to more than double (0.26) in the election month of November and remained high (0.24) in pre-inauguration January 2017. His negative sentiment orientation was already much lower (0.022) than his positive orientation in October, and it kept dropping, reaching 0.016 for November and January.

Figure 2: Measuring pre-election sentiment orientation shift for the 2016 presidential election cycle.

In contrast, Hillary Clinton’s positive and negative sentiment orientation measures were consistently low during October and November; her positive sentiment measure ranged from 0.022 in October to 0.016 in November, almost ten times smaller than Trump’s, and it kept dropping, to 0.007 in January. Clinton’s negative orientation measure was ten times higher than Trump’s, ranging from 0.03 in October to 0.01 in November, and it decreased further to 0.009 in January.

Predicting with Deterministic Topic Model


Our Deterministic Topic Model, as discussed in Section 4.3, was applied to the November 2016 pre-election tweet streams. The positive polarity orientation for Donald Trump increased to 0.60, while the positive polarity measure for Hillary Clinton was 0.069. From our results shown in Figure 3(b) below, we observed a sharply increased positive sentiment orientation for candidate Donald Trump in the data streams during pre-election November, together with a far larger volume of Trump-correlated topic tweets (53,009) than Clinton-correlated tweets (8510) for the subjectivity-weighted CMSE shown in Figure 3(a). Our system identified Donald Trump as the definitive winner of the 2016 Presidential Election.

Cross Validation with Multinomial Naive Bayes Classifier for Deterministic and Probabilistic Models
Our cross-validation was performed with the following experiment settings and the assumption that, for a user-chosen time period and a user-given topic (event), data streams are collected from randomly chosen time frames, each in a 30-second window, from the same social platform, where the messages occur randomly for both candidates.

Figure 3: (a) Polarity comparison of the two candidates, Clinton vs. Trump, with CMSE and subjectivity-weighted CMSE; (b) Comparison of the positive sentiment measures of the two candidates with Pos_SMO and the deterministic model.
To validate parallel stream data processing, we adopted the method for evaluating big data stream classifiers proposed by Bifet et al. (2015) [7]. The standard K-fold cross-validation used in other works with batch methods treats each fold of the stream independently and therefore may miss concept drift occurring in the data stream. To overcome this problem, we employed the K-fold distributed cross-validation strategy [7] to validate stream data. Assume we have K different instances of the classifier that we want to evaluate running in parallel; the classifier does not need to be randomized. Each time a new example arrives, it is used in 10-fold distributed cross-validation: the example is used for testing by one randomly selected classifier and for training by all the others.
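Schematically, the strategy can be sketched as below: K classifier instances run in parallel, and each arriving labeled example is tested on one randomly chosen instance and used to train the rest. This is a simplified prequential-style loop with an assumed incremental-classifier interface (partial_fit/predict), not the StreamDM implementation cited in [7]; the tiny majority-vote classifier and the toy stream exist only to make the sketch runnable.

import random

class MajorityClassifier:
    # Trivial incremental classifier used only to exercise the loop below.
    def __init__(self):
        self.counts = {}
    def partial_fit(self, features, label):
        self.counts[label] = self.counts.get(label, 0) + 1
    def predict(self, features):
        return max(self.counts, key=self.counts.get) if self.counts else None

def kfold_distributed_cv(stream, make_classifier, k=10, seed=42):
    # Each example tests one randomly selected instance and trains all the others.
    rng = random.Random(seed)
    models = [make_classifier() for _ in range(k)]
    correct, tested = [0] * k, [0] * k
    for features, label in stream:
        test_idx = rng.randrange(k)
        for i, model in enumerate(models):
            if i == test_idx:
                tested[i] += 1
                correct[i] += int(model.predict(features) == label)
            else:
                model.partial_fit(features, label)
    accuracies = [c / t for c, t in zip(correct, tested) if t]
    return sum(accuracies) / len(accuracies) if accuracies else 0.0

stream = [([0.9], "pos"), ([0.1], "neg"), ([0.8], "pos"), ([0.7], "pos")]
print(kfold_distributed_cv(stream, MajorityClassifier, k=3))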
Ten-fold distributed cross-validation was performed on our stream data processing for two different data splits: 60%:40% and 90%:10% training-to-test data. The average accuracy was taken for each split and for each of the deterministic and probabilistic models. Each cross-validation run was performed with classifier optimization parameters, providing the model with a range of smoothing factors, term-frequency features, and numeric values for minimum document frequency. Figure 4 illustrates the accuracies of the deterministic and probabilistic models. Ten-fold cross-validation on the 90%:10% split with the Deterministic model showed the highest average accuracy, 81%, and the Probabilistic model showed an almost comparable average accuracy of 80%. In comparison with the existing works,
the overall average accuracy from the cross-validation of each model shows a 1% - 22% improvement over the previous work [6] [7] [8] [9] [22] [23] [24] [29] [30]. Figure 4 below illustrates the cross-validation results of the Deterministic and Probabilistic models.

CONCLUSIONS
The main contribution of this paper is the design and development of a real-time big data stream analytic framework, providing a foundation for an infrastructure for real-time sentiment analysis on big text streams. Our framework proved to be an efficient, scalable tool to extract, score, and analyze opinions on user-generated text streams for user-given topics in real time or near real time. The experiment results demonstrated the ability of our system architecture to accurately predict the outcome of the 2016 Presidential Race between candidates Hillary Clinton and Donald Trump. The proposed fully analytic Deterministic and Probabilistic sentiment models, coupled with the real-time streaming components, were tested on the user tweet streams captured during the pre-election month of October 2016 and the pre-election week of November 2016. The results showed that our system was able to correctly predict Donald Trump as the definitive winner of the 2016 Presidential election.

Figure 4: Average cross-validation prediction accuracy on real-time pre-election tweet streams of the 2016 presidential election for the deterministic vs. probabilistic model.

The cross-validation results showed that the Deterministic Topic Model in real-time processing consistently improved accuracy, averaging 81%, with the Probabilistic Topic Model averaging 80%, compared to the accuracies of previous works in the literature [6] [7] [8] [9] [22] [23] [24] [29] [30], which ranged from 59% to 80% and which lacked the complexity of sentiment analysis processing in either batch or real-time processing.
Finally, SMDSSAS performed efficient real-time data processing and sentiment analysis in terms of scalability. The system uses continuous processing of a small window of the data stream (e.g., consistent processing of a 30-second window of streaming data), in which machine learning analytics are performed on the context stream, resulting in more accurate predictions, since the system can continuously apply multi-layered, fully analytic processes with complex sentiment models to a constant stream of data. The improved and stable model accuracies demonstrate that our proposed framework with the two sentiment models offers a scalable real-time sentiment analytic alternative for big data stream analysis over traditional batch-mode data analytic frameworks.

ACKNOWLEDGEMENTS
The research in this paper was partially supported by the Engineering
College of CSU under the Graduate Research grant.

REFERENCES
1. Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K.,
Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A. and Zaharia, M.
(2015) Spark SQL: Relational Data Processing in SPARK. Proceedings
of the ACM SIGMOD International Conference on Management
of Data, Melbourne, 31 May-4 June 2015, 1383-1394. https://doi.
org/10.1145/2723372.2742797
2. Sagiroglu, S. and Sinanc, D. (2013) Big Data: A Review. 2013
International Conference on Collaboration Technologies and Systems
(CTS), San Diego, 20-24 May 2013, 42-47. https://doi.org/10.1109/
CTS.2013.6567202
3. Lars, E. (2015) What’s the Best Way to Manage Big Data for Healthcare:
Batch vs. Stream Processing? Evariant Inc., Farmington.
4. Hu, M. and Liu, B. (2004) Mining and Summarizing Customer Reviews.
Proceedings of the Tenth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Seattle, 22-25 August 2004,
168-177.
5. Liu, B. (2010) Sentiment Analysis and Subjectivity. In: Indurkhya, N.
and Damerauthe, F.J., Eds., Handbook of Natural Language Processing,
2nd Edition, Chapman and Hall/CRC, London, 1-38.
6. Wang, H., Can, D., Kazemzadeh, A., Bar, F. and Narayanan, S.
(2012) A System for Real-Time Twitter Sentiment Analysis of 2012
U.S. Presidential Election Cycle. Proceedings of ACL 2012 System
Demonstrations, Jeju Island, 10 July 2012, 115-120.
7. Bifet, A., Maniu, S., Qian, J., Tian, G., He, C. and Fan, W. (2015)
StreamDM: Advanced Data Mining in Spark Streaming. IEEE
International Conference on Data Mining Workshop (ICDMW),
Atlantic City, 14-17 November 2015, 1608-1611.
8. Kulkarni, S., Bhagat, N., Fu, M., Kedigehalli, V., Kellogg, C., Mittal,
S., Patel, J.M., Ramasamy, K. and Taneja, S. (2015) Twitter Heron:
Stream Processing at Scale. Proceedings of the 2015 ACM SIGMOD
International Conference on Management of Data, Melbourne, 31
May-4 June 2015, 239-250. https://doi.org/10.1145/2723372.2742788
9. Nair, L.R. and Shetty, S.D. (2015) Streaming Twitter Data Analysis
Using Spark For Effective Job Search. Journal of Theoretical and
Applied Information Technology, 80, 349-353.

10. Pang, B. and Lee, L. (2004) A Sentimental Education: Sentiment


Analysis Using Subjectivity Summarization Based on Minimum
Cuts. Proceedings of the 42nd Annual Meeting on Association for
Computational Linguistics, Barcelona, 21-26 July 2004, 271-278.
https://doi.org/10.3115/1218955.1218990
11. Harris, Z. (1954) Distributional Structure Word. WORD, 10, 146-162.
https://www.tandfonline.com/doi/abs/10.1080/00437956.1954.11659
520
12. Blei, D., Ng, A. and Jordan, N. (2003) Latent Dirichlet Allocation.
Journal of Machine Learning Research, 3, 993-1022.
13. Zhai, C. and Lafferty, J. (2004) A Study of Smoothing Methods for
Language Models Applied to Information Retrieval. ACM Transactions
on Information Systems, 22, 179-214.
14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. and Dean, J.
(2013) Distributed Representations of Words and Phrases and Their
Compositionality. Proceedings of the 26th International Conference on
Neural Information Processing Systems, Lake Tahoe, 5-10 December
2013, 3111-3119
15. Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013) Efficient
Estimation of Word Representations in Vector Space. Proceedings of
Workshop at ICLR, Scottsdale, 2-4 May 2013, 1-11.
16. Dai, A., Olah, C. and Le, Q. (2015) Document Embedding with
Paragraph Vectors. arXiv:1507.07998.
17. Le, Q. and Mikolov, T. (2014) Distributed Representations of Sentences
and Documents. Proceedings of the 31st International Conference on
Machine Learning (ICML-14), Beijing, 21-26 June 2014, II1188-
II1196.
18. Firth, J.R. (1930) A Synopsis of Linguistic Theory 1930-1955. In:
Firth, J.R., Ed., Studies in Linguistic Analysis, Longmans, London,
168-205.
19. Tang, J., Qu, M. and Mei, Q.Z. (2015) PTE: Predictive Text Embedding
through Large-Scale Heterogeneous Text Networks. arXiv:1508.00200.
20. Bo, P. and Lee, L. (2008) Opinion Mining and Sentiment Analysis.
In: de Rijke, M., et al., Eds., Foundations and Trends® in Information
Retrieval, James Finlay Limited, Ithaca, 1-135. https://doi.
org/10.1561/1500000011

21. Maite, T., Brooke, J., Tofiloski, M., Voll, K. and Stede, M. (2011)
Lexicon-Based Methods for Sentiment Analysis. Computational
Linguistics, 37, 267-307. https://doi.org/10.1162/COLI_a_00049
22. O’Connor, B., Balasubramanyan, R., Routledge, B. and Smith, N.
(2010) From Tweets to Polls: Linking Text Sentiment to Public Opinion
Time Series. Proceedings of the International AAAI Conference on
Weblogs and Social Media (ICWSM 2010), Washington DC, 23-26
May 2010, 122-129.
23. Cheng, K.M.O. and Lau, R. (2015) Big Data Stream Analytics
for Near Real-Time Sentiment Analysis. Journal of Computer and
Communications, 3, 189-195. https://doi.org/10.4236/jcc.2015.35024
24. Cheng, K.M.O. and Lau, R. (2016) Parallel Sentiment Analysis with
Storm. Transactions on Computer Science and Engineering, 1-6.
25. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N.,
Antony, S., Liu, H. and Murthy, R. (2010) Hive—A Petabyte Scale
Data Warehouse Using Hadoop. Proceedings of the International
Conference on Data Engineering, Long Beach, 1-6 March 2010, 996-
1005.
26. Manning, C., Surdeanu, A., Bauer, J., Finkel, J., Bethard, S. and
McClosky, D. (2014) The Stanford CoreNLP Natural Language
Processing Toolkit. Proceedings of 52nd Annual Meeting of the
Association for Computational Linguistics: System Demonstrations,
Baltimore, 23-24 June 2014, 55-60. https://doi.org/10.3115/v1/P14-
5010
27. Finkel, J., Grenager, T. and Manning, C. (2005) Incorporating Non-
Local Information into Information Extraction Systems by Gibbs
Sampling. Proceedings of the 43rd Annual Meeting of the Association
for Computational Linguistics (ACL 2005), Ann Arbor, 25-30 June
2005, 363-370.
28. Wilson, T., Wiebe, J. and Hoffman, P. (2005) Recognizing Contextual
Polarity in Phrase-Level Sentiment Analysis. Proceedings of the
Conference on Human Language Technology and Empirical Methods
in Natural Language Processing, Vancouver, 6-8 October 2005, 347-
354. https://doi.org/10.3115/1220575.1220619
29. Wang, S., Zhiyuan, C. and Liu, B. (2016) Mining Aspect-Specific
Opinion Using a Holistic Lifelong Topic Model. Proceedings of the

25th International Conference on World Wide Web, Montréal, 11-15


April 2016, 167-176. https://doi.org/10.1145/2872427.2883086
30. Yi, J., Nasukawa, T., Bunescu, R. and Niblack, W. (2003) Sentiment
Analyzer: Extracting Sentiments about a Given Topic Using Natural
Language Processing Techniques. Proceedings of IEEE International
Conference on Data Mining (ICDM), Melbourne, 22-22 November
2003, 1-8. https://doi.org/10.1109/ICDM.2003.1250949
31. Tilve, A. and Jain, S. (2017) A Survey on Machine Learning Techniques
for Text Classification. International Journal of Engineering Sciences
and Research, 6, 513-520.
32. Rennie, J., Shih, L., Teevan, J. and Karger, D. (2003) Tackling the
Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the
Twentieth International Conference on International Conference on
Machine Learning, Washington DC, 21-24 August 2003, 616-623.
Chapter 7

The Influence of Big Data Analytics in the Industry

Haya Smaya
Mechanical Engineering Faculty, Institute of Technology, MATE Hungarian University
of Agriculture and Life Science, Gödöllő, Hungary

ABSTRACT
Big data has appeared to be one of the most addressed topics recently, as
every aspect of modern technological life continues to generate more and
more data. This study is dedicated to defining big data, how to analyze it,
the challenges, and how to distinguish between data and big data analyses.
Therefore, a comprehensive literature review has been carried out to define
and characterize Big-data and analyze processes. Several keywords, which
are (big-data), (big-data analyzing), (data analyzing), were used in scientific
research engines (Scopus), (Science direct), and (Web of Science) to acquire
up-to-date data from the recent publications on that topic. This study shows
the viability of Big-data analysis and how it functions in the fast-changeable
world. In addition to that, it focuses on the aspects that describe and anticipate

Citation: Smaya, H. (2022), “The Influence of Big Data Analytics in the Industry”.
Open Access Library Journal, 9, 1-12. doi: 10.4236/oalib.1108383.
Copyright: © 2022 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0

Big-data analysis behaviour. Besides that, it is important to mention that


assessing the software used in analyzing would provide more reliable output
than the theoretical overview provided by this essay.

Keywords: Big Data, Information Tools

INTRODUCTION
The research background is dedicated to defining big data, how to analyze it,
the challenges, and how to distinguish between data and big data analyses.
Therefore, a comprehensive literature review has been carried out to define
and characterize Big-data and analyze processes. Several keywords, which
are (big-data), (big-data analyzing), (data analyzing), were used in scientific
research engines (Scopus), (Science direct), and (Web of Science) to acquire
up-to-date data from the recent publications on that topic.
The problem this paper addresses is to show the viability of Big-data analysis and how it functions in a fast-changing world. In addition to that, it focuses on the aspects that describe and anticipate Big-data analysis behaviour. Big Data is omnipresent, and there is an almost urgent
need to collect and protect whatever data is generated. In recent years, big
data has exploded in popularity, capturing the attention and investigations of
researchers all over the world. Because data is such a valuable tool, making
proper use of it may help people improve their projections, investigations,
and decisions [1]. The growth of science has driven everyone to mine and
consume large amounts of data for the company, consumer, bank account,
medical, and other studies, which has resulted in privacy breaches or
intrusions in many cases [2]. The promise of data-driven decision-making
is now widely recognized, and there is growing enthusiasm for the concept
of “Big Data,” as seen by the White House’s recent announcement of new
financing programs across many agencies. While Big Data’s potential is
real―Google, for example, is thought to have given 54 billion dollars to the US economy in 2009―there is no broad unanimity on this [3].
It is difficult to recall a topic that received so much hype as broadly and
quickly as big data. While barely known a few years ago, big data is one of
the most discussed topics in business today across industry sectors [4].
This study is dedicated to defining the Big-data concept, assessing its
viability, and investigating the different methods of analyzing and studying
it.

STATUS QUO OVERVIEW


This chapter provides a holistic assessment of the Big-data concept based on several studies carried out in the last ten years, in addition to the behaviour and features of big data and the methodologies for analyzing it.

Big-Data Definition and Concept


Big data analytics is the often-complex process of analyzing large amounts
of data to identify information such as hidden patterns, correlations, market
trends, and customer preferences that can assist businesses in making better
decisions [5]. Big Data is today’s biggest buzzword, and with the quantity
of data generated every minute by consumers and organizations around the
world, Big Data analytics holds a huge potential [6].
To illustrate the importance of Big-data in our world nowadays, Spotify is a good example of how Big-data works. Spotify has nearly 96 million users who generate a vast amount of data every day. By analyzing this data, the cloud-based platform automatically suggests songs using a smart recommendation engine. This huge amount of data consists of the likes, shares, search history, and every click in the application. Some researchers estimate that Facebook generates more than 500 terabytes of data every day, including photos, videos, and messages. Everything we do online in every industry relies on mostly the same concept; therefore, big data gets all this hype [7].
Generally, Big-data is a massive data set that cannot be stored, processed, or analyzed using traditional tools [8]. This data can exist in several forms, such as structured, semi-structured, and unstructured data. Structured data might be an Excel sheet that has a definite format, while semi-structured data could be represented by an email, for example. Unstructured data includes pictures and videos with no predetermined format. Combining all these types of data creates what is called Big-data (Figure 1) [9] [10].

Characteristics of Big-Data
Firstly, it is essential to differentiate between Big-data and structured data (which is usually stored in relational database systems) based on five parameters (Figure 2): 1) Volume, 2) Variety, 3) Velocity, 4) Value, and 5) Veracity. These parameters are usually referred to as the 5V’s and represent the main challenges of Big-data management:

1. Volume: Volume is the major challenge for Big-data and the paramount aspect that distinguishes it. Big-data volume is measured not in gigabytes but in terabytes (1 tera = 1000 giga) and petabytes (1 peta = 1000 tera). The cost of storing this tremendous amount of data is a hurdle for the data scientist to overcome.

Figure 1: Scheme of big-data analyzing the output [9].

Figure 2: 5V concept. The source [12].
2. Variety: Variety refers to the different data types, such as structured, unstructured, and semi-structured data, in relational database storage systems. The data can take such formats as documents, emails, social media text messages, audio, video, graphics, images, graphs, and the output of all types of machine-generated data from various sensors, devices, machine logs, cell phone GPS signals, and more [11].
3. Velocity: The motion of the data sets is a significant aspect for categorizing data types. Data-at-rest and data-in-motion are the terms that deal with velocity. The major concerns are the consistency and completeness of fast-paced data streams and obtaining results that match the desired output. Velocity also includes time and latency characteristics: whether the data is analyzed, processed, stored, managed, and updated at a fast rate or with a lag time between the events.
4. Value: Value deals with the value that should result from a set of data.
5. Veracity: Veracity describes the quality of the data: is the data noise-free and conflict-free? Accuracy and completeness are the concerns here.

BIG-DATA ANALYSIS

Viability of Big-Data Analysis


Big data analytics assists businesses in harnessing their data and identifying new opportunities. The results are smarter business decisions, more effective operations, higher profits, and happier consumers. More than 50 firms were interviewed for the publication Big Data in Big Companies (Figure 3) [13] to learn how they used big data.

Figure 3: Frequency distribution of documents containing the term “big data” in ProQuest Research Library. The source [6].
According to the report, they gained value in the following ways:
Cost reduction. When it comes to storing massive volumes of data, big data technologies like Hadoop and cloud-based analytics provide significant cost savings―and they can also help identify more effective ways of doing business.

Faster, better decision-making. Businesses can evaluate information


instantaneously―and make decisions based on what they’ve learned―
thanks to Hadoop’s speed and in-memory analytics, as well as the ability to
study new sources of data.
New products and services. With the capacity to use analytics to measure
client requirements and satisfaction comes the potential to provide customers
with exactly what they want. According to Davenport, more organizations
are using big data analytics to create new goods to fulfill the needs of their
customers.

Analyzing Process

Analyzing Steps
Data analysts, data scientists, predictive modellers, statisticians, and other
analytics experts collect, process, clean, and analyze increasing volumes of
structured transaction data, as well as other types of data not typically used
by traditional BI and analytics tools. The four steps of the data preparation
process are summarized below (Figure 4) [7]:
1) Data specialists gather information from a range of sources. It’s
usually a mix of semi-structured and unstructured information.
While each company will use different data streams, the following
are some of the most frequent sources:
• clickstream data from the internet

• web server logs


• cloud apps
• mobile applications
• social media content
• text from consumer emails and survey replies
• mobile phone records
• machine data collected by internet of things (IoT) sensors.

Figure 4: Circular process steps of data analysis [7].
2) The information is processed. Data professionals must organize, arrange, and segment the data effectively for analytical queries after it has been acquired and stored in a data warehouse or data lake. Analytical queries perform better when the data is processed thoroughly.
3) The data is filtered to ensure its quality. Data scrubbers use
scripting tools or corporate software to clean up the data.
They organize and tidy up the data, looking for any faults or
inconsistencies such as duplications or formatting errors.
4) Analytics software is used to analyze the data that has been collected, processed, and cleaned (a minimal end-to-end sketch follows this list). This includes tools such as:
• Data mining, which sifts through large data sets looking for
patterns and connections.
• Predictive analytics, which involves developing models to predict
customer behaviour and other future events.
• Machine learning, which makes use of algorithms to evaluate
enormous amounts of data.
• Deep learning, a branch of machine learning that is more
advanced.
• Program for text mining and statistical analysis.
• Artificial intelligence (AI).
• Business intelligence software that is widely used.
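To make the four steps concrete, a toy end-to-end pass with pandas might look like the following; the fabricated clickstream records, column names, and the simple deduplication and aggregation are illustrative assumptions, not part of the study.

import pandas as pd

# 1) Gather: a few fabricated semi-structured clickstream records.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "page": ["home", "home", "pricing", None, "home"],
    "timestamp": ["2016-11-01T10:00:00", "2016-11-01T10:00:00",
                  "2016-11-01T11:30:00", "2016-11-02T09:00:00",
                  "2016-11-02T09:05:00"],
})

# 2) Process: organize and segment the data for analytical queries.
events["timestamp"] = pd.to_datetime(events["timestamp"])
events["day"] = events["timestamp"].dt.date

# 3) Clean: drop duplicates and records with missing fields.
events = events.drop_duplicates().dropna(subset=["user_id", "page"])

# 4) Analyze: a simple descriptive query (daily views per page).
daily_views = events.groupby(["day", "page"]).size().rename("views").reset_index()
print(daily_views)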

Analyzing Tools
To support big data analytics procedures, a variety of tools and technologies
are used [14]. The following are some of the most common technologies and
techniques used to facilitate big data analytics processes:

• Hadoop is a free and open-source framework for storing and analyzing large amounts of data. Hadoop is capable of storing and processing enormous amounts of structured and unstructured data.
• Predictive analytics: hardware and software that process large volumes of complicated data and use machine learning and statistical algorithms to forecast future event outcomes. Predictive analytics technologies are used by businesses for fraud detection, marketing, risk assessment, and operations.
• Stream analytics tools are used to filter, combine, and analyze large amounts of data in a variety of formats and platforms.
• Distributed storage: data is usually replicated on a non-relational database. This can be a safeguard against independent node failures or the loss or corruption of large amounts of data, and it can provide low-latency access.
• NoSQL databases: non-relational data management systems that come in handy when dealing with vast amounts of scattered data. They are appropriate for raw and unstructured data because they do not require a fixed schema.
• A data lake is a big storage repository that stores raw data in native format until it is needed. A flat architecture is used in data lakes.
• A data warehouse is a data repository that holds vast amounts of data gathered from many sources. Predefined schemas are used to store data in data warehouses.
• Knowledge discovery/big data mining tools enable businesses to mine vast amounts of structured and unstructured big data.
• In-memory data fabric: large volumes of data are distributed across system memory resources. This contributes to minimal data access and processing latency.
• Data virtualization allows data to be accessed without any technical limitations.
• Data integration software enables big data to be streamlined across different platforms, including Apache Hadoop, MongoDB, and Amazon EMR.
• Data quality software cleans and enriches massive amounts of data.
• Data preprocessing software prepares data for further analysis: unstructured data is cleaned and the data is formatted.
• Spark is a free and open-source cluster computing framework for batch and real-time data processing (see the sketch below).
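As a small illustration of the last item, a batch word count with Spark's Python API (PySpark) could look like the following; the input path is a placeholder, and the job is only a minimal sketch of how Spark distributes such processing.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, lower, col

spark = SparkSession.builder.appName("WordCountExample").getOrCreate()

# Read a plain-text corpus (placeholder path) into a DataFrame of lines.
lines = spark.read.text("hdfs:///data/corpus/*.txt")

# Split lines into words and count occurrences in parallel across the cluster.
counts = (
    lines.select(explode(split(lower(col("value")), r"\s+")).alias("word"))
         .where(col("word") != "")
         .groupBy("word")
         .count()
         .orderBy(col("count").desc())
)

counts.show(10)
spark.stop()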

Different Types of Big Data Analytics


Here are the four types of Big Data analytics:
1) Descriptive Analytics: This summarizes previous data in an easy-
to-understand format. This aids in the creation of reports such as
a company’s income, profit, and sales, among other things. It also
aids in the tally of social media metrics.
2) Diagnostic Analytics: This is done to figure out what created the
issue in the first place. Drill-down, data mining, and data recovery
are all instances of techniques. Diagnostic analytics is used by
businesses to gain a deeper understanding of a problem.
3) Predictive Analytics: This sort of analytics examines past and
current data to create predictions. Predictive analytics analyzes
current data and makes forecasts using data mining, artificial
intelligence, and machine learning. It predicts customer and
market trends, among other things.
4) Prescriptive Analytics: This type of analytics recommends a remedy to a specific issue. Both descriptive and predictive analytics are used in prescriptive analytics. Most of the time, AI and machine learning are used.

Big Data Analytics Benefits


Big data analytics has several advantages, including the ability to swiftly
evaluate massive amounts of data from numerous sources in a variety of
forms and types (Figure 5).
Making better-informed decisions more quickly enables more successful strategizing, which can benefit and improve the supply chain, operations, and other areas of strategic decision-making.
Cost savings can be realized because of increased business process efficiencies and optimizations.
Greater marketing insights and information for product creation can come from a better understanding of client demands, behaviour, and sentiment.
Risk management tactics are improved and better informed as a result of huge data sample sizes [15].

Figure 5: Benefits of big-data analytics [15].

Big Data Analytics Challenges


Despite the numerous advantages of utilizing big data analytics, it is not
without its drawbacks [16]:
• Data accessibility. Data storage and processing grow increasingly difficult as the amount of data increases. To ensure that less experienced data scientists and analysts can use big data, it must be appropriately stored and managed.
• Ensuring data quality. Data quality management for big data necessitates a significant amount of time, effort, and resources due to the large volumes of data coming in from multiple sources and in varied forms.
• Data protection. Large data systems pose unique security
challenges due to their complexity. It might be difficult to
properly handle security risks within such a sophisticated big data
ecosystem.
• Choosing the appropriate tools. Organizations must know how
to choose the appropriate tool that corresponds with users’ needs
and infrastructure from the huge diversity of big data analytics
tools and platforms available on the market.

• Some firms are having difficulty filling the gaps due to a probable
lack of internal analytics expertise and the high cost of acquiring
professional data scientists and engineers.

The difference between Data Analytics and Big Data


1) Nature: Let’s look at an example of the key distinction between
Big Data and Data Analytics. Data Analytics is similar to a book
where you can discover solutions to your problems; on the other
hand, Big Data can be compared to a large library with all the
answers to all the questions, but it’s tough to locate them.
2) Structure of Data: Data analytics reveals that the data is already
structured, making it simple to discover an answer to a question.
However, Big Data is a generally unstructured set of data that
must be sorted through to discover an answer to any question
and processing such massive amounts of data is difficult. To gain
useful insight from Big Data, a variety of filters must be used.
3) Tools used in Big Data vs. Data Analytics: Because the data to be
analyzed is already structured and not difficult, simple statistical
and predictive modelling tools will be used. Because processing
the vast volume of Big Data is difficult, advanced technological
solutions such as automation tools or parallel computing tools
will be required to manage it.
4) Type of Industry using Big Data and Data Analytics: Industries
such as IT, travel, and healthcare are among the most common
users of data analytics. Using historical data and studying prior
trends and patterns, Data Analytics assists these businesses in
developing new advances. Simultaneously, Big Data is utilized
by a variety of businesses, including banking, retail, and others.
In a variety of ways, big data assists these industries in making
strategic business decisions.
Application of Data Analytics and Big Data
Data is the foundation for all of today’s decisions. Today, no choices or
actions can be taken without the data. To achieve success, every company
now employs a data-driven strategy. Data Scientists, Data Experts, and other
data-related careers abound these days.

Job Responsibilities of Data Analysts


1) Analyzing Trends and Patterns: Data analysts must foresee and
predict what will happen in the future, which can be very useful
in company strategic decision-making. In this situation, a data
analyst must recognize long-term trends [17]. He must also give
precise recommendations based on the patterns he has discovered.
2) Creating and Designing Data Report: A data scientist’s reports are
a necessary component of a company’s decision-making process.
Data scientists will need to construct the data report and design it
in such a way that the decision-maker can understand it quickly.
Data can be displayed in a variety of ways, including pie charts,
graphs, charts, diagrams, and more. Depending on the nature of
the data to be displayed, data reporting can also be done in the
form of a table.
3) Deriving Valuable Insights from the Data: To benefit the
organizations, Data Analysts will need to extract relevant and
meaningful insights from the data package. The company will
be able to use those valuable and unique insights to make the
greatest decision for its long-term growth.
4) Collection, Processing, and Summarizing of Data: A Data Analyst
must first collect data, then process it using the appropriate tools,
and finally summarize the information such that it is easily
comprehended. The summarized data can reveal a lot about the
trends and patterns that are used to forecast and predict things.
Job Responsibilities of Big Data Professionals
1) Analyzing Real-time Situations: Big Data professionals are in
high demand for analyzing and monitoring scenarios that occur in
real-time. It will assist many businesses in taking immediate and
timely action to address any issue or problem, as well as capitalize
on the opportunity [18]. Many businesses may cut losses, boost
earnings, and become more successful this way.
2) Building a System to Process Large Scale Data: Processing
large amounts of data promptly is difficult. Unstructured data
that cannot be processed by a simple tool is sometimes referred
to as Big Data. A Big Data Professional must create a complex
technological tool or system to handle and analyze Big Data to
make better decisions [19].

3) Detecting Fraud Transactions: Fraud is on the rise, and it is


critical to combat the problem. Big Data experts should be able
to spot any potentially fraudulent transactions. Many sectors,
particularly banking, have important duties in this area. Every
day, many fraudulent transactions occur in financial sectors, and
banks must act quickly to address this problem; otherwise, people will lose trust in the financial system in which they save their hard-earned money.

CONCLUSIONS
Gradually, the business sector is relying more and more on data science for its development. A tremendous amount of data is used to describe the behaviour of complex systems, anticipate the output of processes, and evaluate this output. Based on what we discussed in this essay, it can be stated that Big-data analytics is the cutting-edge methodology in data science alongside every other technological aspect, and studying this field comprehensively is essential for further development.
Several methods and software packages are commercially available for analyzing big-data sets. Each of them can relate to technology, business, or social media. Further studies using such analysis software could enhance the depth of the knowledge reported here and validate the results.

REFERENCES
1. Siegfried, P. (2017) Strategische Unternehmensplanung in jungen
KMU—Probleme and Lösungsansätze. de Gruyter/Oldenbourg Verlag,
Berlin.
2. Siegfried, P. (2014) Knowledge Transfer in Service Research—Service
Engineering in Startup Companies. EUL-Verlag, Siegburg.
3. Divesh, S. (2017) Proceedings of the VLDB Endowment. Proceedings
of the VLDB Endowment, 10, 2032-2033.
4. Su, X. (2012) Introduction to Big Data. In: Opphavsrett: Forfatter
og Stiftelsen TISIP, Institutt for informatikk og e-læring ved NTNU,
Zürich, Vol. 10, Issue 12, 2269-2274.
5. Siegfried, P. (2015) Die Unternehmenserfolgsfaktoren und deren kausale
Zusammenhänge. In: Zeitschrift Ideen-und Innovationsmanagement,
Deutsches Institut für Betriebs-wirtschaft GmbH/Erich Schmidt Verlag,
Berlin, 131-137. https://doi.org/10.37307/j.2198-3151.2015.04.04
6. Gandomi, A. and Haider, M. (2015) Beyond the Hype: Big Data
Concepts, Methods, and Analytics. International Journal of
Information Management, 35, 137-144. https://doi.org/10.1016/j.
ijinfomgt.2014.10.007
7. Lembo, D. (2015) An Introduction to Big Data. In: Application of Big
Data for National Security, Elsevier, Amsterdam, 3-13. https://doi.
org/10.1016/B978-0-12-801967-2.00001-X
8. Siegfried, P. (2014) Analysis of the Service Research Studies in the
German Research Field, Performance Measurement and Management.
Publishing House of Wroclaw University of Economics, Wroclaw,
Band 345, 94-104.
9. Cheng, O. and Lau, R. (2015) Big Data Stream Analytics for Near Real-
Time Sentiment Analysis. Journal of Computer and Communications,
3, 189-195. https://doi.org/10.4236/jcc.2015.35024
10. Abu-salih, B. and Wongthongtham, P. (2014) Chapter 2. Introduction
to Big Data Technology. 1-46.
11. Sharma, S. and Mangat, V. (2015) Technology and Trends to Handle
Big Data: Survey. International Conference on Advanced Computing
and Communication Technologies, Haryana, 21-22 February 2015,
266-271. https://doi.org/10.1109/ACCT.2015.121

12. Davenport, T.H. and Dyché, J. (2013) Big Data in Big Companies.
Baylor Business Review, 32, 20-21. http://search.proquest.com/docv
iew/1467720121?accountid=10067%5Cnhttp://sfx.lib.nccu.edu.tw/
sfxlcl41?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:jou
rnal&genre=article&sid=ProQ:ProQ:abiglobal&atitle=VIEW/REVIE
W:+BIG+DATA+IN+BIG+COMPANIES&title=Bay
13. Riahi, Y. and Riahi, S. (2018) Big Data and Big Data Analytics:
Concepts, Types and Technologies. International Journal of Research
and Engineering, 5, 524-528. https://doi.org/10.21276/ijre.2018.5.9.5
14. Verma, J.P. and Agrawal, S. (2016) Big Data Analytics: Challenges
and Applications for Text, Audio, Video, and Social Media Data.
International Journal on Soft Computing, Artificial Intelligence and
Applications, 5, 41-51. https://doi.org/10.5121/ijscai.2016.5105
15. Begoli, E. and Horey, J. (2012) Design Principles for Effective
Knowledge Discovery from Big Data. Proceedings of the 2012 Joint
Working Conference on Software Architecture and 6th European
Conference on Software Architecture, WICSA/ECSA, Helsinki, 20-24
August 2012, 215-218. https://doi.org/10.1109/WICSA-ECSA.212.32
16. Najafabadi, M.M., Villanustre, F., Khoshgoftaar, T.M., Seliya, N.,
Wald, R. and Muharemagic, E. (2015) Deep Learning Applications and
Challenges in Big Data Analytics. Journal of Big Data, 2, 1-21. https://
doi.org/10.1186/s40537-014-0007-7
17. Bätz, K. and Siegfried, P. (2021) Complexity of Culture and
Entrepreneurial Practice. International Entrepreneurship Review, 7,
61-70. https://doi.org/10.15678/IER.2021.0703.05
18. Bockhaus-Odenthal, E. and Siegfried, P. (2021) Agilität über
Unternehmensgrenzen hinaus—Agility across Boundaries, Bulletin of
Taras Shevchenko National University of Kyiv. Economics, 3, 14-24.
https://doi.org/10.17721/1728-2667.2021/216-3/2
19. Kaisler, S.H., Armour, F.J. and Espinosa, A.J. (2017) Introduction to Big
Data and Analytics: Concepts, Techniques, Methods, and Applications
Mini Track. Proceedings of the Annual Hawaii International Conference
on System Sciences, Hawaii, 4-7 January 2017, 990-992. https://doi.
org/10.24251/HICSS.2017.117
Chapter 8

Big Data Usage in the Marketing Information System

Alexandre Borba Salvador, Ana Akemi Ikeda


Faculdade de Administração, Economia e Ciências Contábeis, Universidade de São Paulo, São Paulo, Brazil

ABSTRACT
The increase in data generation, storage capacity, processing power, and analytical capacity has created a technological phenomenon named big data that can have a big impact on research and development. In the marketing field, the use of big data in research can represent a deep dive into consumer understanding. This essay discusses the uses of big data in the marketing information system and its contribution to decision-making. It presents a review of the main

Keywords: Big Data, Marketing Research, Marketing Information System

Citation: Salvador, A. and Ikeda, A. (2014), “Big Data Usage in the Marketing Infor-
mation System”. Journal of Data Analysis and Information Processing, 2, 77-85. doi:
10.4236/jdaip.2014.23010.
Copyright: © 2014 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0

INTRODUCTION
A solid information system is essential to obtain relevant data for the decision-making process in marketing. The more correct and relevant the information is, the greater the probability of success. The 1990s were known as the decade of the network society and of transactional data analysis [1]. However, in addition to this critical data, there is a great volume of less structured information that can be analyzed in order to find useful information [2]. The growth of data generation, storage capacity, processing power, and data analysis produced a technological phenomenon called big data. This phenomenon would cause great impacts on studies and lead to the development of solutions in different areas. In marketing, big data research can represent the possibility of a deep understanding of consumer behavior, through profile monitoring (geo-demographic, attitudinal, behavioral), the identification of areas of interest and preferences, and the monitoring of purchase behavior [3] [4]. The triangulation of the data available in real time with information previously stored and analyzed would enable the generation of insights that would not be possible through other techniques [5].
However, in order for big data information to be used correctly by companies, some measures are necessary, such as investment in people qualification and equipment. More than that, the increase in information access may generate ethics-related problems, such as invasion of privacy and redlining. It may affect research as well, as in cases where information could be used without the consent of the surveyed.
Predictive analytics comprises models that seek to predict consumer behavior through the data generated by purchase and/or consumption activities; with the advent of big data, predictive analytics grows in importance for understanding this behavior from the data generated in the online interactions among these people. The use of predictive systems can also be controversial, as exemplified by the case of the American chain Target, which identified the purchase behavior of women at the early stage of pregnancy and sent a congratulations letter to a teenage girl who had not yet informed her parents about the pregnancy. The case generated considerable negative repercussions and the chain suspended the action [4].
The objective of this essay is to discuss the use of big data in the context
of marketing information systems, present new possibilities resulting from
its use, and reflect on its limitations. For that, the point of view of researchers
and experts will be explored based on academic publications, which will

be analyzed and confronted so we may, therefore, infer conclusions on the


subject.

THE USE OF INFORMATION ON THE DECISION-MAKING PROCESS IN MARKETING
The marketing information system (MIS) was defined by Cox and Good
(1967, p. 145) [6] as a series of procedures and methods for the regular,
planned collection, analysis and presentation of information for use in
making marketing decisions. For Berenson (1969, p. 16) [7] , the MIS would
be an interactive structure of people, equipment, methods and controls,
designed to create a flow of information able to provide an acceptable
base for the decision-making process in marketing. The need for its
implementation would derive from points that have not changed yet: 1) the
increase in business complexity would demand more information and better
performance; 2) the life cycle of products would be shortened, requiring
more assertiveness from marketing managers to collect profits in shorter
times; 3) companies would become so large that the lack of effort to create
a structured information system would make its management impractical;
4) business would demand rapid decisions and therefore, in order to support
decision making, an information system would be essential for marketing
areas; 5) although an MIS is not dependent on computers, the advances in
hardware and software technologies would have spread its use in companies,
and not using its best resources would represent a competitive penalty [7] .
The data supplying an MIS can be structured or non-structured regarding
its search mechanisms and internal (company) or external (micro and macro
environment) regarding its origin. The classic and most popular way of
organizing it is through sub-systems [8] -[10] . The input and processing
sub-systems of an MIS are the internal registration sub-system (structured
and internal information), marketing intelligence sub-system (information
from secondary sources, non-structured and from external origins), and
the marketing research sub-system (information from primary sources,
structured, from internal or external origins, generated from a research
question).

BIG DATA
The term big data applies to information that cannot be processed using traditional tools or processes. According to an IBM report [11], the three characteristics that define big data are volume, speed, and variety; together they have created the need for new skills and knowledge in order to improve the ability to handle the information (Figure 1).
The Internet and the use of social media have transferred the
power of creating content to users, greatly increasing the generation of
information on the Internet. However, this represents a small part of the
generated information. Automated sensors, such as RFID (radio-frequency
identification), multiplied the volume of collected data, and the volume
of stored data in the world is expected to jump from 800,000 petabytes
(PB) in 2000 to 35 zettabytes (ZB) in 2020. According to IBM, Twitter
would generate by itself over 7 terabytes (TB) of data a day, while some
companies would generate terabytes of data in an hour, due to its sensors
and controls. With the growth of sensors and technologies that encourage
social collaboration through portable devices, such as smartphones, the data
became more complex, due to its volume and different origins and formats,
such as files originating from automatic control, pictures, books, reviews in
communities, purchase data, electronic messages and browsing data. The
traditional idea of data speed would consider its retrieval; however, with the great number of sensors capturing information in real time, the concern shifts to the speed of capture and analysis, leading to the concept of flow.

Figure 1. The three big data dimensions. Source: Adapted from Zikopoulos and Eaton, 2012.
The capture in batches is replaced by streaming capture. Big data, therefore, refers to a massive volume of information measured in zettabytes rather than terabytes, captured from different sources, in several formats, and in real time [11].
A work plan with big data should take three main elements into
consideration: 1) collection and integration of a great volume of new data
for fresh insights; 2) selection of advanced analytical models in order to
automate operations and predict results of business decisions; and 3)
creation of tools to translate model outputs into tangible actions and train
key employees to use these tools. Internally, the benefits of this work plan
would be greater corporate efficiency, driven by more relevant, accurate and timely information, more transparency in running operations, better predictions, and greater speed in simulations and tests [12].
Another change presented by big data is in the ownership of information.
Large information repositories used to be owned only by governmental organizations and major traditional corporations. Nowadays, new
corporations connected to technology (such as Facebook, Google, LinkedIn)
hold a great part of the information on people, and the volume is rapidly
increasing. Altogether, this information creates a digital trail for each person
and its study can lead to the identification of their profile, preferences and
even prediction of their behavior [5] .
Within business administration, new uses for the information are identified
every day, with promises of benefits for operations (productivity gains),
finance (control and scenario predictions), human resources (recruitment
and selection, salary, identification of retention factors) and research and
development (virtual prototyping and simulations). In marketing, the
information on big data can help to both improve information quality for
strategic planning in marketing and predict the definition of action programs.

USE OF BIG DATA IN THE MARKETING INFORMATION SYSTEM
Marketing can benefit from the use of big data information and many
companies and institutes are already being structured to offer digital research
and monitoring services. The use of this information will be presented
following the classical model of marketing information system proposed by
Kotler and Keller (2012) [10] .

Input Sub-Systems

Internal Reports
Internal reports became more complete and complex, involving information
and metrics generated by the company's digital properties (including
websites and fanpages), which would also increase the amount of information
on consumers, reaching beyond the data on customer profile. With the
increase of information from different origins and in different formats, a
richer internal database becomes the research source for business, markets,
clients and consumers insights, in addition to internal analysis.

Marketing Intelligence
If, on one hand, the volume of information originating from marketing intelligence increases, on the other hand, it is concentrated on an area
with more structured search and monitoring tools, with easier storage and
integration. Reading newspapers, magazines and sector reports gains a new
dimension with the access to global information in real time, changing the
challenge of accessing information to selection of valuable information,
increasing, therefore, the value of digital clipping services. The monitoring
of competitors gains a new dimension since brand changes, whether local or
global, can be easily followed up. The services of brand monitoring increase,
with products such as GNPD by Mintel [13], the Buzz Monitor by E.Life [14], SCUP and Bluefin.

Marketing Research
With the growth of the Internet and the increase of virtual communities, studying online behavior became, at the same time, an opportunity and a necessity. Netnography draws on ethnography when proposing to study group behavior through observation in the group's natural environment. In this regard, ethnography (and netnography) has the characteristic of minimizing the setbacks of behavior change by not removing the object of study from its habitat, as many other group-study techniques do. However, academic publications
have not reached an agreement on technique application and analysis depth
[15] -[17] . Kozinets (2002, 2006) [16] [17] proposes a deep study, in which
the researcher needs to acquire great knowledge over the object group and
monitor it for long periods, while Gebera (2008) [15] is not clear about
such need of deep knowledge of the technique, enabling the understanding
of that which could be similar to a content analysis based on digital data. For the former, just as in ethnography, ethical issues become more important, as the researcher should ask for permission to monitor the group and make their presence known; for the latter, netnography would not require such observer presentation, since the data collected are public. The great volume of
data captured by social networks could be analyzed using netnography.
One of the research techniques that have been gaining ground in the
digital environment is the content analysis due to, on one hand, the great
amount of data available for analysis on several subjects, and, on the other
hand, the spread of free automated analysis tools, such as Many Eyes by
IBM [18] , which offers cloud resources on terms, term correlation, scores
and charts, among others. The massive volume of information in big data provides a great increase in sample size and, in some cases, enables research on the entire population, with "n = all" [4].
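As a hedged illustration of such automated content analysis (not tied to any particular tool such as Many Eyes), the short Python sketch below counts term frequencies and simple term co-occurrences over a handful of made-up consumer comments; the comments and the stop-word list are invented for the example.

from collections import Counter
from itertools import combinations
import re

# Hypothetical consumer comments collected from a community or fan page
comments = [
    "Love the new flavor, but the price is too high",
    "Price is fair and delivery was fast",
    "Delivery took a week, flavor is great though",
]
STOP_WORDS = {"the", "is", "and", "but", "a", "was", "too", "though"}

def tokenize(text):
    # Lower-case the text and keep only alphabetic tokens not in the stop list
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

term_counts = Counter()
pair_counts = Counter()
for comment in comments:
    tokens = set(tokenize(comment))
    term_counts.update(tokens)                           # term frequency
    pair_counts.update(combinations(sorted(tokens), 2))  # term co-occurrence per comment

print(term_counts.most_common(5))
print(pair_counts.most_common(5))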

Storage, Retrieval and Analysis


With the massive increase of the information volume and complexity,
the storage, retrieval and analysis activities are even more important with
big data. Companies that are not prepared to deal with the challenge find
support in outsourcing the process [11] . According to Soat (2013) [19] ,
the attribution of scores for information digitally available (e-scores) would
be one of the ways of working with information from different origins,
including personal data (collected from loyalty programs or e-mail messages), browsing data collected through cookies, and third-party data collected from financial institutions, censuses and credit cards. The information
analysis would enable the company to develop the client’s profile and
present predictive analyses that would guide marketing decisions, such as
identification of clients with greater lifetime value.
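As a rough, hedged sketch of the e-score idea (the signal fields, weights and example clients below are invented for illustration and do not come from Soat's article), a company could combine data of different origins into a single score and use it to rank clients by a crude lifetime-value proxy.

from dataclasses import dataclass

@dataclass
class ClientSignals:
    # Hypothetical signals gathered from loyalty programs, cookies and third-party data
    purchases_last_year: int
    avg_ticket: float           # average purchase value
    site_visits_last_month: int
    credit_risk: float          # 0 (low risk) to 1 (high risk), from an external source

def e_score(c: ClientSignals) -> float:
    # Toy weighted score: higher means a potentially more valuable client
    return (2.0 * c.purchases_last_year
            + 0.1 * c.avg_ticket
            + 0.5 * c.site_visits_last_month
            - 3.0 * c.credit_risk)

clients = {
    "A": ClientSignals(12, 80.0, 9, 0.1),
    "B": ClientSignals(2, 150.0, 1, 0.6),
}
ranked = sorted(clients, key=lambda name: e_score(clients[name]), reverse=True)
print(ranked)  # clients ordered by the estimated lifetime-value proxy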

Information for the Decision-Making Process in Marketing


The marketing information system provides information for strategic
(structure, segmentation and positioning) and operational (marketing mix)
decision making. The use of big data in marketing will be analyzed below
under those perspectives.

Segmentation and Positioning


For Cravens and Piercy (2008) [20] , a segmentation strategy includes
market analysis, identification of the market to be segmented, evaluation on
how to segment it, and the definition of micro-segmentation strategies. A market
analysis can identify segments that are unacknowledged or underserved by
the competitors. To be successful, a segmentation strategy needs to seek
identifiable and measurable, substantial, accessible, responsive and viable
groups.
Positioning can be understood as the key characteristic, benefit or image
that a brand represents for the collective mind of the general public [21] . It
is the action of projecting the company’s offer or image so that it occupies
a distinctive place in the mind of the target public [10] . Cravens and Piercy
(2008, p. 100) [20] connect the segmentation activity to the positioning
through identification of valuable opportunities within the segment.
Segmenting means identifying the segment that is strategically important to
the company, whereas positioning means occupying the desired place within
the segment.
Digital research and monitoring tools enable studies on the consumer
behavior to be used in behavioral segmentation. The assignment of scores
and the use of advanced analyses help to identify and correlate variables,
define predictive algorithms to be used in market sizing and
lifetime value calculations [19] [22] . The netnographical studies are also
important sources to understand the consumer behavior and their beliefs
and attitudes, providing relevant information to generate insights and define
brand and product positioning.

Product
From the positioning, the available information should be used to define
the product attributes, considering the value created for the consumer.
Information on consumer preferences and manifestations in communities
and forums are inputs for the development and adjustment of products, as
well as for the definition of complementary services. The consumer could
also participate in the product development process by offering ideas and
evaluations in real time.
The development of innovation could also benefit from big data, both
by surveying insights with the consumers and by using the information to
develop the product, or even to improve the innovation process through
the use of information, benefiting from the history of successful products,
analyses of the process stages or queries to an idea archive [23] . As an
improvement to the innovation process, the studies through big data would
enable the replication of Cooper's studies in order to define a more efficient
innovation process, by exploring the boundary between the marketing
research and the research in marketing [24] .

Distribution
In addition to the browsing location in the digital environment and the
monitoring of visitor indicators, exit rate, bounce rate and time per page,
the geolocation tools enable the monitoring of the consumers’ physical
location and how they commute. More than that, the market and consumer
information from big data enables assessing, in a more holistic manner, the
variables that affect the decisions on distribution and location [25] .

Communication
Big data analysis enables the emergence of new forms of communication
research through the observation of how the audience interacts with the
social networks. From their behavior analysis, new insights on their
preferences and idols [3] may emerge to define the concepts and adjust
details of the campaign execution. Moreover, the online interaction that occurs while brands run offline actions enables the creation and follow-up of indicators to monitor communication [3] [26], whether quantitative or qualitative.
The increase of information storage, processing and availability enables
the application of the CRM concept to B2C clients, involving the activities
of gathering, processing and analyzing information on clients, providing
insights on how and why clients shop, optimizing the company processes,
facilitating the client-company interaction, and offering access to the client's information across the company.

Price
Even offline businesses will be strongly affected by the use of online price information. Research by the Google Shopper Marketing Council [27], published in April 2013, shows that 84% of American consumers consult
their smartphones while shopping in physical stores and 54% use them to
compare prices. According to Vitorino (2013) [4] , the price information
available in real time, together with the understanding of the consumers’
opinion and factors of influence (stated opinions, comments on experiences,
browsing history, family composition, period since last purchase, purchase
behavior), combined with the use of predictive algorithms, would change the pricing dynamics and could, at the limit, provide inputs for a customized price decision at every transaction.
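A minimal, hedged sketch of such a customized price decision follows; the signals, weights and price bounds are invented for illustration and are not taken from Vitorino's work.

def personalized_price(base_price: float,
                       competitor_price: float,
                       days_since_last_purchase: int,
                       positive_sentiment: float) -> float:
    # Toy rule-based adjustment; positive_sentiment is assumed to lie in [0, 1]
    price = base_price
    # Stay close to the lowest observed online competitor price
    price = min(price, competitor_price * 1.05)
    # Try to win back lapsed customers with a small discount
    if days_since_last_purchase > 90:
        price *= 0.95
    # Satisfied, loyal customers tend to be less price-sensitive
    if positive_sentiment > 0.8:
        price *= 1.02
    # Keep the final price within 15% of the base price
    return round(max(0.85 * base_price, min(1.15 * base_price, price)), 2)

print(personalized_price(100.0, 98.0, 120, 0.9))  # e.g. 96.9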

LIMITATIONS
Due to the lack of a culture that cultivates the proper use of information and
to a history of high costs for storage space, a lot of historical information
was lost or simply not collected at all. A McKinsey study with retail
companies observed that the chains were not using all the potential of
the predictive systems due to the lack of: 1) historical information; 2)
information integration; and 3) minimum standardization between the
internal and external information of the chain [28] -[30] . The greater the
historical information, the greater the accuracy of the algorithm, provided
that the environment in which the system is implemented remains stable.
Biesdorf, Court and Willmott (2013) [12] highlight the challenge of
integrating information from different functional systems, legacy systems
and information generated out of the company, including information from
the macro environment and social networks.
Not having qualified people to guide studies and handle systems and interfaces is also a limiting factor for research [23], at least in the short term. According to Gobble (2013) [23], a McKinsey report identifies the need for 190,000 qualified people to work in data analysis-related positions today. The qualification of the front line should follow the development of user-friendly interfaces [12]. In addition to the people directly connected to the analytics, Don Schultz (2012) [31] also highlights the need for people with "real life" experience, able to interpret the information generated by the algorithms: "If the basic understanding of the customer isn't there, built into the analytical models, it really doesn't matter how many iterations the data went through or how quickly. The output is worthless" (SCHULTZ, 2012, p. 9).
The differentiated management of clients through CRM already faces a series of criticisms and limitations. Regarding the application of CRM to service marketing, its limitations would lie in the fact that a reference based only on history may not reflect the client's real potential; the
unequal treatment of clients could generate conflicts and dissatisfaction of
clients not listed as priorities; and ethical issues involving privacy (improper
information sharing) and differential treatment (such as redlining). These
issues can be also applied in a bigger dimension in discussions about the use
of information from big data in marketing research and its application on
clients and consumers.
Predictive models are based on the assumption that the environment where the analytical system is implemented remains stable, which, by itself, is a limitation to the use of information. In addition to this, and to the need to invest in infrastructure or spend on outsourcing, the main limitations in the use of big data are connected to three main factors: data shortage and inconsistency, qualified people, and proper use of the information. The full automation of decisions through predictive models [5] also represents a risk, since no matter how good a model is, it is still a binary way of understanding a limited theoretical situation. At least for now, the
analytical models would be responsible for performing the analyses and
recommendations, but the decisions would still be the responsibility of
humans.
Nunan and Di Domenico (2013) [5] have also emphasized that people's
behavior and their relationships in social networks may not accurately
reflect their behavior offline, and the first important thing to do would be to
increase the understanding level of the relation between online and offline
social behavior. However, if on one hand people control the content of the
intentionally released information in social networks, on the other hand, a
great amount of information is collected invisibly, compounding their digital
trail. The use of information without the awareness and permission of the studied person raises questions of research ethics [15]-[17]. Figure 2 suggests a continuum between the information that clients make available wittingly and the information made available unwittingly to predictive systems. The consideration of the ethical issues raised by Kozinets (2006) [17] and Nunan and Di Domenico (2013) [5] reinforces the importance of increasing clients' level of awareness regarding the use of their information, or of ensuring the non-customization of analyses of information obtained unwittingly by companies.

FINAL CONSIDERATIONS
This study discussed the use of big data in the context of the marketing information system. What became clear is that we are still at the beginning of a journey of understanding its possibilities and uses, and we can observe the great attention generated by the subject and the increasing ethical concern. As proposed by Nunan and Di Domenico (2013) [5], self-governance via ESOMAR (European Society for Opinion and Market Research) [32] is an alternative to fight abuses and excesses and enable the good use of the information. Nunan and Di Domenico (2013) [5] propose to include in the
current ESOMAR [32] rules the right to be forgotten (possibility to request
deletion of history), the right to have the data expired (complementing
the right to be forgotten, the transaction data could also expire), and the
ownership of a social graph (an individual should be aware of the information
collected about them).

Figure 2: Continuum between awareness and non-awareness regarding the use of information. Source: authors. The continuum ranges from non-confidential information provided with greater awareness and consent (open data from websites and personal pages, open comments, information on interests such as Like, Follow and RT, and aware participation in online surveys), through personal information, interests, relationships, geographical location (GPS), the trail until the purchase, online behavior and the purchase registry, down to confidential information obtained with greater unawareness and non-consent, such as confidential documents, e-mail content and bank data.

In marketing communication, self-governance in Brazil has shown positive results, such as the examples of the liquor and children's food industries, which, under the pressure of public opinion, adopted restrictive measures to repress abuses and maintain the communication of their categories [33]. Industries such as cigarettes are opposite examples of how excess has led to great restrictions on those categories. As in the prisoners' dilemma [34], self-governance forces a solution in which all participants have to give up their best short-term individual options for the good of the group in the long term (Figure 3).
On the other hand, if the consumer's consent to the use of their information would solve the ethical issues, companies have never had so much power to create value for their clients and consumers. Recovering the marketing application proposed in "Broadening the Concept of Marketing" [35], the exchange for consent could be performed by offering a greater non-pecuniary value. This value offer could be the good use of the information to generate services or new proposals that increase the value perceived by the client [10]. Currently, many mobile applications offer services to consumers, apparently free of charge, in exchange for their audience for advertisements and access to their information in social networks. By understanding which service, consistent with its business proposal, the consumer sees value in, and by making this exchange clear, offering the service in exchange for consent to use the information could be a solution to access information in an ethical manner.
From the point of view of marketing research, the development of
retrieval systems and the analyses of great volumes of non-structured
information could lead to the understanding of consumer behaviors. Issues
regarding the findings and understanding of the consumers in marketing
research are addressed qualitatively. However, due to the volume of cases,
could the studies, through big data, provide at the same time the understanding
on the consumer and the measurement of the groups with this behavior? A
suggestion for future research would be to study the combination of qualitative and quantitative research objectives with the use of big data and analytical systems in understanding consumer behavior and measuring the importance of groups.

Figure 3: Free exercise of the prisoners' dilemma application. Source: Authors, based on Pindyck and Rubinfeld (1994). The four quadrants can be read as follows: 1) all people exceed and I exceed: excesses in invasion of privacy and in communication, high investment with results shared among all, and an impaired society; 2) all people exceed and I do not exceed: excesses in invasion of privacy and in communication, little information and visibility for those who do not exceed, and both society and those who do not exceed are impaired; 3) all people do not exceed and I exceed: no invasion of privacy and little relevant communication, low investments in communication, high visibility and results for those who exceed, so society and those who exceed are benefited while those who do not exceed are impaired; 4) all people do not exceed and I do not exceed: no invasion of privacy and little relevant communication, low investments in communication with results shared among all, and society is benefited.

REFERENCES
1. Chow-White, P.A. and Green, S.E. (2013) Data Mining Difference
in the Age of Big Data: Communication and the Social Shaping of
Genome Technologies from 1998 to 2007. International Journal of
Communication, 7, 556-583.
2. ORACLE: Big Data for Enterprise. http://www.oracle.com/br/
technologies/big-data/index.html
3. Paul, J. (2012) Big Data Takes Centre Ice. Marketing, 30 November
2012.
4. Vitorino, J. (2013) Social Big Data. São Paulo, 1-5. www.elife.com.br
5. Nunan, D. and Domenico, M.Di. (2013) Market Research and the Ethics
of Big Data Market Research and the Ethics of Big Data. International
Journal of Market Research, 55, 2-13.
6. Cox, D. and Good, R. (1967) How to Build a Marketing Information
System. Harvard Business Review, May-June, 145-154. ftp://donnees.
admnt.usherbrooke.ca/Mar851/Lectures/IV
7. Berenson, C. (1969) Marketing Information Systems. Journal of
Marketing, 33, 16. http://dx.doi.org/10.2307/1248668
8. Chiusoli, C.L. and Ikeda, A. (2010) Sistema de Informação de Marketing (SIM): Ferramenta de apoio com aplicações à gestão empresarial. Atlas, São Paulo.
9. Kotler, P. (1998) Administração de marketing. 5th Edition, Atlas, São Paulo.
10. Kotler, P. and Keller, K. (2012) Administração de marketing. 14th Edition, Pearson Education, São Paulo.
11. Zikopoulos, P. and Eaton, C. (2012) Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw Hill, New York, 166.
12. Biesdorf, S., Court, D. and Willmott, P. (2013) Big Data: What’s Your
Plan? McKinsey Quarterly, 40-41.
13. MINTEL. www.mintel.com
14. E. Life. www.elife.com.br
15. Gebera, O.W.T. (2008) La netnografía: Un método de investigación en Internet. Quaderns Digitals: Revista de Nuevas Tecnologías y Sociedad, 11. http://dialnet.unirioja.es/servlet/articulo?codigo=3100552
16. Kozinets, R. (2002) The Field behind the Screen: Using Netnography
for Marketing Research in Online Communities. Journal of Marketing
Research, 39, 61-72. http://dx.doi.org/10.1509/jmkr.39.1.61.18935
17. Kozinets, R.W. (2006) Click to Connect: Netnography and Tribal
Advertising. Journal of Advertising Research, 46, 279-288. http://
dx.doi.org/10.2501/S0021849906060338
18. Many Eyes. http://www.manyeyes.com/software/analytics/manyeyes/
19. Soat, M. (2013) E-SCORES: The New Face of Predictive Analytics.
Marketing Insights, September, 1-4.
20. Cravens, D.W. and Piercy, N.F. (2008) Marketing estratégico. 8th Edition, McGraw Hill, São Paulo.
21. Crescitelli, E. and Shimp, T. (2012) Comunicação de Marketing: Integrando propaganda, promoção e outras formas de divulgação. Cengage Learning, São Paulo.
22. Payne, A. and Frow, P. (2005) A Strategic Framework for Customer
Relationship Management. Journal of Marketing, 69, 167-176. http://
dx.doi.org/10.1509/jmkg.2005.69.4.167
23. Gobble, M.M. (2013) Big Data: The Next Big Thing in Innovation.
Research-Technology Management, 56, 64-67. http://dx.doi.org/10.54
37/08956308X5601005
24. Cooper, R.G. (1990) Stage-Gate Systems: A New Tool for Managing New Products. Business Horizons, 33, 44-54.
25. Parente, J. (2000) Varejo no Brasil: Gestão e Estratégia. Atlas, São Paulo.
26. Talbot, D. (2011) Decoding Social Media Patterns in Tweets A Social-
Media Decoder. Technology Review, December 2011.
27. Google Shopper Marketing Agency Council (2013) Mobile In-Store Research: How In-Store Shoppers Are Using Mobile Devices, 37.
http://www.marcresearch.com/pdf/Mobile_InStore_Research_Study.
pdf
28. Bughin, J., Byers, A. and Chui, M. (2011) How Social Technologies
Are Extending the Organization. McKinsey Quarterly, 1-10. http://
bhivegroup.com.au/wp-content/uploads/socialtechnology.pdf

29. Bughin, J., Livingston, J. and Marwaha, S. (2011) Seizing the Potential
of “Big Data.” McKinsey …, (October). http://whispersandshouts.
typepad.com/files/using-big-data-to-drive-strategy-and-innovation.pdf
30. Manyika, J., Chui, M., Brown, B. and Bughin, J. (2011) Big Data:
The Next Frontier for Innovation, Competition, and Productivity. 146.
www.mckinsey.com/mgi
31. Schultz, D. (2012) Can Big Data Do It All? Marketing News, November, 9.
32. ESOMAR. http://www.esomar.org/utilities/news-multimedia/video.
php?idvideo=57
33. CONAR. Conselho Nacional de Auto-regulamentação Publicitária.
http://www.conar.org.br/
34. Pindyck, R.S. and Rubinfeld, D.L. (1994) Microeconomia. Makron
Books, S?o Paulo.
35. Kotler, P. and Levy, S.J. (1969) Broadening the Concept of Marketing.
Journal of Marketing, 33, 10-15. http://dx.doi.org/10.2307/1248740
Chapter 9

Big Data for Organizations: A Review

Pwint Phyu Khine1, Wang Zhao Shun1,2

1 School of Information and Communication Engineering, University of Science and Technology Beijing (USTB), Beijing, China
2 Beijing Key Laboratory of Knowledge Engineering for Material Science, Beijing, China

ABSTRACT
Big data challenges current information technologies (the IT landscape) while promising more competitive and efficient contributions to business organizations. What big data can contribute is what organizations have wanted for a long time. This paper presents the nature of big data and how organizations can advance their systems with big data technologies. By improving the efficiency and effectiveness of organizations, people can take advantage of the more convenient life enabled by information technology.

Citation: Khine, P. and Shun, W. (2017), “Big Data for Organizations: A Review”.
Journal of Computer and Communications, 5, 40-48. doi: 10.4236/jcc.2017.53005.
Copyright: © 2017 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0

Keywords: Big Data, Big Data Models, Organization, Information System

INTRODUCTION
Business organizations have been using big data to improve their competitive advantage. According to McKinsey [1], organizations which can fully apply big data gain competitive advantages over their competitors. Facebook users upload hundreds of terabytes of data each day, and these social media data are used to develop more advanced analyses whose aim is to extract more value from user data. Search engines like Google and Yahoo already monetize their data by associating appropriate ads with user queries (i.e., Google uses big data to deliver the right ads to the right user in a split second). In applying information systems to improve their organizational systems, most government organizations lag behind business organizations [2]. Meanwhile, some governments have already taken the initiative to gain the advantages of big data; e.g., the Obama administration announced an investment of more than $200 million for big data R&D in scientific foundations in 2012 [3]. Today, people are living in the data age, where data becomes oxygen to people, as organizations produce more data than they can handle, leading to the big data era.
This paper is organized as follows: Section II describes big data definitions, the differences and sources of big data, big data characteristics, and the databases and ELT process of big data. Section III is mainly concerned with the relationship between big data information systems and organizations, how a big data system should be implemented, and big data core techniques for organizations. Section IV is the conclusion of the paper.

BIG DATA FOR ORGANIZATIONS


The nature of Big Data can be expressed by studying the big data definition,
the data hierarchy and sources of big data, and its prominent characteristics,
databases and processes. According to [1], there are five organizational domains in which big data can create value, selected based on the size of their potential: health care, manufacturing, public sector administration, retail and global personal location data. There are many other organizations which require big data solutions, such as those engaged in scientific discovery (e.g. astronomical organizations, weather prediction) with huge amounts of data.

Big Data Definition


Big data refers to the world of digital data which becomes too enormous to be handled by traditional data handling techniques. Big data is defined here
as “a large volume of digital data which require different kinds of velocity
based on the requirements of the application domains which has a wide
variety of data types and sources for the implementation of the big data
project depending on the nature of the organization.”
Big data can be further categorized into Big Data Science and Big data
framework [4]. Big data science is “the study of techniques covering the
acquisition, conditioning, and evaluation of big data”, whereas big data
frameworks are “software libraries along with their associated algorithms
that enable distributed processing and analysis of big data problems across
clusters of computer units.” It is also stated that “an instantiation of one or
more big data frameworks is known as big data infrastructure.”

Big Data Differences and Sources within Data


According to the basic data hierarchy described in Table 1, different levels of computer systems have emerged, based on the nature of the application domains and organizations, to extract the required value from data at the required hierarchy level. Big data, instead, tries to get value starting from the "data" step itself, by applying big data theories and techniques regardless of the type and level of the information system.
Based on the movement of data, data can be classified into “data in
motion" and "data at rest". "Data in motion" means data which have not yet been stored in a storage medium, i.e. moving data such as streaming data coming from IoT devices; they need to be handled almost in real time and require interactive processing. "Data at rest" are data that can be retrieved from storage systems, such as data from warehouses, RDBMS (Relational Database Management System) databases, file systems, e.g. HDFS (Hadoop Distributed File System), etc.

Table 1: Hierarchy of data

Data: any piece of raw information that is unprocessed, e.g. name, quality, sound, image, etc.
Information: data processed into a useful form becomes information, e.g. employee information (data about an employee).
Knowledge: information enriched by adding content from human experts becomes knowledge, e.g. pension data about an employee.
Business Insight: information extracted and used in a way that helps improve business processes, e.g. predicting the trends of customer buying patterns based on current information.

The traditional "bring the data to the operations" style is not suitable for voluminous big data because it wastes a huge amount of computational power. Therefore, big data adopts the style of "operations go where the data exist" to reduce computational costs, which is done by using already well-established distributed and parallel computing technology [5]. Big data also differs from the traditional data paradigm. Traditional data warehouse approaches map data into a predefined schema and use a "Schema-on-Write" approach. When big data handles data, there is no predefined schema; instead, the required schema definition is retrieved from the data itself. Therefore, the big data approach can be considered a "Schema-on-Read" approach.
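As a hedged illustration of this contrast (the records and field names are made up), the Python sketch below validates records against a fixed schema before storing them (schema-on-write), while the schema-on-read path stores raw text as-is and applies a schema only when the data are queried.

import json

# Schema-on-write: validate against a predefined schema before storing (warehouse style)
SCHEMA = {"user_id": int, "action": str}

def write_validated(store, record):
    # Reject records that do not match the predefined schema
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"record does not match schema: {record}")
    store.append(record)

# Schema-on-read: keep raw lines untouched, interpret them only at query time
raw_store = ['{"user_id": 1, "action": "view"}', '{"user_id": 2, "clicks": 5}']

def read_with_schema(raw_lines, wanted_field):
    # Apply a schema lazily: extract a field if present, skip records that lack it
    for line in raw_lines:
        record = json.loads(line)
        if wanted_field in record:
            yield record[wanted_field]

validated = []
write_validated(validated, {"user_id": 1, "action": "view"})
print(validated)
print(list(read_with_schema(raw_store, "action")))  # ['view']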
In the information age, with the proliferation of data in every corner of the world, the sources of big data can be difficult to differentiate. Big data originates in the proliferation of social media, IoT, traditional operational systems and people's involvement. The sources of big data stated in [4] are IoT (Internet of Things) devices such as sensors, social networks such as Twitter, open data permitted to be used by governments or some business organizations (e.g. Twitter data), and crowdsourcing, which encourages people to provide and enter data, especially for massive-scale projects (e.g. census data). The popularity, major changes or new emergence of different organizations will create new sources of big data. For example, in the past, data from social media organizations such as Facebook and Twitter were not predicted to become a big data source. Currently, data from mobile phones handled by telecommunication companies, and IoT data from different scientific research efforts, have become important big data sources. In the future, transportation vehicles with machine-to-machine communication (data for automobile manufacturing firms) and data from smart cities with many interconnected IoT devices will become big data sources because of their involvement in people's daily lives.

Characteristics of Big Data


The most prominent features of big data are characterized as Vs. The first
three Vs of Big data are Volume for huge data amount, Variety for different
types of data, and Velocity for different data rate required by different kinds
of systems [6].
Volume: When the scale of the data surpasses traditional storage or processing techniques, this volume of data can generally be labeled as big data volume. Based on the type of organization, the data volume can vary from one place to another, from gigabytes to terabytes, petabytes, etc. [1]. Volume is the original characteristic behind the emergence of big data.
Variety: Includes structured data defined with a specific type and structure (e.g. string, numeric and other data types found in most RDBMS databases), semi-structured data which has no specific type but has some defined structure (e.g. XML tags, location data), unstructured data with no structure (e.g. audio, voice, etc.) whose structure has yet to be discovered [7], and multi-structured data which combines structured, semi-structured and unstructured features [7] [8]. Variety comes from the complexity of data from the different information systems of the target organization.
Velocity: Velocity means the data rate required by the application systems, based on the target organization's domain. The velocity of big data can be considered, in increasing order, as batch, near real-time, real-time and stream [7]. The bigger the data volume, the more challenges velocity will likely face. Velocity is one of the most difficult big data characteristics to handle [8].
As more and more organizations try to use big data, additional V characteristics have appeared one after another, such as value, veracity and validity. Value means that the data retrieved from big data must support the objectives of the target organization and should create surplus value for the organization [7]. Veracity should address confidentiality in the available data, providing the required data integrity and security. Validity means that the data must come from valid sources and be clean, because these big data will be analyzed and the results applied in the business operations of the target organization.
Another V of data is "viability", or the volatility of data. Viability means the time data need to survive, i.e., in a way, the data lifetime regardless of the systems. Based on viability, data in organizations can be classified as data with unlimited lifetime and data with limited lifetime. These data also need to be retrieved and used at some point in time. Viability is also the reason the volume challenge occurs in organizations.

Database Systems and Extract-Load-Transform (ELT) in Big Data
Traditional RDBMSs with ACID properties (Atomicity, Consistency, Isolation and Durability) are intended only for structured data; they cannot handle all the V requirements of big data and cannot provide horizontal scalability, availability and performance [9]. Therefore, NoSQL (not only SQL) databases need to be used according to the domains of the organizations, such as MongoDB and CouchDB for document databases, Neo4j for graph databases, the HBase columnar database for sparse data, etc. NoSQL databases use the BASE properties (Basically Available, Soft state, Eventual consistency). Because big data is based on parallel computing and distributed technology, the CAP (Consistency, Availability, and Partition tolerance) theorem affects big data technologies [10].
Data warehouses and data marts store valid and cleaned data by the
process of ETL (Extract-Transform-Load). Preprocessed, highly summarized
and integrated (Transformed) data are loaded into the data warehouses for
further usage [11]. Because of the heterogeneous sources of big data, the traditional transformation process imposes a huge computational burden. Therefore, big data systems first "load" all the data, and then transform only the required data based on the needs of the systems in the organizations; the process thus changes into Extract-Load-Transform. As a result, new ideas like the "data lake" have also emerged, which try to store all data generated by an organization and go beyond the power of data warehouses and data marts, although critics warn of them becoming a "data swamp" [12].
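As a small, hedged illustration of the ETL versus ELT ordering described above (the record fields and the cleaning step are invented for the example), the two functions below apply the same transformation at different points: before storage in ETL, and after loading, only for the slice actually needed, in ELT.

raw_records = [
    {"id": 1, "amount": "10.5", "channel": "web"},
    {"id": 2, "amount": "bad-value", "channel": "store"},
    {"id": 3, "amount": "7.0", "channel": "web"},
]

def transform(record):
    # Cleaning step: keep only records whose amount parses as a number
    try:
        return {**record, "amount": float(record["amount"])}
    except ValueError:
        return None

def etl(records):
    # Extract-Transform-Load: clean everything up front, store only the result
    cleaned = []
    for r in records:
        t = transform(r)
        if t is not None:
            cleaned.append(t)
    return cleaned

def elt(records, channel):
    # Extract-Load-Transform: load everything as-is into a "lake" first ...
    lake = list(records)
    # ... then transform only the slice a particular analysis needs
    result = []
    for r in lake:
        if r["channel"] == channel:
            t = transform(r)
            if t is not None:
                result.append(t)
    return result

print(etl(raw_records))          # all cleaned records
print(elt(raw_records, "web"))   # only the web records, cleaned on demand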

BIG DATA IN ORGANIZATIONS AND INFORMATION SYSTEMS
Many different kinds of organizations are now applying and implementing big data in various types of information systems based on their organizational needs. Information systems emerged according to the requirements of organizations, which are based on what organizations do, how they do it, and their organizational goals. According to Mintzberg, five different kinds of organization can be classified based on the organization's structure, shape and management: (1) the entrepreneurial structure, a small startup firm; (2) the machine bureaucracy, a medium-sized manufacturing firm with a definite structure; (3) the divisionalized bureaucracy, a multi-national organization which produces different kinds of products controlled by a central headquarters; (4) the professional bureaucracy, an organization that relies on the efficiency of individuals, such as law firms and universities; and (5) the adhocracy, such as a consulting firm. Different kinds of information systems are required based on the work the target organization does.
The information systems required by an organization, and the nature of the problems within them, reflect the type of organizational structure. Systems are structured procedures for the regulation of the organization, limited by the organizational boundary. This boundary expresses the relationship between a system and its environment (the organization). Information systems collect and redistribute data within the internal operations of the organization and its environment using three basic procedures: inputting data, performing processing and outputting information. Linking the organization and its systems are "business processes", which are logically related tasks with formal rules to accomplish a specific piece of work and which need to be coordinated throughout the organizational hierarchy [2]. These organizational theories remain true regardless of old or newly evolving data methodologies.

Relationship between Organization and Information Systems


The relationship between organizations and information systems is called the socio-technical effect. The socio-technical model suggests that all of these components (organizational structure, people, job tasks and Information Technology (IT)) must be changed simultaneously to achieve the objectives of the target organization and its information systems [2]. Sometimes these changes can result in changing business goals, relationships with people and business processes for the target organization, blur organizational boundaries and cause the flattening of the organization [1] [2]. Big data transforms traditional siloed information systems in organizations into digital nervous systems, with information flowing in and out of related organizational systems. Organizational resistance to change needs to be considered in every implementation of information systems. The most common reason for the failure of large projects is not the failure of the technology, but organizational and political resistance to change [2]. Big data projects need to avoid this kind of mistake and be implemented not only from the information system perspective but also from the organizational perspective.

Implementing Big Data Systems in Organizations


The work in [13] provides a layered view of a big data system. To make the complexity of a big data system simpler, it can be decomposed into a layered structure according to a conceptual hierarchy. The layers are the "infrastructure layer" with raw ICT resources; the "computing layer", which encapsulates various data tools into a middleware layer that runs over the raw ICT resources; and the "application layer", which exploits the interfaces provided by the programming models to implement various data analysis functions and develop field-related applications for different organizations.
Different scholars are considering the system development life cycle of big data projects. Based on IBM's three phases for building big data projects, the work in [4] proposed a holistic view for implementing big data projects.
Phase 1. Planning: Involves Global Strategy Elaboration where the
main idea is that the most important thing to consider is not technology but
business objectives.
Phase 2. Implementation: This stage is divided into 1) data collection from the major big data sources; 2) data preprocessing, i.e. data cleaning for valid data, integrating different data types and sources, transformation (mapping data elements from source to destination systems) and reducing data into a smaller structure (sometimes with data discretization as part of it); 3) smart data analysis, i.e. using advanced analytics to extract value from a huge set of data, applying advanced algorithms to perform complex analytics on either structured or unstructured data; and 4) representation and visualization for guiding the analysis process and presenting the results in a meaningful way.
Phase 3. Post-implementation: This phase involves 1) an actionable and timely insight extraction stage, based on the nature of the organization and the value the organization is seeking, which decides the success or failure of the big data project; and 2) an evaluation stage that evaluates the big data project, where it is stated that diverse data inputs, their quality and the expected results need to be considered.
Based on this big data project life cycle, organizations can develop their own big data projects. The best way to implement big data projects is to use technologies from both before and after big data, e.g. to use both Hadoop and the data warehouse, because they complement each other. The US government considers "all contents as data" when implementing big data projects. In the digital era, data has the power to change the world and needs careful implementation.

Big Data Core Techniques for Organizations


There are generally two types of processing in big data, batch processing and real-time processing, based on the domain nature of the organization. The foundation of big data technology is the MapReduce model [14] by Google, built for processing batch workloads of their user data; it is based on a scale-out model over commodity servers. Later, real-time processing models such as Twitter's Storm and Yahoo's S4 appeared because of the near real-time, real-time and stream processing requirements of organizations.
The core of the MapReduce model is the power of the "divide and conquer" method, distributing jobs over clusters of commodity servers in two steps (Map and Reduce) [14]. Jobs are divided and distributed over the clusters, and the completed jobs (intermediate results) from the Map phase are sent to the Reduce phase to perform the required operations. In the MapReduce paradigm, the Map function performs filtering and sorting and the Reduce function carries out grouping and aggregation operations. There are many implementations of the MapReduce algorithm, both open source and proprietary. Among the open source frameworks, the most prominent one is "Hadoop", with two main components, the "MapReduce Engine" and the "Hadoop Distributed File System (HDFS)". In an HDFS cluster, files are broken into blocks that are stored in the DataNodes; the NameNode maintains the metadata of these file blocks and keeps track of the operations of the DataNodes [7]. MapReduce provides scalability through distributed execution and reliability by reassigning failed jobs [9]. Beyond the MapReduce Engine and HDFS, Hadoop has a wide ecosystem, such as Hive for warehousing, Pig for queries, YARN for resource management, Sqoop for data transfer, Zookeeper for coordination, and many others. The Hadoop ecosystem will continue to grow as new big data systems appear according to the needs of different organizations.
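A minimal sketch of the Map and Reduce roles described above is given below as a plain-Python simulation of the usual word-count example; it stands in for a framework such as Hadoop rather than using its actual API, and the input "blocks" are invented.

from collections import defaultdict
from itertools import chain

# Toy input split across "blocks", standing in for HDFS file blocks
blocks = ["big data for organizations", "big data big value"]

def map_phase(block):
    # Map: emit intermediate key-value pairs, here (word, 1)
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between the phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values, here a simple sum
    return key, sum(values)

intermediate = chain.from_iterable(map_phase(b) for b in blocks)
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 3, 'data': 2, 'for': 1, 'organizations': 1, 'value': 1}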
Organizations with an interactive nature and strict response-time requirements need real-time processing. Although MapReduce is the dominant batch processing model, real-time processing models are still competing with each other, each with its own competitive advantages.

"Storm" is a prominent big data technology for real-time processing; its most famous user is Twitter. Different from MapReduce, Storm uses a topology, which is a graph of spouts and bolts connected by stream groupings. Storm consumes data streams, which are unbounded sequences of tuples, splits the consumed streams, and processes the split data streams. The processed data stream is consumed again, and this process is repeated until the operation is halted by the user. A spout acts as a source of streams in a topology, and a bolt consumes streams and produces new streams, with both executing in parallel [15].
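The sketch below is a plain-Python simulation of the spout and bolt roles described above, not Storm's actual API: a generator stands in for a spout emitting tuples, and ordinary functions stand in for bolts that split and count the unbounded stream.

from collections import Counter
from itertools import islice

def sentence_spout():
    # Spout stand-in: emits an unbounded stream of sentence tuples
    while True:
        yield ("big data streams",)
        yield ("streams of big data",)

def split_bolt(stream):
    # Bolt stand-in: splits each sentence tuple into word tuples
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream, counts):
    # Bolt stand-in: keeps a running count per word and emits (word, count) tuples
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

counts = Counter()
pipeline = count_bolt(split_bolt(sentence_spout()), counts)
# A real topology runs until halted by the user; here we just take 12 emitted tuples
for emitted in islice(pipeline, 12):
    print(emitted)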
There are other real-time processing tools for Big Data such as Yahoo’s
S4 (Simple Scalable Streaming System) which is based on the combination
of actor models and MapReduce model. S4 works with Processing Elements
(PEs) that consume the keyed data events. Messages are transmitted between
PEs in the form of data events. Each PE’s state is inaccessible to other
PEs and event emission and consumption is the only mode of interaction
between PEs. Processing Nodes (PN) are the logical hosts of PEs which are
responsible for listening to events, executing operations on the incoming events, dispatching events with the assistance of the communication layer, and emitting output events [16]. There is no specific winner among stream processing models, and organizations can use the data models that are consistent with their work.
Regardless of batch or real-time processing, there are many open source and proprietary software frameworks for big data. Open source big data frameworks include Hadoop, HPCC (High Performance Computing Cluster), etc. [7]. Many other proprietary big data tools, such as IBM BigInsights, Accumulo and Microsoft Azure, have been successfully used in many business areas of different organizations. Now, big data tools and libraries are also available in languages such as Python and R for many different kinds of organizations.

CONCLUSION
Big data is a very wide and multi-disciplinary field which requires collaboration across different research areas and organizations from various sources. Big data may change the traditional ETL process into an Extract-Load-Transform (ELT) process, as big data gives more advantages in moving algorithms near where the data exist. Like other information systems, the success of big data projects depends on organizational resistance to change: organizational structure, people, tasks and information technologies need to change simultaneously to get the desired results. Based on the layered view of big data [13], big data projects can be implemented with a step-by-step roadmap [4]. Big data sources will vary based on the past, present and future of organizations and information systems. Big data has the power to change the landscape of organizations and information systems because its nature is uniquely different from traditional paradigms. Using big data technologies can give organizations an overall advantage with better efficiency and effectiveness. The future of big data will be the digital nervous system of the organization, where every possible system needs to consider big data as a must-have technology. The data age is coming now.

ACKNOWLEDGEMENTS
I want to express my gratitude to my supervisor, Professor Wang Zhao Shun, for encouraging me and giving suggestions for improving my paper.

REFERENCES
1. Manyika, J., et al. (2011) Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, San Francisco, CA, USA.
2. Laudon, K.C. and Laudon, J.P. (2012) Management Information
Systems: Managing the Digital Firm. 13th Edition, Pearson Education,
US.
3. House, W. (2012) Fact Sheet: Big Data across the Federal Government.
4. Mousanif, H., Sabah, H., Douiji, Y. and Sayad, Y.O. (2014) From
Big Data to Big Projects: A Step-by-Step Roadmap. International
Conference on Future Internet of Things and Cloud, 373-378
5. Oracle Enterprise Architecture White Paper (March 2016) An Enterprise
Architect’s Guide to Big Data: Reference Architecture Overview.
6. Laney, D. (2001) 3D Data Management: Controlling Data Volume,
Velocity and Variety, Gartner Report.
7. Sagiroglu, S. and Sinanc, D. (2013) Big Data: A Review. International
Conference on Collaboration Technologies and Systems (CTS), 42-47.
8. de Roos, D., Zikopoulos, P.C., Melnyk, R.B., Brown, B. and Coss, R.
(2012) Hadoop for Dummies. John Wiley & Sons, Inc., Hoboken, New
Jersey, US.
9. Grolinger, K., Hayes, M., Higashino, W.A., L'Heureux, A., Allison, D.S. and Capretz, M.A.M. (2014) Challenges of MapReduce in Big Data. IEEE 10th World Congress on Services, 182-189.
10. Hurwitz, J.S., Nugent, A., Halper, F. and Kaufman, M. (2012) Big Data
for Dummies, 1st Edition, John Wiley & Sons, Inc, Hoboken, New
Jersey, US.
11. Han, J., Kamber, M. and Pei, J. (2006) Data Mining: Concepts and
Techniques. 3rd Edition, Elsevier (Singapore).
12. Data Lake. https://en.m.wikipedia.org/wiki/Data_lake
13. Hu, H., Wen, Y.G., Chua, T.-S. and Li, X.L. (2014) Toward Scalable
Systems for Big Data Analytics: A Technology Tutorial. IEEE Access,
2, 652-687. https://doi.org/10.1109/ACCESS.2014.2332453
14. Dean, J. and Ghemawat, S. (2008) MapReduce: Simplified Data
Processing on Large Clusters. Commun ACM, 107-113. https://doi.
org/10.1145/1327452.1327492

15. Storm Project. http://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html
16. Neumeyer, L., Robbins, B., Nair, A. and Kesari, A. (2010) S4:
Distributed Stream Computing Platform. 2010 IEEE International
Conference on Data Mining Workshops (ICDMW). https://doi.
org/10.1109/ICDMW.2010.172
Chapter 10

Application Research of Big Data Technology in Audit Field

Guanfang Qiao
WUYIGE Certified Public Accountants LLP, Wuhan, China

ABSTRACT
The era of big data has brought great changes to various industries, and the innovative application of big data-related technologies shows obvious advantages. The introduction and application of big data technology in the audit field has also become a future development trend. Compared with the traditional mode of audit work, the application of big data technology can help achieve better results, which requires the adaptive transformation and adjustment of audit work. This paper makes a brief analysis of the application of big data technology in the audit field: it first introduces the characteristics of big data and its technical application, then points out the new requirements for audit work in the era of big data, and finally discusses how to apply big data technology in the audit field, hoping that it can serve as a reference.

Citation: Qiao, G. (2020), “Application Research of Big Data Technology in Audit


Field”. Theoretical Economics Letters, 10, 1093-1102. doi: 10.4236/tel.2020.105064.
Copyright: © 2020 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0

Keywords: Big Data, Technology, Audit, Application

INTRODUCTION
With the rapid development of information technology in today's world, the amount of data is getting larger and larger, which presents the characteristics of big data. Big data refers to a collection of data that cannot be captured, managed and processed by conventional software tools within a certain time. It is a massive, high-growth, diversified information asset that requires new processing models to provide greater decision-making power, insight and process optimization capabilities. Against the background of the big data era, all walks of life should actively adapt to it in order to bring about positive change. With the development of the new social economy, audit work faces higher requirements. Traditional audit methods and concepts have difficulty adapting, and many problems and defects can easily appear. So positive changes should be made, and the proper and scientific integration of big data technology is an effective measure that deserves high attention. Of course, the application of big data technology in the field of audit does face considerable difficulties; for example, the development of audit software and the establishment of audit analysis models need to be adjusted at multiple levels in order to give full play to the application value of big data technology.

OVERVIEW OF BIG DATA TECHNOLOGY


Big data technology is a related technical means emerging with the
development of big data era. It mainly involves big data platform, big data
index system and other related technologies, and has been well applied
in many fields. Big data refers to massive data that cannot be observed
and used directly; acquiring, storing, analyzing and applying such data is
difficult, yet its application value is significant, and it has become an
important focus of attention in the current information age. From the
point of view of big data itself, in addition
to the obvious characteristics of large amount, it is often characterized by
obvious diversity, rapidity, complexity and low value density. Therefore,
it is inevitable to bring great difficulty to the application of these massive
data, and it puts forward higher requirements for the application of big data
technology, which needs to be paid high attention to (Ingrams, 2019).

In the era of big data, the core issue is not obtaining massive data but
analyzing and processing it professionally so that it can play its due
role and deliver value.
In this way, it is necessary to strengthen the research on big data technology,
so that all fields can realize the optimization analysis and processing of
massive data information with the assistance of big data technology, and
meet the original application requirements. In terms of the development
and application of current big data technologies, data mining technology,
massively parallel processing database, distributed database, extensible
storage system and cloud computing technology are commonly used. These
big data technologies can be effectively applied to the massive information
acquisition, storage, analysis and management.
Big data has ushered in a major era of transformation, which has changed
our lives, work and even our thinking. More and more industries maintain
a very optimistic attitude towards the application of big data, and more and
more users are trying or considering how to use similar big data to solve the
problem, so as to improve their business level. With the gradual advance of
digitization, big data will become the fourth strategy that enterprises can
choose after the three traditional competitive strategies of cost leadership,
differentiation and concentration.

REQUIREMENTS ON AUDITING IN THE ERA OF BIG DATA

Change Audit Objectives


In the era of big data development, in order to better realize the flexible
application of big data technology in the auditing field, it is necessary to pay
high attention to the characteristics of the era of big data development, and
it requires adaptive transformation so as to create good conditions for the
application of big data technology. With the development of big data era,
the audit work should first pay attention to the effective transformation of
its own audit objectives, and it is necessary to gradually broaden the tasks
of audit work in order to improve the value of audit work and give play to
the application value of big data technology. In addition to finding
abnormal clues in the audited target and controlling illegal behaviors,
audit work also needs to promote the development of the target, which
requires providing optimization assistance for the relevant operating
systems, playing an active role in risk assessment and benefit
improvement, better exploring the laws of development, and then serving
as a reference in decision analysis (Gepp, 2018).

Change Audit Content


With the development of big data era, the development and transformation
of audit field also need to focus on the specific audit content. The change of
audit content is also the basic requirement of applying big data technology.
It is necessary to select appropriate big data technology centering on audit
content. Under the background of big data, audit work is often faced with
more complicated content, which involves not only the previously simple data
information such as amount and various expenses, but also more complicated
text information, audio information and video information. As the content
is more abundant, it will inevitably increase the difficulty of analysis and
processing, which puts forward higher requirements for the application of
big data technology. Of course, this also requires the collection of rich and
detailed massive data information as much as possible in the future audit
work, so as to better complete the audit task and achieve the audit objectives
mentioned above with the assistance of big data technology (Alles, 2016).

Change Audit Thinking


Under the background of development of big data era, the development of
audit field should also focus on the change of thinking, which is also the key
premise to enable audit staff to actively adapt to the new situation. Only by
ensuring that audit staff have new audit thinking, can they flexibly apply big
data technology, optimize and deal with rich practical contents in the era of
big data, and finally better enhance the audit value. Specifically, the change
in audit thinking is mainly reflected in the following aspects: First of all,
audit staff need to move from sampling-based auditing to comprehensive
auditing, analyzing all available information to avoid omissions.
Secondly, the requirement for precision should be gradually relaxed in
the audit process (Shukla, 2019), because big data has a relatively low
value density, which is likely to affect the accuracy of individual data
points, so the data need to be handled with appropriate big data
technology. In addition, the change of
audit thinking also needs to change from the original causal relationship to
the correlation relationship, which requires that emphasis should be placed
on exploring the correlation between different factors and indicators, so as
to provide reference for decision-making and other work.

Change Audit Techniques


With the development of the big data era, the transformation of the
audit field also needs to be embodied at the technical level. Because
audit content is more complex and involves many data types, traditional
audit technology can no longer deliver satisfactory results, so the
technical means need to be innovated and optimized. Based on the higher
requirements of audit work under this new situation, the relevant
analysis technologies should meet the following conditions. First, audit
technology should be suitable for analyzing and processing complex data,
and should be able to analyze many different types of data information
comprehensively, so that unsuitable analysis techniques are avoided.
Secondly, audit technology should be intuitive and make use of visual
analysis, so that audit results can be presented more clearly for
reference and application. In addition, in the
era of big data, the application of technologies related to audit work often
needs to pay attention to data mining, which requires mining valuable clues
from massive data information and significantly improving the speed of data
mining and analysis to meet the characteristics of big data.

APPLICATION OF BIG DATA TECHNOLOGY IN AUDIT FIELD

Data Mining Analysis


The application of big data technology in the audit field should focus on
data mining analysis, which is obviously different from the data verification
analysis under the traditional audit mode, and can better realize the efficient
application of data information. In the previous audit work, random sampling
was usually conducted on the collected financial data and information, and
then the samples were checked and proofread one by one to verify whether
there were obvious abnormal problems, which mainly involved query analysis,
multi-dimensional analysis and other means. However, with the application
of big data technology, data warehouse, data mining, prediction and analysis
and other means can be better used to realize the comprehensive analysis
and processing of massive data information, in order to better explore the
laws of corresponding data information (Harris, 2020). The commonly used
methods include classification analysis, correlation analysis, cluster analysis
and sequence analysis. Based on the transformation of data analysis brought
by big data technology application, the value of the audit work can be further
promoted. Audit work is no longer confined to problem verification but
tries to explore further relationships between data, so that the data
deliver greater application value based on the correlations found and are
not wasted (Sookhak, 2017). For example, in
the financial loan audit, such data mining analysis technology can be fully
utilized to realize the classification analysis of all data information in the
loan, so as to better identify the difference between non-performing loans
and normal loans, and provide reference for the follow-up loan business.
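As a rough illustration of the classification analysis described above, the sketch below trains a decision tree to separate hypothetical non-performing loans from normal ones; the field names, toy data, and library choice (scikit-learn) are assumptions made for illustration, not the procedure used in actual audit practice.

# Illustrative sketch only: classify loans as non-performing vs. normal
# with a decision tree. All field names and values are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

loans = pd.DataFrame({
    "amount":         [50000, 120000, 8000, 200000, 15000, 90000],
    "term_months":    [12, 36, 6, 60, 12, 24],
    "monthly_income": [4000, 6500, 2000, 9000, 2500, 5200],
    "overdue_days":   [0, 45, 5, 120, 0, 60],
    "non_performing": [0, 1, 0, 1, 0, 1],   # 1 = non-performing loan
})

X = loans.drop(columns="non_performing")
y = loans["non_performing"]
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Classify a new (hypothetical) loan application.
new_loan = pd.DataFrame({"amount": [70000], "term_months": [24],
                         "monthly_income": [3000], "overdue_days": [50]})
print(model.predict(new_loan))   # 1 here would suggest a loan worth closer review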

Real-Time Risk Warning


The application of big data technology in the field of audit also has the effect
of risk prevention, which is more convenient and efficient to analyze and
identify possible risk factors, so as to give timely warning, avoid the risk
problems leading to major accidents, and realize the control of economic
losses. This is also obviously better than the traditional audit work mode.
In the past, audit work often only found problems and gave feedback to
some clues of violation, but it was difficult to realize risk warning. In the
application of big data technology, the characteristics of sustainability are
usually reflected. It can dynamically analyze and process the constantly rich
and updated data, so that it can continuously monitor and dynamically grasp
the change status of the audit target, then give timely feedback and early
warning to the abnormal problems so as to remind the relevant personnel to
take appropriate measures to prevent the problems. Therefore, the follow-up
audit in the future audit field needs to be gradually promoted, so as to better
realize the optimization of audit work with the help of big data
technology. The follow-up audit mode usually places higher requirements
on the data analysis platform: attention must be paid to technical
innovation, a comprehensive audit data analysis platform must be
established, and means such as the Internet and information systems must
be used to create favorable conditions for applying big data technology
and to avoid hidden problems in data collection. In land tax auditing,
for example, the application value of the follow-up audit mode is
outstanding. The relevant staff typically need full access to huge
amounts of provincial land tax information, which is linked across
sources and updated in real time; they can then carry out follow-up
audit analysis, identify abnormal problems in time, and give timely
warnings so that these problems are resolved.
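One way to picture the continuous monitoring and early warning described above is a simple rolling-statistics rule that flags values far outside recent behaviour; the series, window size, and threshold below are hypothetical and only sketch the idea, not the platform used in practice.

# Illustrative sketch: flag audit indicator values that deviate sharply
# from their recent history (a rolling z-score rule on hypothetical data).
import pandas as pd

daily_receipts = pd.Series([100, 102, 98, 101, 99, 103, 250, 100, 97, 102])

window = 5
# Statistics over the *previous* `window` observations only.
mean_prev = daily_receipts.rolling(window).mean().shift(1)
std_prev = daily_receipts.rolling(window).std().shift(1)
z_scores = (daily_receipts - mean_prev) / std_prev

alerts = daily_receipts[z_scores.abs() > 3]   # threshold chosen arbitrarily
print("values flagged for follow-up review:")
print(alerts)   # flags the injected spike of 250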

Multi-Domain Data Fusion


Of course, in order to better play the application value of big data technology
in the audit field, it is necessary to pay attention to the extensive collection
and sorting of data information, which requires making comprehensive
analyses and judgments from as many angles as possible to avoid the
impact of incomplete data and information on the analysis results. With the
application of big data technology, audit work often involves the cross-
analysis and application of multiple different database information, thus
posing higher challenges to big data technology. It is necessary to ensure that
it has cross-database analysis ability, and can use appropriate and reasonable
analysis tools to better analyze and identify possible abnormal problems.
Therefore, it is necessary to pay attention to the fusion and application of
multi-domain data, which requires the comprehensive processing of multiple
databases, so as to obtain richer and more detailed analysis results and play
a stronger role in subsequent applications. For example, in order to analyze
and clarify China’s macroeconomic and social risks, it is often necessary
to comprehensively analyze government debt audit data, macroeconomic
operation data, social security data and financial industry data, etc., so as to
obtain more accurate results and conduct early risk warning. In the economic
responsibility audit, it puts forward higher requirements for data fusion in
multiple fields, and needs to obtain corresponding data information from
finance, social security, industry and commerce, housing management,
tax, public security and education, then integrate the data information
effectively, and define the economic responsibility by means of horizontal
correlation analysis and vertical comparison analysis in order to optimize
the adjustment (Xiao, 2020).
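The multi-domain fusion described above can be pictured as joining tables from different domains on a shared entity key and then running a simple cross-domain consistency check; the table names, fields, and threshold below are hypothetical.

# Illustrative sketch: fuse hypothetical finance, tax, and social security
# records on a shared entity identifier, then flag cross-domain gaps.
import pandas as pd

finance = pd.DataFrame({"entity_id": [1, 2, 3],
                        "reported_revenue": [500, 800, 300]})
tax = pd.DataFrame({"entity_id": [1, 2, 3],
                    "taxable_revenue": [480, 790, 120]})
social = pd.DataFrame({"entity_id": [1, 2, 3],
                       "registered_employees": [20, 35, 5]})

fused = (finance
         .merge(tax, on="entity_id", how="inner")
         .merge(social, on="entity_id", how="inner"))

# Horizontal correlation check: a large gap between reported and taxable
# revenue marks an entity for closer audit attention.
fused["revenue_gap"] = fused["reported_revenue"] - fused["taxable_revenue"]
print(fused[fused["revenue_gap"] > 100])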

Build a Large Audit Team


The application of big data technology in audit field is not only the
innovation transformation at the technical level, but also the transformation
from multiple perspectives such as organizational mode and personnel
structure, so as to better adapt to this new situation and avoid serious defects
in audit work. For example, previous audit work was often carried out in
isolation: although simple audits conducted by each department as a unit
can find some problems, it is difficult for them to produce a
comprehensive and detailed audit result. The audit value is obviously
limited and needs to be innovated and adjusted with big data technology.

Based on this, the construction of large audit group has become an important
application mode. The future audit work should rely on the large audit
group to divide the organizational structure from different functions such
as leadership decision-making, data analysis and problem verification, so as
to realize the orderly promotion of the follow-up audit work. For example,
the establishment of a leading group could facilitate the implementation of
the audit plan, to achieve leadership decisions for the entire audit work. For
the analysis of massive data information, the data analysis group is required
to make full analysis of the target with the help of rich and diverse big
data technologies, so as to find clues and problems and explore rules and
relationships. However, the clues and rules discovered need to be further
analyzed by the problem verification team and verified in combination with
the actual situation, so as to complete the audit task (Castka et al., 2020).
The application of this large audit team mode can give full play to the
application value of big data technology, avoid the interference brought
by organizational factors, and become an important trend of optimization
development in the audit field in the future. Of course, in order to give full
play to the application value of the large audit team, it is often necessary
to focus on the optimization of specific audit staff to ensure that all audit
staff have a higher level of competence. Audit staff not only need to master
and apply big data-related technical means, but also need to have big data
thinking, realize the transformation under this new situation, and avoid
obstacles brought by human problems. Based on this, it is of great importance
to provide the necessary education and training for audit staff, with
detailed explanations of big data concepts, technologies and the new
audit mode, so that they can better adapt to the new situation.

Data Analysis Model and Audit Software Development


In the current development of audit field, as an important development
trend, the application of big data technology does show obvious advantages
in many aspects, and also can play good functions. However, due to the
complex audit work, the application of big data technology is bound to have
the characteristics of keeping pace with the times and being targeted, so
as to better improve its service value. Based on this, the application of big
data technology in future audit work should focus on the development of
data analysis model and audit professional software, so as to create good
application conditions for the application of big data technology. First
of all, in-depth and comprehensive analysis in the audit field requires a
comprehensive grasp of all audit objectives and tasks involved in the audit
industry, so that the corresponding data analysis models and
special-purpose software can be developed purposefully and applied more
efficiently and conveniently in subsequent audit work. For example, the query analysis, mining
analysis and multi-dimensional analysis involved in the audit work need to
be matched with the corresponding data analysis model in order to better
improve the audit execution effect. In the development and application of
audit software, it is necessary to take into account the various functions. For
example, in addition to discovering and clarifying the defects existing in the
audit objectives, it is also necessary to reflect the risk warning function, so
as to better realize the audit function and highlight the application effect of
big data technology.

Cloud Audit
The application of big data technology in the auditing field is also developing
towards cloud auditing, which is one of the important manifestations of
the development of big data era. From the application of big data related
technologies, it is often closely related to cloud computing, and they are
often difficult to be separated. In order to better use big data technology, it
is necessary to rely on cloud computing mode to better realize distributed
processing, cloud storage and virtualization processing, facilitate the
efficient use of massive data information, and solve the problem of data
information. Based on this, the application of big data technology in audit
field should also pay attention to the construction of cloud audit platform
in the future, to better realize the optimization and implementation of
audit work. In the construction of cloud audit, it is necessary to make full
use of big data technology, intelligent technology, Internet technology
and information means to realize the orderly storage and analysis and
application of massive data information, and at the same time pay attention
to the orderly sharing of massive data information, so as to better enhance
its application value. For example, for the comprehensive analysis of the
above mentioned cross-database data information, cloud audit platform can
be used to optimize the processing. The overall analysis and processing
efficiency is higher, which can effectively meet the development trend of
the increasing difficulty of the current audit. Of course, the application of
cloud audit mode can also realize the remote storage and analysis of data
information, which obviously improves the convenience of audit work,
breaks the limitation of original audit work on location, and makes the data
sharing effect of relevant organizations stronger, thus solving the problem of
isolated information (Appelbaum, 2018).

RISK ANALYSIS OF BIG DATA AUDIT


Although big data audit plays an important role in improving the audit
working mode and efficiency, there are still risks in data acquisition,
use and management that need attention:

Data Acquisition and Collation Risks


Data acquisition risks are mainly reflected in two aspects: On the one hand,
there is a lack of effective means to verify the data of the auditee, and the
integrity and authenticity of the data cannot be guaranteed, which can only
be verified through the later extended investigation. On the other hand, the
quality of collected data is not good, and a large number of invalid data
will seriously affect the quality of data analysis. In addition, data collected
outside the auditees, such as network media and social networking sites,
also have high data risks. In terms of data collation, many audit institutions
have collected data from a number of industries, but the data standards
and formats differ across industries. Even within the same industry, the
data formats used by organizations vary widely. In the absence of a
unified audit data standard table, data collation is difficult, and
multi-domain data association analysis remains hard to apply in practice.

Data Analysis and Usage Risks


The risk of data analysis is mainly reflected in the analytical thinking and
methods of auditors. When auditors are not familiar with the business and
have weak data modeling ability, they are likely to make logic errors in the
actual analysis, resulting in the deviation of data analysis results. In terms of
data usage, due to the influence of factors such as data authenticity, integrity
and logical association of data tables, the data analysis results often
deviate greatly from the actual situation. Using the analysis results
directly therefore carries considerable risk, so auditors need to be cautious.

Data Management Risk


Data management risks are mainly manifested as data loss, disclosure and
destruction in the process of storage and transmission. The data collected
during auditing involves information of many industries. The loss and
disclosure of data will cause great losses to relevant units, and at the same
time, it will also have a negative impact on the authority and credibility
of audit institutions. Among them, the most important data management
risk is the management of data storage equipment, such as loss of auditors’
computers and mobile storage media, weak disaster prevention ability of
computer room equipment, insufficient data network encryption, etc., which
should be the key areas of attention to prevent data management risk.

CONCLUSION
In a word, the introduction and application of big data technology
has become an important development trend in the current innovative
development of audit field in China. With the introduction and application
of big data, audit work does show obvious advantages with more prominent
functions. Therefore, it is necessary to explore the integration of big data
technology in the audit field from multiple perspectives in the future, and
strive to innovate and optimize the audit concept, organizational structure,
auditors and specific technologies in order to create good conditions for
the application of big data technology. This paper mainly discusses the
transformation that big data technology brings to the traditional audit
work mode and its specific applications. However, as big data has only
recently been applied in the audit field, the research is inevitably
preliminary. With the development of global economic integration,
multi-directional and multi-field data fusion will make audit work more
complex, so big data audit will become the norm and provide better
support for decision-making.

REFERENCES
1. Alles, M., & Gray, G. L. (2016). Incorporating Big Data in Audits:
Identifying Inhibitors and a Research Agenda to Address Those
Inhibitors. International Journal of Accounting Information Systems,
22, 44-59. https://doi.org/10.1016/j.accinf.2016.07.004
2. Appelbaum, D. A., Kogan, A., & Vasarhelyi, M. A. (2018). Analytical
Procedures in External Auditing: A Comprehensive Literature Survey
and Framework for External Audit Analytics. Journal of Accounting
Literature, 40, 83-101. https://doi.org/10.1016/j.acclit.2018.01.001
3. Castka, P., Searcy, C., & Mohr, J. (2020). Technology-Enhanced
Auditing: Improving Veracity and Timeliness in Social and
Environmental Audits of Supply Chains. Journal of Cleaner Production,
258, Article ID: 120773. https://doi.org/10.1016/j.jclepro.2020.120773
4. Gepp, A., Linnenluecke, M. K., O’Neill, T. J., & Smith, T. (2018). Big
Data Techniques in Auditing Research and Practice: Current Trends
and Future Opportunities. Journal of Accounting Literature, 40, 102-
115. https://doi.org/10.1016/j.acclit.2017.05.003
5. Harris, M. K., & Williams, L. T. (2020). Audit Quality Indicators:
Perspectives from Non-Big Four Audit Firms and Small Company
Audit Committees. Advances in Accounting, 50, Article ID: 100485.
https://doi.org/10.1016/j.adiac.2020.100485
6. Ingrams, A. (2019). Public Values in the Age of Big Data: A Public
Information Perspective. Policy & Internet, 11, 128-148. https://doi.
org/10.1002/poi3.193
7. Shukla, M., & Mattar, L. (2019). Next Generation Smart Sustainable
Auditing Systems Using Big Data Analytics: Understanding the
Interaction of Critical Barriers. Computers & Industrial Engineering,
128, 1015-1026. https://doi.org/10.1016/j.cie.2018.04.055
8. Sookhak, M., Gani, A., Khan, M. K., & Buyya, R. (2017).
WITHDRAWN: Dynamic Remote Data Auditing for Securing Big
Data Storage in Cloud Computing. Information Sciences, 380, 101-
116. https://doi.org/10.1016/j.ins.2015.09.004
9. Xiao, T. S., Geng, C. X., & Yuan, C. (2020). How Audit Effort Affects
Audit Quality: An Audit Process and Audit Output Perspective. China
Journal of Accounting Research, 13, 109-127. https://doi.org/10.1016/j.
cjar.2020.02.002
SECTION 3: DATA MINING METHODS
Chapter 11

A Short Review of Classification Algorithms Accuracy for Data Prediction in Data Mining Applications

Ibrahim Ba’abbad, Thamer Althubiti, Abdulmohsen Alharbi, Khalid Alfarsi, Saim Rasheed
Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, KSA.

ABSTRACT
Many business applications rely on their historical data to predict their
business future. The marketing products process is one of the core processes
for the business. Customer needs give a useful piece of information that
helps to market the appropriate products at the appropriate time. Moreover,
services have recently come to be considered as products. The development
of education and health services depends on historical data. Furthermore,
reducing online social media network problems and crimes needs a
significant source of information. Data analysts need to use an efficient
classification algorithm to predict the future of such businesses.
However, dealing with a
huge quantity of data requires great time to process. Data mining involves
many useful techniques that are used to predict statistical data in a variety
of business applications. The classification technique is one of the most
widely used with a variety of algorithms. In this paper, various classification
algorithms are reviewed in terms of accuracy in different areas of data mining
applications. A comprehensive analysis is made after a detailed reading of
20 papers in the literature. This paper aims to help data analysts to choose
the most suitable classification algorithm for different business applications
including business in general, online social media networks, agriculture,
health, and education. Results show FFBPN is the most accurate algorithm
in the business domain. The Random Forest algorithm is the most accurate in
classifying online social networks (OSN) activities. Naïve Bayes algorithm
is the most accurate to classify agriculture datasets. OneR is the most
accurate algorithm to classify instances within the health domain. The C4.5
Decision Tree algorithm is the most accurate to classify students’ records to
predict degree completion time.

Citation: Ba’abbad, I., Althubiti, T., Alharbi, A., Alfarsi, K. and Rasheed, S. (2021), “A Short Review of Classification Algorithms Accuracy for Data Prediction in Data Mining Applications”. Journal of Data Analysis and Information Processing, 9, 162-174. doi: 10.4236/jdaip.2021.93011.
Copyright: © 2021 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0.

Keywords: Data Prediction Techniques, Accuracy, Classification Algorithms, Data Mining Applications

INTRODUCTION
Decision-makers in the business sector are always concerning about their
business future. Since data collections form the core resource of information,
digitalizing business activities help to collect business operational data in
enormous storages named as a data warehouse. These historical data can be
used by data analysts to predict the future behavior of the business. However,
dealing with a huge quantity of data requires great time to process.
Data mining (DM) is a technique that uses information technology and
statistical methods to search for potential worthy information from a large
database that can be used to support administrative decisions. The reason
behind the importance of DM is that data can be converted into useful
information and knowledge automatically and intelligently. In addition,
enterprises use data mining to understand their operating status and
analyze the potential value of their information. Mined information
should be protected so that company secrets are not disclosed.
Different data mining concepts were described by Kaur [1] in terms of
functionalities, material, and mechanisms. Data mining involves the use of
sophisticated data analysis tools and techniques to find hidden
patterns and relationships that are valid in large data sets. The best-known
data mining technique is Association. In association, a pattern is discovered
based on a relationship between items in the same transaction. Clustering
is a data mining technique that automatically groups objects with similar
features into useful clusters. Decision Tree is one of the most common
data mining techniques. One of the most difficult parts of implementing a
data mining framework is deciding which method to use and when.
However, one of the most implemented data mining techniques in a
variety of applications is the classification technique. The classification
process needs two types of data: training data and testing data. Training
data are the data used by a data mining algorithm to learn the classification
metrics to classify the other data i.e. testing data. Many business applications
rely on their historical data to predict their business future.
The literature presents various problems that were solved by predicting
through data mining techniques. In business, DM techniques are used to
predict the export abilities of companies [2] . In social media applications,
missing link problems between online social networks (OSN) nodes are a
frequent problem in which a link is supposed to be between two nodes, but
it becomes a missing link for some reason [3]. In the agriculture sector,
analyzing soil nutrients through automation and data mining can bring
large profits to growers [4].
Data mining techniques are used to enhance building energy performance
by determining the target multi-family housing complex (MFHC) for green
remodeling [5]. In crime prevention, protecting women from crime and
violence is one of the important goals, and different data mining
techniques were used to analyze the causes of crime [6]. In the
healthcare sector, various data mining tools have been applied to a range of
diseases for detecting the infection in these diseases such as breast cancer
diagnosis, skin diseases, and blood diseases [7] .
Furthermore, data analysts in the education field used data mining
techniques to develop learning strategies at schools and universities [8] .
Another goal is to detect several styles of learner behavior and forecast his
performance [9] . One more goal is to forecast the student’s salary after
graduation based on the student’s previous record and behavior during the
study [10] . In general, services are considered products.
In this paper, various classification algorithms are reviewed in terms of
accuracy in different areas of data mining applications. This paper aims
to help data analysts to choose the most suitable classification algorithm
for different business applications including business, in general, reducing
online social media networks problems, developing education, health and
agriculture sector services.
The present paper consists of the following sections: Section 2 presents
a methodology for several data mining techniques in the literature. Section
3 summarizes the results obtained from the related literature and further
discussion. Finally, section 4 presents our conclusions and recommendations
for future work.

METHODS IN LITERATURE
The classification technique is one of the most implemented data mining
techniques in a variety of applications. The classification process needs two
types of data: training data and testing data. Training data are the data used
by a data mining algorithm to learn the classification metrics to classify the
other data i.e. testing data. Two data sets of text articles are used and classified
into training data and testing data. Three traditional classification algorithms
are compared in terms of accuracy and execution time by Besimi et al. [11]
. K-nearest neighbor classifier (K-NN), Naïve Bayes classifier (NB), and
Centroid classifier are considered. K-NN classifier is the slowest classifier
since it uses the whole training data as a reference to classify testing data.
On the other hand, the Centroid classifier uses the average vector for each
class as a model to classify new data. Hence, the Centroid classifier is much
faster than the K-NN classifier. In terms of accuracy, the Centroid classifier
has the highest accuracy rate among the others.
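A minimal sketch of the centroid-classifier mechanism summarised above, assuming tiny made-up two-dimensional document vectors; real text classification would use high-dimensional TF-IDF vectors, so this only illustrates the idea of averaging and nearest-centroid assignment.

# Illustrative sketch: a centroid classifier averages the training vectors
# of each class and assigns a new instance to the nearest centroid.
import numpy as np

train_vectors = {
    "politics": np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]),
    "sports":   np.array([[0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]),
}

# One centroid (average vector) per class.
centroids = {label: vecs.mean(axis=0) for label, vecs in train_vectors.items()}

def classify(vector):
    # Nearest centroid by Euclidean distance.
    return min(centroids, key=lambda label: np.linalg.norm(vector - centroids[label]))

print(classify(np.array([0.7, 0.3])))    # -> politics
print(classify(np.array([0.25, 0.75])))  # -> sports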
Several data mining techniques were used to predict the export abilities
of a sample of 272 companies by Silva et al. [2] . Synthetic Minority
Oversampling Technique (SMOTE) is used to oversample unbalanced
data. The K-means method is used to group the sample into three different
clusters. The generalized Regression Neural Network (GRNN) technique
is used to minimize the error between the actual input data points in the
network and the regression predicting vector in the model. Feed Forward
Back Propagation Neural Network (FFBPN) is a technique used in machine
learning to learn the pattern of specific input/output behavior for a set of
data in a structure known as Artificial Neural Networks (ANN). Support
Vector Machine (SVM) is a classification technique used to classify a set
of data according to similarities between them. A Decision Tree (DT) is a
classification method in which classes are represented in a series of
yes/no questions in a tree view. Naive Bayes is a classification technique used
to classify one data set in several data sets according to the Bayes theorem
probability concept. As a result, after applying those techniques GRNN
and FFBPN were the most accurate techniques used to predict the export
abilities of companies.
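As a rough analogue of the FFBPN mentioned above, the sketch below fits scikit-learn's MLPClassifier, a feed-forward network trained by back-propagation, on made-up company features; the features, labels, and network size are assumptions, not the variables used in the cited study.

# Illustrative sketch: a feed-forward back-propagation network used to
# classify companies by export ability. All data are hypothetical.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features: [employees, annual_revenue_musd, years_active]
X = np.array([[10, 0.5, 2], [250, 30.0, 15], [40, 2.0, 5],
              [500, 80.0, 20], [15, 0.8, 3], [120, 12.0, 10]])
y = np.array([0, 1, 0, 1, 0, 1])   # 1 = exporter, 0 = non-exporter

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
model.fit(X, y)
print(model.predict([[200, 25.0, 12]]))   # expected to come out as an exporter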
Social media applications are developed based on Online Social
Network (OSN) concept. Missing link problems between OSN nodes are a
frequent problem in which a link is supposed to exist between two nodes
but is missing for some reason. Support Vector Machine
(SVM), k-Nearest Neighbor (KNN), Decision Tree (DT), Neural Network,
Naive Bayes (NB), Logistic Regression, and Random Forest are prediction
techniques used to predict the missing link of two Facebook data sets by
Sirisup and Songmuang [3] . One dataset (DS1) with high density and the
other dataset (DS2) with low density. High density reflects that there is a
huge number of links between nodes. For high-density data set, Random
Forest gives the best performance among the others in terms of accuracy,
precision, F-measure, and area under the receiver operating characteristic
curve (AUC). On the other hand, the low-density data set can be predicted
perfectly with either Random Forest or Decision Tree. In the end, it can be
said that Random Forest is the best prediction technique used to predict data
in the OSN concept.
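The missing-link prediction task described above can be sketched as supervised classification over features of candidate node pairs; the pair features and toy data below are hypothetical and only illustrate the Random Forest setup.

# Illustrative sketch: predict whether a link exists between two OSN users
# from simple pair features, using a random forest. Data are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Features per candidate pair: [common_friends, same_city, shared_interests]
X = np.array([[12, 1, 5], [0, 0, 0], [8, 1, 3], [1, 0, 1],
              [15, 1, 7], [0, 1, 0], [9, 0, 4], [2, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # 1 = link exists, 0 = missing

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[10, 1, 4]]))        # expected to be predicted as a link
print(model.predict_proba([[10, 1, 4]]))  # class probabilities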
Analyzing soil nutrients can bring a large profit to growers, and
agricultural research has been capitalizing on technical advances such as
automation and data mining. Chiranjeevi and Ranjana [4] carried out a
comparative analysis of two algorithms, Naive Bayes and J48. J48 is based
on the C4.5 classifier. A decision tree is a flowchart-like tree
structure, where each internal node represents a test on an attribute.
Naive Bayes is a simple probabilistic classifier based on the Bayes
theorem with a strong naive independence assumption. The Naive Bayes
algorithm can be adapted to predict which crops will grow in a soil sample.
A decision support model was developed for determining the target
multi-family housing complex (MFHC) for green remodeling using a data
mining technique. Jeong et al. [5] note that, to select target MFHCs for
green remodeling, a careful and sensible method for evaluating building
energy performance must be established. An energy benchmark for MFHCs in
South Korea had been proposed, but that study was limited to MFHCs using
a district heating system. To select green remodeling targets among
MFHCs, the different heating systems used in MFHCs, e.g. individual
heating systems, district heating systems, and central heating systems,
need to be considered. However, two issues had to be addressed in this
study. First, the operational rating and energy benchmark system were
proposed taking into account the different heating system variables.
Second, the model for selecting the target MFHC for green remodeling was
developed taking into account the different characteristics of the
complexes. The developed decision support model can serve as a sensible
standard for selecting the target MFHC for green remodeling.
Preventing crime and violence against women is one of the important
goals for police. Different data mining techniques were used to analyze
the causes of crime and the relationships between multiple offenses.
These techniques play important roles in crime analysis and forecasting.
Kaur et al. [6] review the data mining techniques used in crime
forecasting and analysis. It was concluded from this discussion that most
researchers used classification and clustering techniques to detect crime
patterns. In classification, the following techniques were used: Naïve
Bayes, Decision Tree, BayesNet, J48, JRip, and OneR.
Furthermore, Kumar et al. [12] proposed data mining techniques for
cyber-attack issues. Many applications fall under the cybersecurity
concept, and these applications need to be analyzed and audited with data
mining techniques. Secret information can be stolen when an unauthorized
user gains access through a security breach, and malicious software and
viruses such as trojan horses cause security infringements that lead to
antisocial activities in the world of cyber-crime. Data mining techniques
can restrict secret information and data to legitimate users so that
unauthorized access is blocked.
However, Thongsatapornwatana [13] provides a survey of techniques
used to analyze crime modalities in previous research. The survey focuses on
various types of crimes e.g. violent crime, drugs, border control, and cyber
criminality. Survey results show that most of the techniques used contain
research gaps. These techniques failed to predict crime accurately, which
increases the challenge of overcoming this failure. Hence, these
techniques need crime models, analysis, and prepared data in order to
find appropriate algorithms.
Data mining in the healthcare sector is just as important as in other
areas. Extracting knowledge from health care records is an exacting and
complex task. Mia et al. [7] review the academic literature based on
health care data to identify the existing data mining methods and
techniques described there. Many data mining tools have been applied to a
range of diseases for detecting infection, such as breast cancer
diagnosis, skin diseases, and blood diseases. Data mining is highly
effective in this domain due to the rapid growth in the size of medical data.
Moreover, Kaur and Bawa [14] present a detailed view of popular data
mining techniques in the medical healthcare sector so that researchers
can carry out more exploratory work. Knowledge discovery in databases
(KDD) analyzes large volumes of data and turns them into meaningful
information.
Data mining techniques are a boon because they help in the early
diagnosis of medical diseases with high accuracy, which saves time and
money in efforts involving computers, robots, and parallel processing.
Among all medical diseases, cardiovascular disease is the most critical.
Data mining has proved efficacious where accuracy is a major concern, and
data mining techniques have been used successfully in the treatment of
various other serious, life-threatening diseases.
As another attempt, a comparative analysis is conducted by Parsania et
al. [15] to find the best data mining classification techniques based on
healthcare data in terms of accuracy, sensitivity, precision, false-positive
rate, and f-measure. Naïve Bayes, Bayesian Network, J RIPPER (JRip),
OneRule (OneR), and PART techniques are selected to be applied over a
dataset from a health database. Results show that the PART technique is
the best in terms of precision, false-positive rate, and f-measure metrics. In
terms of accuracy, the OneR technique is the best while Bayesian Network
is the best technique in terms of sensitivity.
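Since the comparison above relies on accuracy, sensitivity, precision, false-positive rate, and f-measure, the short sketch below shows how these metrics follow from a binary confusion matrix; the labels and predictions are made up for illustration.

# Illustrative sketch: evaluation metrics derived from a binary confusion
# matrix. The true and predicted labels below are hypothetical.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)       # recall / true-positive rate
precision   = tp / (tp + fp)
fpr         = fp / (fp + tn)       # false-positive rate
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy, sensitivity, precision, fpr, f_measure)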
Data mining techniques are used widely in several fields. Data analysts
in the education field used data mining techniques to develop learning
strategies at schools and universities since it serves a big chunk of society.
A cooperative learning model to group learners into active learning groups
via the web was introduced by Amornsinlaphachai [8] . Artificial Neural
Network (ANN), K-Nearest Neighbor (KNN), Naive Bayes (NB), Bayesian
Belief Network (BN), RIPPER (called JRIP), ID3, and C4.5 (called J48)
classification data mining algorithms are used to predict the performance
of 474 students who study computer programming subject at Nakhon
Ratchasima Rajabhat University in Thailand. A comparison between those
algorithms is made to select the most efficient algorithm among them.

As a result, C4.5 was the most efficient algorithm in predicting students’
academic performance levels in terms of different measures such as
correctness of the predicted data, data precision, recall, f-measure, mean
absolute error, and processing time. Although C4.5 does not have the lowest
processing time, it gets the highest percentage of correctness i.e. 74.89
percent since it is a simple and reliable algorithm. The ID3 algorithm
gets the lowest percentage of correctness owing to its limitations. Selecting learners
to form active learning groups by the introduced model using the C4.5
algorithm shows a better learning level against traditional selecting by
instructors.
To obtain decisions that improve learner performance and help learners
progress in their education, Jalota and Agrawal [16] used five
classification techniques on an education dataset collected through a
Learning Management System (LMS). The techniques used were the J48,
Support Vector Machine, Naïve Bayes, Random Forest, and Multilayer
Perceptron algorithms, all run within the Waikato Environment for
Knowledge Analysis (WEKA). After comparison, the results showed that the
Multilayer Perceptron outperformed the other techniques since it achieved
the highest accuracy and performance metrics.
Roy and Garg [9] present a literature survey of data mining techniques
used in Educational Data Mining (EDM). Data mining techniques are used
in the EDM domain to detect several styles of learner behavior and forecast
his performance. It was concluded that most of the previous research
collected data on predicting student performance by a set of questionnaires.
The Cross-Industry Standard Process for Data Mining (CRISP-DM) model
was used. WEKA and the R tool are open-source data mining tools applied
for statistical and data analysis.
As an application of data mining techniques in the education field,
Khongchai and Songmuang [10] created an incentive for students by
predicting the learner’s future salary. Learners are often bored with
academic studies, which can lead to poor grades or even to leaving
college, because they lose the motivation that encourages them to
continue their studies. A good incentive for learners to continue their
studies and develop their academic level can be provided by a model that
forecasts the student’s salary after graduation based on the student’s
previous record and behavior during the study.

In the meantime, the data mining techniques used in this model are
K-Nearest Neighbors (K-NN), Naive Bayes (NB), Decision trees J48,
Multilayer Perceptron (MLP), and Support Vector Machines (SVM). To
determine the preferable technique for predicting future salary, a test was
conducted by entering data of students graduating from the same university
during the years 2006 to 2015. A WEKA (Waikato Environment for
Knowledge Analysis) tool was used to compare the outputs of data mining
techniques. The results showed that, after the comparisons, the K-Nearest
Neighbors (KNN) technique outperformed the others, reaching 84.69 percent
for Recall, Precision, and F-measure. The other techniques scored as
follows: J48 73.96 percent, SVM 43.71 percent, Naive Bayes (NB) 43.63
percent, and Multilayer Perceptron (MLP) 38.8 percent. A questionnaire was then
distributed to 50 current students at the university to see if the model works
to achieve its objectives. The results of the questionnaire indicate that the
proposed model increased the motivation of the students and helped them to
focus on continuing the study.
Sulieman and Jayakumari [17] highlighted the importance of using data
mining for 11th grade education in Oman, where school administration
systems hold comprehensive student data. The goal is to decrease the
student dropout rate and improve school performance. Using data mining
techniques helps students choose the appropriate mathematics course for
11th grade in Oman. It is an opportunity to provide appropriate analysis
through a method that extracts student information from end-of-term
grades to improve student performance. The knowledge derived from data
mining helps decision-makers in the field of education make better
decisions that support the development of educational processes. Data
mining techniques were applied to the mathematics subject, and the
results of the various algorithms applied to the various datasets in the
study confirm that predictions of student choice and performance can be
obtained using data mining techniques.
Academic databases can be analyzed through a data mining approach to gain
new, helpful knowledge. Wati et al. [18] predict the degree completion
time of bachelor’s degree students using data mining algorithms such as
the C4.5 and naive Bayes classifier algorithms. They concentrate on
comparing the performance of these algorithms, especially the decision
tree-based C4.5 algorithm, which uses the gain ratio to select nodes,
against the naive Bayes classifier. The results of predicting the degree
completion time of bachelor’s degree students show that the C4.5
algorithm is preferable, with 78 percent precision, 85 percent measured
mean class precision, and 65 percent measured mean class recall.
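As a small worked illustration of the gain ratio that C4.5 uses to choose split attributes, the sketch below computes it for one hypothetical attribute over toy completion labels; the data and attribute are invented for illustration only.

# Illustrative sketch: the gain ratio used by C4.5 to rank split attributes,
# computed for one hypothetical attribute over toy on-time/late labels.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(labels, attribute_values):
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    info_gain = entropy(labels) - sum(
        (len(g) / total) * entropy(g) for g in groups.values())
    split_info = -sum(
        (len(g) / total) * log2(len(g) / total) for g in groups.values())
    return info_gain / split_info if split_info else 0.0

# Hypothetical student records: did the student finish the degree on time?
labels = ["yes", "yes", "no", "no", "yes", "no"]
gpa_band = ["high", "high", "low", "low", "high", "low"]  # toy attribute
print(gain_ratio(labels, gpa_band))   # 1.0 for this perfectly informative split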
Anoop Kumar and Rahman [19] used data mining techniques in an
educational setting, a practice called educational data mining (EDM).
The possibilities for data mining in education, and the data to be
harvested, are almost limitless. Knowledge discovered by data mining
techniques can help teachers manage their classes and understand their
students’ learning processes. All of this helps ensure the advancement of
students in their academics and allows interventions when progress falls
short of programme and institutional expectations. A basic advantage is
that this kind of analysis helps establish solutions for slow learners.
Educational data mining methods are presently used to improve teaching
and to predict students’ academic performance in the learning process.
To conclude, data mining techniques are used to turn raw data into
helpful information in the education environment, and data mining is
widely implemented in educational settings. Educational environments
produce a large amount of student data that can be used for different
purposes, such as predicting the needs of students. Rambola et al. [20]
compare the data mining techniques and algorithms used in different
implementations, thus assessing their efficiency. The objectives of
educational data mining fall into three categories: prediction,
clustering, and relationship mining. Some of the most common techniques
used in educational data mining are mentioned, such as association rule
mining, classification, clustering, and outlier detection. Association
rule mining is applied to extract patterns of unsuccessful outcomes and
to recommend the best course for the student.

RESULTS AND DISCUSSION


In this section, we summarize the comparison results that were obtained
from the literature in different business applications. Table 1 shows the
comparison of classification algorithms that are used to predict data in
business, online social media networks, agriculture, health, and education
applications domains.
As mentioned in [11], the k-nearest neighbors (k-NN), Naïve Bayes (NB),
and Centroid classifiers are compared.

Table 1: Comparison of classification algorithms in multiple applications

Politics, technology, and sports news articles are used with a total of
237 news articles. Experiments show that the Centroid classifier is the most
accurate algorithm in classifying text documents since it classifies 226

news articles correctly. Centroid classifier calculates the average vector for
each class and uses them as a reference to classify each new test instance.
However, k-NN needs to compare each test instance against the distances
to all training instances every time.
In [2] , 272 companies are taken as a study sample to be classified. Five
classification algorithms are used to classify companies into three classes:
Generalized Regression Neural Network (GRNN), Feed Forward Back
Propagation Neural Network (FFBPN), Support Vector Machine (SVM),
Decision Tree (DT), and Naïve Bayes (NB). Results show that FFBPN is
the most accurate algorithm to classify instances in the business domain
with an accuracy of 85.2 percent.
Two Online Social Networks (OSN) datasets are used to compare the
performance of seven classification algorithms. The first dataset (DS1) with
High density (0.05) and the other dataset (DS2) with low-density (0.03).
The two datasets were obtained using the Facebook API tool. Each dataset
contains public information about the users such as interests, friends, and
demographics data. Classification algorithms include; Support Vector
Machine (SVM), k-Nearest Neighbors (k-NN), Decision Tree (DT), Neural
Networks, Naïve Bayes (NB), Logistic Regression, and Random Forest. As
results show in [3] , the Random Forest algorithm is the most accurate in
classifying OSN activities even with a high-density OSN dataset.
A dataset of 1676 soil samples has 12 attributes that need to be classified.
J48 Decision Tree (J48 DT) and Naïve Bayes (NB) classification algorithms
are used. Results in [4] tell that the NB algorithm is more accurate than J48
DT to classify agriculture datasets since it classifies 98 percent of instances
correctly.
An experiment is conducted in the health domain to classify 3163
patients’ data as mentioned in [15] . Naïve Bayes (NB), Bayesian Network
(BayesNet), J Ripper (JRip), One Rule (OneR), and PART classification
algorithms are used. Results show that OneR is the most accurate algorithm
to classify instances in the health domain with an accuracy of 99.2 percent.
Random Forest, Naïve Bayes (NB), Multilayer Perceptron (MLP),
Support Vector Machine (SVM), and J48 Decision Tree (J48 DT)
classification algorithms are used. 163 instances are used as an experimental
dataset of students’ performance. Results in [16] tell that the MLP algorithm
is the most accurate algorithm to classify students’ performance datasets
since it classifies 76.1 percent of instances correctly. 13,541 students’
profiles are used as a dataset to examine five classification algorithms.
k-Nearest Neighbors (k-NN), Naïve Bayes (NB), J48 Decision Tree (J48
DT), Multilayer Perceptron (MLP), and Support Vector Machine (SVM)
were compared in terms of accuracy. As results show in [10] , the k-NN
algorithm is the most accurate algorithm with an 84.7 percent accuracy level.
297 students’ records were used as a dataset in [18] . Two classification
algorithms are applied: C4.5 Decision Tree (C4.5 DT), and Naïve Bayes
(NB). Results tell that the C4.5 DT algorithm is more accurate than NB to
classify students’ records since it classifies 78 percent of instances correctly.

CONCLUSIONS AND FUTURE WORK


Data mining involves many useful techniques that are used to predict
statistical data in a variety of business applications. The classification
technique is one of the most widely used with a variety of algorithms.
In this paper, various classification algorithms were reviewed in terms of
accuracy in different areas of data mining applications including business
in general, online social media networks, agriculture, health, and education
to help data analysts to choose the most suitable classification algorithm
for each business application. Experiments in the reviewed literature show
that the Centroid classifier is the most accurate algorithm in classifying text
documents. FFBPN is the most accurate algorithm to classify instances in
the business domain. The Random Forest algorithm is the most accurate in
classifying OSN activities. Naïve Bayes algorithm is more accurate than
J48 DT to classify agriculture datasets. OneR is the most accurate algorithm
to classify instances in the health domain. Multilayer Perceptron algorithm
is the most accurate algorithm to classify students’ performance datasets.
K-Nearest Neighbors algorithm is the most accurate algorithm in classifying
students’ profiles to increase their motivation. C4.5 Decision Tree algorithm
is more accurate than Naïve Bayes to classify students’ records.
As future work, reviewing more related papers in the mentioned domains,
as well as exploring new domains, will add significantly to this work.
The paper can then serve as a reference for business data analysts.

REFERENCES
1. Harkiran, K. (2017) A Study On Data Mining Techniques And
Their Areas Of Application. International Journal of Recent Trends
in Engineering and Research, 3, 93-95. https://doi.org/10.23883/
IJRTER.2017.3393.EO7O3
2. Silva, J., Borré, J.R., Castillo, A.P.P., Castro, L. and Varela, N. (2019)
Integration of Data Mining Classification Techniques and Ensemble
Learning for Predicting the Export Potential of a Company. Procedia
Computer Science, 151, 1194-1200. https://doi.org/10.1016/j.
procs.2019.04.171
3. Sirisup, C. and Songmuang, P. (2018) Exploring Efficiency of Data
Mining Techniques for Missing Link in Online Social Network. 2018
International Joint Symposium on Artificial Intelligence and Natural
Language Processing (iSAI-NLP), Pattaya, 15-17 November 2018.
https://doi.org/10.1109/iSAI-NLP.2018.8692951
4. Chiranjeevi, M.N. and Nadagoudar, R.B. (2018) Analysis of Soil
Nutrients Using Data Mining Techniques. International Journal of
Recent Trends in Engineering and Research, 4, 103-107. https://doi.
org/10.23883/IJRTER.2018.4363.PDT1C
5. Jeong, K., Hong, T., Chae, M. and Kim, J. (2019) Development of
a Decision Support Model for Determining the Target Multi-Family
Housing Complex for Green Remodeling Using Data Mining
Techniques. Energy and Buildings, 202, Article ID: 109401. https://
doi.org/10.1016/j.enbuild.2019.109401
6. Kaur, B., Ahuja, L. and Kumar, V. (2019) Crime against Women:
Analysis and Prediction Using Data Mining Techniques. International
Conference on Machine Learning, Big Data, Cloud and Parallel
Computing (COMITCon), 14-16 February 2019, Faridabad. https://
doi.org/10.1109/COMITCon.2019.8862195
7. Mia, M.R., Hossain, S.A., Chhoton, A.C. and Chakraborty, N.R. (2018)
A Comprehensive Study of Data Mining Techniques in Health-Care,
Medical, and Bioinformatics. International Conference on Computer,
Communication, Chemical, Material and Electronic Engineering
(IC4ME2), Rajshahi, 8-9 February 2018. https://doi.org/10.1109/
IC4ME2.2018.8465626
8. Amornsinlaphachai, P. (2016) Efficiency of Data Mining Models to
Predict Academic Performance and a Cooperative Learning Model.
8th International Conference on Knowledge and Smart Technology
(KST), Chiang Mai, 3-6 February 2016. https://doi.org/10.1109/
KST.2016.7440483
9. Roy, S. and Garg, A. (2017) Analyzing Performance of Students by
Using Data Mining Techniques: A Literature Survey. 4th IEEE Uttar
Pradesh Section International Conference on Electrical, Computer
and Electronics (UPCON), Mathura, 26-28 October 2017. https://doi.
org/10.1109/UPCON.2017.8251035
10. Khongchai, P. and Songmuang, P. (2017) Implement of Salary
Prediction System to Improve Student Motivation Using Data Mining
Technique. 11th International Conference on Knowledge, Information
and Creativity Support Systems (KICSS), Yogyakarta, 10-12 November
2016. https://doi.org/10.1109/KICSS.2016.7951419
11. Besimi, N., Cico, B. and Besimi, A. (2017) Overview of Data
Mining Classification Techniques: Traditional vs. Parallel/Distributed
Programming Models. Proceedings of the 6th Mediterranean
Conference on Embedded Computing, Bar, 11-15 June 2017, 1-4.
https://doi.org/10.1109/MECO.2017.7977126
12. Kumar, S.R., Jassi, J.S., Yadav, S.A. and Sharma, R. (2016) Data-
Mining a Mechanism against Cyber Threats: A Review. International
Conference on Innovation and Challenges in Cyber Security (ICICCS-
INBUSH), Greater Noida, 3-5 February 2016. https://doi.org/10.1109/
ICICCS.2016.7542343
13. Thongsatapornwatana, U. (2016) A Survey of Data Mining Techniques
for Analyzing Crime Patterns. Second Asian Conference on Defence
Technology (ACDT), Chiang Mai, 21-23 January 2016. https://doi.
org/10.1109/ACDT.2016.7437655
14. Kaur, S. and Bawa, R.K. (2017) Data Mining for Diagnosis in Healthcare Sector: A Review. International Journal of Advances in Scientific Research and Engineering.
15. Vaishali, S., Parsania, N., Jani, N. and Bhalodiya, N.H. (2014)
Applying Naïve Bayes, BayesNet, PART, JRip and OneR Algorithms
on Hypothyroid Database for Comparative Analysis. International
Journal of Darshan Institute on Engineering Research & Emerging
Technologies, 3, 60-64.
16. Jalota, C. and Agrawal, R. (2019) Analysis of Educational Data Mining
using Classification. International Conference on Machine Learning,
Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, 14-
16 February 2019. https://doi.org/10.1109/COMITCon.2019.8862214
17. Al-Nadabi, S.S. and Jayakumari, C. (2019) Predict the Selection of
Mathematics Subject for 11th Grade Students Using Data Mining
Technique. 4th MEC International Conference on Big Data and Smart
City (ICBDSC), Muscat, 15-16 January 2019. https://doi.org/10.1109/
ICBDSC.2019.8645594
18. Wati, M., Haeruddin and Indrawan, W. (2017) Predicting Degree-
Completion Time with Data Mining. 3rd International Conference
on Science in Information Technology (ICSITech), Bandung, 25-26
October 2017. https://doi.org/10.1109/ICSITech.2017.8257209
19. Anoopkumar, M. and Zubair Rahman, A.M.J.Md. (2016) A Review on
Data Mining Techniques and Factors Used in Educational Data Mining
to Predict Student Amelioration, International Conference on Data
Mining and Advanced Computing (SAPIENCE), Ernakulam, 16-18
March 2016.
20. Rambola, R.K., Inamke, M. and Harne, S. (2018) Literature Review:
Techniques and Algorithms Used for Various Applications of
Educational Data Mining (EDM). 4th International Conference on
Computing Communication and Automation (ICCCA), Greater Noida,
14-15 December 2018. https://doi.org/10.1109/CCAA.2018.8777556
Chapter 12

Different Data Mining Approaches Based Medical Text Data

Wenke Xiao1, Lijia Jing2, Yaxin Xu1, Shichao Zheng1, Yanxiong Gan1, and Chuanbiao Wen1
1 School of Medical Information Engineering, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
2 School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China

ABSTRACT
The amount of medical text data is increasing dramatically. Medical text
data record the progress of medicine and imply a large amount of medical
knowledge. As a natural language, they are characterized by semistructured,
high-dimensional, high data volume semantics and cannot participate
in arithmetic operations. Therefore, how to extract useful knowledge or information from the total available data is a very important task. Various data mining techniques can be used to extract valuable knowledge or information from data. In the current study, we reviewed different approaches to apply

Citation: Wenke Xiao, Lijia Jing, Yaxin Xu, Shichao Zheng, Yanxiong Gan, Chuan-
biao Wen, “Different Data Mining Approaches Based Medical Text Data”, Journal of
Healthcare Engineering, vol. 2021, Article ID 1285167, 11 pages, 2021. https://doi.
org/10.1155/2021/1285167.
Copyright: © 2021 by Authors. This is an open access article distributed under the Cre-
ative Commons Attribution License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.

for medical text data mining. The advantages and shortcomings for each
technique compared to different processes of medical text data were
analyzed. We also explored the applications of algorithms for providing
insights to the users and enabling them to use the resources for the specific
challenges in medical text data. Further, the main challenges in medical text
data mining were discussed. The findings of this paper are beneficial for helping researchers to choose reasonable techniques for mining medical text data and for presenting the main challenges they face in medical text data mining.

INTRODUCTION
The era of big data is coming, with the mass of data growing at an incredible rate. The concept of big data was first put forward at the 11th EMC World conference in 2011, and it refers to large-scale datasets that cannot be captured, managed, or processed by common software tools. With the arrival of the big data age, the amount of medical text data is increasing dramatically. Analyzing this immense amount of medical text data to extract valuable knowledge or information is useful for decision support, prevention, diagnosis, and treatment in the medical world [1]. However, analyzing the huge amount of multidimensional or raw data is a very complicated and time-consuming task. Data mining has capabilities for this matter.
Data mining is a methodology for discovering novel, valuable, and useful information, knowledge, or hidden patterns from enormous datasets by using various statistical approaches. Data mining has many advantages in contrast to the traditional model of transforming data into knowledge through manual analysis and interpretation. Data mining approaches are quicker, more favorable, time-saving, and objective. Summarizing various data
mining approaches in medical text data for clinical applications is essential
for health management and medical research.
This paper is organized in four sections. Section 2 presents the concepts
of medical text data. Section 3 includes data mining approaches and its
applications in medical text data analysis. Section 4 concludes this paper
and presents the future works.

MEDICAL TEXT DATA


The diversity of big data is inseparable from the abundance of data sources.
Medical big data including experimental data, clinical data, and medical
imaging data are increasing with the rapid development of medicine.
Medical big data are the application of big data in the medical field after
the data related to human health and medicine have been stored, searched,
shared, analyzed, and presented in innovative ways [2]. Medical text data
are an important part of medical big data which are described in natural
language, cannot participate in an arithmetic operation, and are characterized
by semistructured, high-dimensional, high data volume semantics [3]. They
cannot be well applied in research owing to the lack of a fixed writing format and their highly specialized content [4]. Medical text data contain clinical data, medical
record data, medical literature data, etc., and this type of data records the
progress of medicine and implies a large amount of medical knowledge.
However, using human effort to extract the relationships between entities from a vast amount of medical text is time-consuming. With the development of data mining technology, applying data mining to medical text to discover these relationships has become a hot topic. Medical text data mining is able to assist the discovery of medical information. In the COVID-19 research field, medical text mining can help decision-makers to control the outbreak by gathering and collating basic scientific data and research literature related to the novel coronavirus, and by predicting the population susceptible to COVID-19 pneumonia, virus variability, and potential therapeutic drugs [5–8].

MEDICAL TEXT DATA MINING


Data mining was defined in the “First section of the 1995 International
Conference on Knowledge Discovery and Data Mining,” which has been
widely used in disease auxiliary diagnosis, drug development, hospital
information system, and genetic medicine to facilitate the medical knowledge
discovery [9–12]. Data mining used to process medical text data can be
divided into four steps: data collection, data processing, data analysis, and
data evaluation and interpretation. This study summarized the algorithms
and tools for medical text data based on the four steps of data mining.

Data Preparation
Medical text data include electronic medical records, medical images,
medical record parameters, laboratory results, and pharmaceutical
antiquities according to the different data sources. The different data were
selected based on the data mining task and stored in the database for further
processing.

Data Processing
The quality of data will affect the efficiency and accuracy of data mining
and the effectiveness of the final pattern. The raw medical text data contain a
large amount of fuzzy, incomplete, noisy, and redundant information. Taking
medical records as an example, traditional paper-based medical records have many shortcomings, such as nonstandard terms, difficulty in supporting clinical decision-making, and scattered information distribution. After the emergence of electronic medical records, medical record data have gradually been standardized [13]. However, electronic medical records, still written in natural language, remain difficult for data mining. Therefore, it is necessary to clean and filter the data to ensure data consistency and certainty by removing missing, incorrect, noisy, inconsistent, or low-quality data.
Missing values in medical text data are usually handled by deletion and interpolation. Deletion is the easiest method to apply, but some useful information is lost. Interpolation is a method that assigns reasonable substitution values to missing values through a specific algorithm. At present, many algorithms have emerged for this step of data processing. Multiple imputation, regression algorithms, and k-nearest neighbors are often used to supplement missing values in medical text data. The detailed algorithm information is shown in Table 1. In order to further understand
the semantic relationships of medical texts, researchers have used natural
language processing (NLP) techniques to perform entity naming, relationship
extraction, and text classification operations on medical text data with good
results [19].

Table 1: The detailed algorithm information for missing values in medical text data

Multiple imputation [14, 15]. Principle: estimate the value to be interpolated and add different noises to form multiple groups of candidate interpolation values; select the most appropriate interpolation value according to a certain selection basis. Purpose: repeat the simulation to supplement the missing value.

Expectation maximization [16]. Principle: compute maximum likelihood estimates or posterior distributions with incomplete data. Purpose: supplement missing values.

K-nearest neighbors [17, 18]. Principle: select the K closest neighbors according to a distance metric and estimate missing data with the corresponding mode or mean. Purpose: estimate missing values with samples.
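As a minimal sketch of the deletion and interpolation strategies in Table 1, the following Python snippet applies row deletion, mean imputation, and k-nearest neighbors imputation to a small toy matrix; the scikit-learn imputers and the toy values are assumptions for illustration, not the tools used in the cited studies.

# A minimal sketch of deletion vs. interpolation of missing values on a toy
# numeric matrix standing in for structured fields from medical text records.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

data = pd.DataFrame({
    "age":        [63, 45, np.nan, 70, 52],
    "heart_rate": [88, np.nan, 76, 91, 85],
    "glucose":    [5.6, 6.1, 7.2, np.nan, 5.9],
})

# Deletion: simplest, but loses every row that contains a missing value.
deleted = data.dropna()

# Mean interpolation: replace each missing value with the column mean.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(data)

# k-nearest neighbours interpolation: estimate a missing value from the
# corresponding values of the k most similar samples.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(data)

print(deleted)
print(np.round(mean_imputed, 2))
print(np.round(knn_imputed, 2))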

Natural Language Processing


Natural language processing (NLP), as a subfield of artificial intelligence, is mainly used for Chinese word segmentation, part-of-speech tagging, parsing, natural language generation, text categorization, information retrieval, information extraction, text proofing, question answering, machine translation, automatic summarization, and textual entailment, with the advantages of fast processing and lasting effect. It affirms positive motivation without negative influence, which can effectively stimulate potential and support continuous learning, growth, and development [20].
In medical text processing, NLP is often used for information extraction
and entity naming including word segmentation, sentence segmentation,
syntactic analysis, grammatical analysis, and pragmatic analysis. The
schematic of natural language processing is shown in Figure 1. Kou et al.
[21] used NLP tools to extract important disease-related concepts from
clinical notes, form a multichannel processing method, and improve data
extraction ability. Jonnagaddala et al. [22] proposed a hybrid NLP model
to identify Framingham heart failure signs and symptoms from clinical
notes and electronic health record (EHR). Trivedi et al. [23] designed
an interactive NLP tool to extract information from clinical texts, which
can serve clinicians well after evaluation. Datta et al. [24] evaluated the
NLP technology to extract cancer information from EHR, summarized the
implementation functions of each framework, and found many repetitive parts
in different NLP frameworks resulting in a certain waste of resources. The
possibility of diversified medical text data will also bring the transformation
of medical data analysis mode and decision support mode. Roberts and
Demner-Fushman [25] manually annotated tags on 468 electronic medical
records to generate a corpus, which provided corpus support for medical data
mining. The development of NLP technology greatly reduces the difficulty
of manual data processing in data mining. Shikhar Vashishth et al. [26] used
semantic type filtering to improve the performance connectivity of medical
entities across all toolkits and datasets, which provided a new semantic type
prediction module for the biomedical NLP pipeline. Topaz et al. [27] used an
NLP-based classification system, support vector machine (SVM), recurrent
neural network (RNN), and other machine learning methods to identify
diabetic patients from clinical records and reduce the manual workload in
medical text data mining.
Figure 1: Schematic of natural language processing flow.
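As an illustrative sketch of one NLP task mentioned above (text classification of clinical notes), the following Python snippet classifies a few invented clinical-style sentences with a TF-IDF bag-of-words pipeline; the tiny corpus, labels, and scikit-learn pipeline are assumptions, not the systems used in the cited studies.

# A minimal sketch of clinical-note text classification with a bag-of-words
# pipeline. The tiny corpus and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

notes = [
    "patient reports chest pain and shortness of breath",
    "elevated fasting glucose, polyuria and increased thirst",
    "exertional chest tightness radiating to the left arm",
    "hba1c above target, started on metformin",
]
labels = ["cardiac", "diabetes", "cardiac", "diabetes"]

# Tokenize, weight terms by TF-IDF, and train a simple classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(notes, labels)

print(model.predict(["new onset chest pain on exertion"]))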

Data Analysis
Data analysis is applying data mining methods for extracting interesting
patterns. The model establishment is essential for knowledge discovery in
data analysis. According to the characteristics of the data, modeling and
analysis are performed. After the initial test, the model is parametrically
adjusted. The advantages and disadvantages of different models are analyzed
to choose the final optimal model. Data analysis methods for medical text data include clustering, classification, association rules, and regression, depending on the goal. The detailed information of these methods is shown in Table 2.

Table 2: The information of analysis methods for medical text data

Clustering. Purpose: classify similar subjects in medical texts. Algorithm: K-means [28, 29]. Advantages: (1) simple and fast; (2) scalability and efficiency. Shortcomings: (1) large amount of data and time-consuming; (2) more restrictions on use.

Classification. Purpose: read medical text data for intention recognition. Algorithm: ANN [30, 31]. Advantages: (1) solves complex mechanisms in text data; (2) high degree of self-learning; (3) strong fault tolerance. Shortcomings: (1) slow training; (2) many parameters and difficulty in adjusting parameters.

Classification. Algorithm: decision tree [32, 33]. Advantages: (1) handles continuous variables and missing values; (2) judges the importance of features. Shortcomings: (1) overfitting; (2) the result is unstable.

Classification. Algorithm: Naive Bayes [34]. Advantages: (1) the learning process is easy; (2) good classification performance. Shortcoming: higher requirements for data independence.

Association rules. Purpose: mine frequent items and corresponding association rules from massive medical text datasets. Algorithm: Apriori [35, 36]. Advantage: simple and easy to implement. Shortcoming: low efficiency and time-consuming.

Association rules. Algorithm: FP-tree [37]. Advantages: (1) reduces the number of database scans; (2) reduces the amount of memory space. Shortcoming: high memory overhead.

Association rules. Algorithm: FP-growth [38]. Advantages: (1) improves data density structure; (2) avoids repeated scanning. Shortcoming: harder to achieve.

Logistic regression. Purpose: analyze how variables affect results. Algorithm: logistic regression [39]. Advantages: (1) visual understanding and interpretation; (2) very sensitive to outliers. Shortcomings: (1) easy underfitting; (2) cannot handle a large number of multiclass features or variables.

Artificial Neural Network


Artificial Neural Network (ANN) is a nonlinear prediction model that is learned by training, and it has the advantages of accurate classification, self-learning, associative memory, high-speed search for the optimal solution, and good stability in data mining. ANN mainly consists of three parts: the input layer, the hidden layer, and the output layer [40]. The input layer is
responsible for receiving external information and data. The hidden layer
is responsible for processing information and constantly adjusting the
connection properties between neurons, such as weights and feedback,
while the output layer is responsible for outputting the calculated results.
ANN is different from traditional artificial intelligence and information
processing technology, which overcomes the drawbacks of traditional
artificial intelligence based on logical symbols in processing intuitive and
unstructured information, and has the characteristics of self-adaption, self-
organizing, and real-time learning. It can complete data classification,
feature mining, and other mining tasks. Medical text data contain massive
amounts of patient health records, vital signs, and other data. ANN can
analyze the conditions of patients’ rehabilitation, find the law of patient
data, predict the patient’s condition or rehabilitation, and help to discover
medical knowledge [41].
There are several ANN mining techniques that are used for medical text
data, such as backpropagation and factorization machine-supported neural
network (FNN). The information on ANN mining techniques is shown in
Table 3.

Table 3: The information of ANN mining techniques

Backpropagation [42]. Advantages: (1) strong nonlinear mapping capability; (2) strong generalization ability; (3) strong fault tolerance. Shortcomings: (1) local minimization; (2) slow convergence; (3) different structure choices.

Radial basis function [43]. Advantages: (1) fast learning speed; (2) easy to solve text data classification problems. Shortcoming: complex structure.

FNN [44]. Advantages: (1) reduces feature engineering; (2) improves FM learning ability. Shortcoming: limited modeling capability.

(1) ANN Core Algorithm: BP Algorithm. The backpropagation (BP) algorithm, as the classical algorithm of the ANN, is widely used for medical text data. The BP algorithm was developed on the basis of the single-layer neural network. It uses reverse propagation to adjust the weights and to construct a multilayer network so that the system can continue to learn. BP is a multilayered feed-forward network whose propagation is forward. Compared with recurrent neural network algorithms, the reverse propagation of error makes it faster and more powerful for high-throughput microarray or sequencing data modeling [45].
BP algorithm training is mainly divided into the following two stages: (1) forward propagation, in which the actual output values of each computing unit are processed implicitly, layer by layer, starting from the input layer; (2) backpropagation, in which, when the output value does not reach the expected value, the difference between the actual output and the expected output is calculated recursively and the weights are adjusted according to the difference.
The total error is defined as

$$E = \frac{1}{2}\sum_{k=1}^{m}\sum_{t}\left(d_{t}^{(k)} - y_{t}^{(k)}\right)^{2}, \qquad (1)$$

where m is the total number of samples, k is the sample index, t is the output unit index, $d_{t}^{(k)}$ is the desired output, and $y_{t}^{(k)}$ is the actual output.
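The two training stages and the total error in equation (1) can be illustrated with a small NumPy sketch of a one-hidden-layer network; the network size, learning rate, and toy data below are assumptions for illustration only, not the configuration used in the cited studies.

# A minimal NumPy sketch of the two BP stages: forward propagation through one
# hidden layer, then backpropagation of the squared error from equation (1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))                  # 20 samples, 4 input features
d = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # desired outputs (toy target)

W1 = rng.normal(scale=0.5, size=(4, 8))       # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))       # hidden -> output weights
lr = 0.5

for epoch in range(200):
    # (1) Forward propagation: process the inputs layer by layer.
    h = sigmoid(X @ W1)
    y = sigmoid(h @ W2)

    # Total error E = 1/2 * sum_k sum_t (d - y)^2, as in equation (1).
    E = 0.5 * np.sum((d - y) ** 2)

    # (2) Backpropagation: spread the error backwards and adjust the weights.
    delta_out = (y - d) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ delta_out
    W1 -= lr * X.T @ delta_hid

print(f"final total error: {E:.4f}")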
In clinics, the judgment of disease is often determined by the integration
of multidimensional data. In the establishment of disease prediction models,
BP algorithms can not only effectively classify complex data but also have
good multifunctional mapping. The relationship between data and disease
can be found in the process of repeated iteration [46].
(2) Application Examples. Adaptive learning based on ANN can find
the law of medical development from the massive medical text data
and assist the discovery of medical knowledge. Heckerling et al.
[47] combined a neural network and genetic algorithm to predict
the prognosis of patients with urinary tract infections (as shown
in Figure 2). In this study, nine indexes (e.g., frequent micturition,
dysuria, etc.) from 212 women with urinary tract infections were
used as predictor variables for training. The relationship between
symptoms and urinalysis input data and urine culture output data
was determined using ANN. The predicted results were accurate.
Figure 2: ANN algorithm analysis process.


Miotto et al. [48] derived a general-purpose patient representation from aggregated EHRs based on ANN that facilitates clinical predictive modeling given the patient status. Armstrong et al. [49] used ANN to analyze 240 microcalcifications in 220 cases of mammography. Data mining results can accurately predict whether the microcalcification in the early stage of suspected breast cancer is benign or malignant.

Naive Bayes
Naive Bayes (NB) is a classification method based on Bayes' theorem [50]. The conditional independence hypothesis of the NB classification algorithm assumes that the attribute values are independent of each other and that the positions are independent of each other [51]. Attribute-value independence means there is no dependence between terms. The position independence hypothesis means that the position of a term in the document has no effect on the calculation of probability. However, conditional dependence does exist among terms in medical texts, and the location of terms in documents contributes differently to classification [52]. These two independence assumptions therefore weaken NB estimation on medical text. Nevertheless, NB has been widely used in medical texts because it plays an effective role in classification decision-making.
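As a minimal sketch of NB classification under the independence assumptions described above, the following snippet fits a Gaussian Naive Bayes model to a toy table of numeric clinical-style features; the feature values, labels, and use of scikit-learn are assumptions for illustration.

# A minimal NB sketch: each attribute is modeled independently given the class.
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Columns: age, systolic blood pressure, fasting glucose (toy values).
X = np.array([
    [45, 120, 5.2], [62, 150, 7.8], [50, 130, 5.6],
    [71, 160, 8.4], [38, 118, 5.0], [66, 155, 7.9],
])
y = np.array([0, 1, 0, 1, 0, 1])   # 0 = no disease, 1 = disease

clf = GaussianNB().fit(X, y)

# Predict a new case and show the class posterior probabilities.
new_case = np.array([[58, 148, 7.1]])
print(clf.predict(new_case), clf.predict_proba(new_case))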
(1) Core Algorithm: NBC4D. Naive Bayes classifier for continuous
variables using a novel method (NBC4D) is a new algorithm
based on NB. It classifies continuous variables into Naive
Bayes classes, replaces traditional distribution techniques with
alternative distribution techniques, and improves classification
accuracy by selecting appropriate distribution techniques [53].

The implementation of the NBC4D algorithm is mainly divided into five steps:
(1) Gaussian distribution: $f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$
(2) Exponential distribution: $f(x) = \frac{1}{\alpha}\exp\left(-\frac{x}{\alpha}\right)$
(3) Kernel density estimation: $\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-x_{i}}{h}\right)$
(4) Rayleigh distribution: $f(x) = \frac{x}{\theta^{2}}\exp\left(-\frac{x^{2}}{2\theta^{2}}\right)$
(5) NBC4D method: find the product of the probability (likelihood) of each attribute given a specific class and the probability of that class, and select the distribution that improves the accuracy.
Here x is the input value, μ is the mean value, σ² is the variance, α is the parameter that represents the average value (μ), θ represents the standard deviation (σ), K is the Gaussian kernel function, and h is the smoothing parameter.
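The following snippet is a loose sketch of the NBC4D idea of choosing among candidate distributions rather than always assuming a Gaussian: it scores one toy attribute per class under Gaussian, exponential, and Rayleigh fits and keeps the best-fitting one. It is illustrative only and not the authors' exact method [53]; the data are invented.

# A loose, illustrative sketch: pick the best-fitting distribution per class
# for one continuous attribute by comparing total log-likelihoods.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Toy attribute values for two classes (e.g., a lab measurement).
samples = {0: rng.normal(5.0, 0.8, 200), 1: rng.rayleigh(3.0, 200)}

candidates = {
    "gaussian":    lambda x: stats.norm(*stats.norm.fit(x)),
    "exponential": lambda x: stats.expon(*stats.expon.fit(x)),
    "rayleigh":    lambda x: stats.rayleigh(*stats.rayleigh.fit(x)),
}

for cls, x in samples.items():
    # Keep the distribution with the highest total log-likelihood on this class.
    best = max(candidates.items(), key=lambda kv: kv[1](x).logpdf(x).sum())
    print(f"class {cls}: best-fitting distribution = {best[0]}")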
(2) Application Examples. Behrouz Ehsani-Moghaddam et al. [54] adopted electronic medical records (EMRs) extracted from the Canadian primary care sentinel surveillance network, used the Naive Bayes algorithm to classify disease features, and found that the Naive Bayes classifier was an effective algorithm to help physicians diagnose Hunter syndrome and optimize patient management (as shown in Figure 3). In order to predict angiographic outcomes, Golpour et al. [55] used the NB algorithm to process the hospital medical records and assessment scale and found that the NB model with three variables had the best performance and could well support physician decision-making.

Figure 3: NB algorithm analysis process.

Decision Tree
The decision tree is a tree structure, in which each nonleaf node represents
a test on a feature attribute, each branch represents the output of the feature
attribute on a certain value domain, and each leaf node stores a category
[56]. The process of using a decision tree to make a decision is to start from
the root node, then test the corresponding characteristic attributes of the
items to be classified, select the output branch according to its value until it
reaches the leaf node, and finally take the category stored in the leaf node as
the decision result [57]. The advantages of decision tree learning algorithms include good interpretability, the ability to handle various data types (categorical and numerical), white-box modeling, robustness to noise, and the capacity to process large datasets. Medical text data are complex [58]. For instance, electronic medical record data include not only
disease characteristics but also patient age, gender, and other characteristic
data. Since the construction of decision tree starts from a single node, the
training data set is divided into several subsets according to the attributes
of the decision node, so the decision tree algorithm can deal with the data
types and general attributes at the same time, which has certain advantages
for the complexity of medical text data processing [59]. The construction
of a decision tree is mainly divided into two steps: classification attribute selection and tree pruning. The common algorithm is C4.5 [60].
(1) Core algorithm: C4.5. Several decision tree algorithms are
proposed such as ID3 and C4.5. The famous ID3 algorithm
proposed by Quinlan in 1986 has the advantages of clear theory,
simple method, and strong learning ability. The disadvantage is
that it is only effective for small datasets and sensitive to noise.
When the training data set increases, the decision tree may
change accordingly. When selecting test attributes, the decision
tree tends to select attributes with more values. In 1993, Quinlan
proposed the C4.5 algorithm based on the ID3 algorithm [61].
Compared with ID3, C4.5 overcomes the bias toward attributes with more values during attribute selection, prunes the tree during construction, and handles incomplete data. It uses the gain ratio as the selection standard for each node attribute in the decision tree [62]. In particular, its extension, called S-C4.5-SMOTE, can not only overcome the problem of data distortion but also improve overall system performance. Its mechanism aims to effectively reduce the amount of data without distortion by maintaining the balance of datasets and technical smoothness.

The processing formula is as follows:

$$\mathrm{Info}(S) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i), \qquad \mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{-\sum_{j}\left(|S_j|/|S|\right)\log_2\left(|S_j|/|S|\right)}, \qquad (2)$$

where n is the number of classes, p(x_i) represents the proportion of samples belonging to class x_i, A is the feature used to divide data set S, and |S_j|/|S| is the proportion of the number of samples in subset S_j to the total number of samples.
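As a minimal sketch of the selection criterion in equation (2), the following snippet computes the entropy of a toy label list and the gain ratio of a candidate splitting attribute; the example data are invented for illustration.

# A minimal sketch of entropy and gain ratio for a candidate splitting attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(attribute_values, labels):
    n = len(labels)
    cond = 0.0        # conditional entropy after splitting on the attribute
    split_info = 0.0  # split information (denominator in equation (2))
    for v in set(attribute_values):
        subset = [lab for a, lab in zip(attribute_values, labels) if a == v]
        w = len(subset) / n                      # |S_j| / |S|
        cond += w * entropy(subset)
        split_info -= w * log2(w)
    gain = entropy(labels) - cond
    return gain / split_info if split_info > 0 else 0.0

labels = ["sick", "sick", "healthy", "healthy", "sick", "healthy"]
fever  = ["yes",  "yes",  "no",      "no",      "yes",  "yes"]
print(round(entropy(labels), 3), round(gain_ratio(fever, labels), 3))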
(2) Application Examples. The decision tree algorithms can construct
specific decision trees for multiattribute datasets and get feasible
results in relative time. It can be used as a good method for data
classification in medical text data mining.
Byeon [63] used the C4.5 algorithm to develop a depression prediction model for Korean dementia caregivers based on a secondary analysis of the 2015 Korean Community Health Survey (KCHS) results, and the effective prediction rate was 70%. The overall research idea is shown in Figure 4.
Figure 4: C4.5 algorithm application flow.


Wei et al. [64] selected the reports from the Chinese spontaneous report database from 2010 to 2011 and used a decision tree to calculate the classification of adverse drug reactions (ADR) signals. Zheng et al. [65] adopted a decision tree algorithm to construct a basic data framework. A total of 300 records were randomly selected from the EHRs of 23,281 diabetic patients to classify the type of diabetes. The performance of the framework was good, and the classification accuracy was as high as 98%.

However, decision tree algorithms have difficulty dealing with missing values, and medical text data contain many missing values owing to their high complexity. Therefore, when various types of data are inconsistent, decision tree algorithms will produce information deviation and cannot obtain correct results.

Association Rules
Association rules are often sought for very large datasets, whose efficient
algorithms are highly valued. They are used to discover the correlations
from large amounts of data and reflect the dependent or related knowledge
between events and other events [66]. Medical text data contains a large
number of association data, such as the association between symptoms and
diseases and the relationship between drugs and diseases. Mining medical
text data using an association rule algorithm is conducive to discovering
the potential links in medical text data and promoting the development of
medicine. Association rules are expressions of the form X ⇒ Y. There are two key measures in the transaction database: (1) Support{X ⇒ Y}: the ratio of the number of transactions containing both X and Y to the number of all transactions; (2) Confidence{X ⇒ Y}: the ratio of the number of transactions containing both X and Y to the number of transactions containing X. Given a transaction data set, mining association rules means generating association rules whose support and confidence are greater than the minimum support and minimum confidence given by users, respectively.
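As a minimal sketch of these two measures, the following snippet computes the support and confidence of a toy rule X ⇒ Y over a small invented list of transactions (e.g., symptoms recorded per patient); the itemsets and values are assumptions for illustration.

# A minimal sketch of support and confidence for a rule X => Y.
transactions = [
    {"cough", "fever", "headache"},
    {"cough", "fever"},
    {"fever", "rash"},
    {"cough", "fever", "fatigue"},
    {"headache"},
]

X, Y = {"cough"}, {"fever"}

n_total = len(transactions)
n_x = sum(1 for t in transactions if X <= t)
n_xy = sum(1 for t in transactions if (X | Y) <= t)

support = n_xy / n_total   # fraction of transactions containing both X and Y
confidence = n_xy / n_x    # of transactions with X, the fraction also having Y
print(f"support(X=>Y) = {support:.2f}, confidence(X=>Y) = {confidence:.2f}")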
(1) Core Algorithm: Apriori. The apriori algorithm is the earliest and
the most classic algorithm. The iterative search method is used
to find the relationship between items in the database layer by
layer. The process consists of connection (class matrix operation)
and pruning (removing unnecessary intermediate results). In
this algorithm, an item set is a set of items, and a set containing K items is called a K-item set. Item set frequency is the number of transactions that contain the item set. If an item set satisfies the minimum support, it is called a frequent item set.
The Apriori algorithm is divided into two steps to find the largest item sets: (1) count the occurrence frequency of single-element item sets and keep those whose support is not less than the minimum support to form the one-dimensional maximum item set; (2) loop, joining and pruning candidates, until no larger maximum item set is generated.
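The two steps can be illustrated with a minimal level-wise sketch in Python; the toy transactions and minimum support threshold are assumptions, and the candidate-generation step is simplified compared with a full Apriori implementation.

# A minimal level-wise Apriori sketch: frequent 1-item sets first, then join
# and prune until no larger frequent item set is generated.
from itertools import combinations

transactions = [
    {"cough", "fever", "headache"},
    {"cough", "fever"},
    {"fever", "rash"},
    {"cough", "fever", "fatigue"},
]
min_support = 0.5
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

# Step 1: frequent 1-item sets that meet the minimum support.
items = {i for t in transactions for i in t}
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
frequent = list(level)

# Step 2: join frequent k-item sets into (k+1)-item candidates and prune.
k = 1
while level:
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
    level = [c for c in candidates if support(c) >= min_support]
    frequent.extend(level)
    k += 1

for itemset in frequent:
    print(set(itemset), round(support(itemset), 2))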
(2) Application Examples. Association rules are usually a data mining approach used to explore and interpret large transactional datasets to identify unique patterns and rules. They are often used to predict the correlation between index data and diseases. Exarchos et al. [67] proposed an automation method based on association rules, used an association rule algorithm to classify and model electrocardiographic (ECG) data, and monitored ischemic beats in ECG for a long time. In this study, the specific application process of association rules is shown in Figure 5.


Figure 5: Application process of association rules.


Hrovat et al. [68] combined association rule mining, which was designed
for mining large transaction datasets, with model-based recursive partitioning
to predict temporal trends (e.g., behavioral patterns) for subgroups of patients
based on discharge summaries. In the correlation analysis between adverse
drug reaction events and drug treatment, Chen et al. [69] used the apriori
algorithm to explore the relationship between adverse events and drug
treatment in patients with non-small-cell lung cancer, showing a promising
method to reveal the risk factors of adverse events in the process of cancer
treatment. In the association between drugs and diseases, Lu et al. [70] used
the apriori algorithm to find herbal combinations for the treatment of uremic
pruritus from Chinese herb bath therapy and explore the core drugs.

Model Evaluation
Classifications generated by data mining models on test sets are not necessarily optimal, which can lead to test set classification errors. In order to obtain a good data model, it is very important to evaluate the model. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are common evaluation methods in medical text data mining. The ROC curve has a y-axis of TPR (sensitivity, also called recall) and an x-axis of FPR (1 - specificity). The higher the TPR and the smaller the FPR, the better the model performs. AUC is defined as the area under the ROC curve, that is, the integral of the ROC curve, and its value is at most 1. If we randomly select a positive sample and a negative sample, the probability that the classifier ranks the positive sample higher than the negative sample is the AUC value. Pourhoseingholi et al. [71] used the AUC method to evaluate the prognosis model of rectal cancer patients and found that the prediction accuracy of random forest (RF) and BN models was high.
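As a minimal sketch of ROC/AUC evaluation, the following snippet computes the ROC curve and AUC for toy labels and scores with scikit-learn; the values are invented for illustration.

# A minimal sketch of ROC/AUC evaluation on toy labels and predicted scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.65, 0.7, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # x-axis: FPR, y-axis: TPR
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc:.3f}")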

DISCUSSION
Data mining is useful for extracting novel and usable information or knowledge from medical text data. This paper reviewed several research works that have been carried out for mining medical text data, organized around the four steps of data mining. It is beneficial for helping researchers to choose reasonable approaches for mining medical text data. However, some difficulties in medical text data mining also need to be considered.
First, the lack of a publicly available annotation database affects the development of data mining to a certain extent, owing to differences in medical information records and descriptions among countries. The information components are highly heterogeneous and the data quality is not uniform. Ultimately, this creates a key obstacle, an annotation bottleneck, in medical text data [72]. At present, the international standards include ICD (International Classification of Diseases), SNOMED CT (The Systematized Nomenclature of Human and Veterinary Medicine Clinical Terms), CPT (Current Procedural Terminology), DRG (Diagnosis-Related Groups), LOINC (Logical Observation Identifiers Names and Codes), MeSH (Medical Subject Headings), MDDB (Main Drug Database), and UMLS (Unified Medical Language System). There are still few corpora in the field of medical text. In the last 10 years, natural language processing has undergone a truly revolutionary paradigm shift, and many new technologies have been applied to the extraction of information from natural language. Many scholars have established a corpus for a particular disease. However, medical entities are closely related to one another; a single corpus cannot segment the data accurately, and keyword information is easily omitted.
Second, medical text records from different countries follow different conventions. For example, Ayurvedic medicine, traditional Arab Islamic medicine, and traditional Malay medicine from India, the Middle East, and Malaysia have problems such as inconsistent treatment descriptions, complex treatment methods, and difficulty in statistical analysis, leading to great difficulty in medical data mining [73]. At the same time, the informatization of traditional medicine is insufficient. For example, the traditional North American indigenous medical literature mainly involves clinical efficacy evaluation and disease application, and its recording methods are complicated, leading to difficulty in data mining [74]. Chinese medical texts have linguistic particularities: unlike English, Chinese words are not separated by spaces, which increases the difficulty of data analysis. In terms of semantics, Chinese medical texts suffer from polysemy, synonymy, ambiguity of expression, complex relationships, and a lack of clear correlation. Building a standard database from these data is very difficult and requires very advanced and complex algorithms.
In addition, electronic medical records contain personal privacy information, and clinical electronic medical record data will sometimes inevitably be used in medical text data mining. Therefore, the protection of patient privacy data is also an issue that requires attention in data mining.
In future work, we will attempt to establish and popularize medical text
data standards with the help of intelligent agents and construct publicly
available annotation databases for the mining of medical text data.

ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation
of China (81703825), the Sichuan Science and Technology Program
(2021YJ0254), and the Natural Science Foundation Project of the Education
Department of Sichuan Province (18ZB01869).

REFERENCES
1. R. J. Oskouei, N. M. Kor, and S. A. Maleki, “Data mining and medical
world: breast cancers’ diagnosis, treatment, prognosis and challenges
[J],” American journal of cancer research, vol. 7, no. 3, pp. 610–627,
2017.
2. Y. Zhang, S.-L. Guo, L.-N. Han, and T.-L. Li, “Application and
exploration of big data mining in clinical medicine,” Chinese Medical
Journal, vol. 129, no. 6, pp. 731–738, 2016.
3. B. Polnaszek, A. Gilmore-Bykovskyi, M. Hovanes et al., “Overcoming
the challenges of unstructured data in multisite, electronic medical
record-based abstraction,” Medical Care, vol. 54, no. 10, pp. e65–e72,
2016.
4. E. Ford, M. Oswald, L. Hassan, K. Bozentko, G. Nenadic, and J.
Cassell, “Should free-text data in electronic medical records be shared
for research? A citizens’ jury study in the UK,” Journal of Medical
Ethics, vol. 46, no. 6, pp. 367–377, 2020.
5. S. M. Ayyoubzadeh, S. M. Ayyoubzadeh, H. Zahedi, M. Ahmadi,
and S. R Niakan Kalhori, “Predicting COVID-19 incidence through
analysis of google trends data in Iran: data mining and deep learning
pilot study,” JMIR public health and surveillance, vol. 6, no. 2, Article
ID e18828, 2020.
6. X. Ren, X. X. Shao, X. X. Li et al., “Identifying potential treatments
of COVID-19 from Traditional Chinese Medicine (TCM) by using a
data-driven approach,” Journal of Ethnopharmacology, vol. 258, no.
1, Article ID 12932, 2020.
7. E. Massaad and P. Cherfan, “Social media data analytics on telehealth
during the COVID-19 pandemic,” Cureus, vol. 12, no. 4, Article ID
e7838, 2020.
8. J. Dong, H. Wu, D. Zhou et al., “Application of big data and artificial
intelligence in COVID-19 prevention, diagnosis, treatment and
management decisions in China,” Journal of Medical Systems, vol. 45,
no. 9, p. 84, 2021.
9. L. B. Moreira and A. A. Namen, “A hybrid data mining model for
diagnosis of patients with clinical suspicion of dementia [J],” Computer
Methods and Programs in Biomedicine, vol. 165, no. 1, pp. 39–49,
2018.
10. S. Vilar, C. Friedman, and G. Hripcsak, “Detection of drug-drug
interactions through data mining studies using clinical sources,
scientific literature and social media,” Briefings in Bioinformatics, vol.
19, no. 5, pp. 863–877, 2018.
11. H. S. Cha, T. S. Yoon, K. C. Ryu et al., “Implementation of
hospital examination reservation system using data mining
technique,” Healthcare informatics research, vol. 21, no. 2, pp. 95–
101, 2015.
12. B. L. Gudenas, J. Wang, S.-z. Kuang, A.-q. Wei, S. B. Cogill, and L.-j.
Wang, “Genomic data mining for functional annotation of human long
noncoding RNAs,” Journal of Zhejiang University - Science B, vol. 20,
no. 6, pp. 476–487, 2019.
13. R. S. Evans, “Electronic health records: then, now, and in the
future,” Yearbook of medical informatics, vol. Suppl 1, no. Suppl 1,
pp. S48–S61, 2016.
14. P. C. Austin, I. R. White, D. S. Lee, and S. van Buuren, “Missing data in
clinical research: a tutorial on multiple imputation,” Canadian Journal
of Cardiology, vol. 37, no. 9, pp. 1322–1331, 2021.
15. L. Yu, L. Liu, and K. E. Peace, “Regression multiple imputation for
missing data analysis,” Statistical Methods in Medical Research, vol.
29, no. 9, pp. 2647–2664, 2020.
16. P. C. Chang, C. L. Wang, F. C. Hsiao et al., “Sacubitril/valsartan vs.
angiotensin receptor inhibition in heart failure: a real‐world study in
Taiwan,” ESC heart failure, vol. 7, no. 5, pp. 3003–3012, 2020.
17. E. Tavazzi, S. Daberdaku, R. Vasta, C. Andrea, C. Adriano, and D.
C. Barbara, “Exploiting mutual information for the imputation of
static and dynamic mixed-type clinical data with an adaptive k-nearest
neighbours approach,” BMC Medical Informatics and Decision
Making, vol. 20, no. Suppl 5, p. 174, 2020.
18. A. Idri, I. Kadi, I. Abnane, and J. L. Fernandez-Aleman, “Missing
data techniques in classification for cardiovascular dysautonomias
diagnosis,” Medical, & Biological Engineering & Computing, vol. 58,
no. 11, pp. 2863–2878, 2020.
19. C. Wang, C. Yao, P. Chen, S. Jiamin, G. Zhe, and Z. Zheying,
“Artificial intelligence algorithm with ICD coding technology guided
by the embedded electronic medical record system in medical record
information management,” Journal of healthcare engineering, vol.
2021, Article ID 3293457, 9 pages, 2021.
20. K. Kreimeyer, M. Foster, A. Pandey et al., “Natural language processing
systems for capturing and standardizing unstructured clinical
information: a systematic review,” Journal of Biomedical Informatics,
vol. 73, pp. 14–29, 2017.
21. T. T. Kuo, P. Rao, C. Maehara et al., “Ensembles of NLP tools for
data element extraction from clinical notes,” AMIA Annual Symposium
proceedings AMIA Symposium, vol. 2016, pp. 1880–1889, 2017.
22. J. Jonnagaddala, S.-T. Liaw, P. Ray, M. Kumar, N.-W. Chang, and H.-
J. Dai, “Coronary artery disease risk assessment from unstructured
electronic health records using text mining,” Journal of Biomedical
Informatics, vol. 58, no. Suppl, pp. S203–S210, 2015.
23. G. Trivedi, E. R. Dadashzadeh, R. M. Handzel, W. C. Wendy, V. Shyam,
and H. Harry, “Interactive NLP in clinical care: identifying incidental
findings in radiology reports,” Applied Clinical Informatics, vol. 10,
no. 4, pp. 655–669, 2019.
24. S. Datta, E. V. Bernstam, and K. Roberts, “A frame semantic overview
of NLP-based information extraction for cancer-related EHR notes
[J],” Journal of Biomedical Informatics, vol. 100, no. 1, pp. 03–301,
2019.
25. K. Roberts and D. Demner-Fushman, “Annotating logical forms for
EHR questions [J]. LREC,” International Conference on Language
Resources & Evaluation: [proceedings] International Conference on
Language Resources and Evaluation, vol. 2016, no. 3, pp. 772–778,
2016.
26. S. Vashishth, D. Newman-Griffis, R. Joshi, D. Ritam, and P. R. Carolyn,
“Improving broad-coverage medical entity linking with semantic type
prediction and large-scale datasets,” Journal of Biomedical Informatics,
vol. 121, no. 10, pp. 38–80, 2021.
27. M. Topaz, L. Murga, O. Bar-Bachar, M. McDonald, and K. Bowles,
“NimbleMiner,” CIN: Computers, Informatics, Nursing, vol. 37, no.
11, pp. 583–590, 2019.
28. D. M. Maslove, T. Podchiyska, and H. J. Lowe, “Discretization of
continuous features in clinical datasets,” Journal of the American
Medical Informatics Association, vol. 20, no. 3, pp. 544–553, 2013.
29. P. Yildirim, L. Majnarić, O. Ekmekci, and H. Andreas, “Knowledge
discovery of drug data on the example of adverse reaction
prediction,” BMC Bioinformatics, vol. 15, no. Suppl 6, p. S7, 2014.
30. H. Ayatollahi, L. Gholamhosseini, and M. Salehi, “Predicting
coronary artery disease: a comparison between two data mining
algorithms,” BMC Public Health, vol. 19, no. 1, p. 448, 2019.
31. M. Reiser, B. Wiebner, B. Wiebner, and J. Hirsch, “Neural-network
analysis of socio-medical data to identify predictors of undiagnosed
hepatitis C virus infections in Germany (DETECT),” Journal of
Translational Medicine, vol. 17, no. 1, p. 94, 2019.
32. M. A. Rahman, B. Honan, T. Glanville, P. Hough, and K. Walker,
“Using data mining to predict emergency department length of stay
greater than 4 hours: d,” Emergency Medicine Australasia, vol. 32, no.
3, pp. 416–421, 2020.
33. J.-A. Lee, K.-H. Kim, D.-S. Kong, S. Lee, S.-K. Park, and K. Park,
“Algorithm to predict the outcome of mdh spasm: a data-mining
analysis using a decision tree,” World neurosurgery, vol. 125, no. 5,
pp. e797–e806, 2019.
34. A. Awaysheh, J. Wilcke, F. Elvinger, L. Rees, W. Fan, and K. L.
Zimmerman, “Review of medical decision support and machine-
learning methods,” Veterinary pathology, vol. 56, no. 4, pp. 512–525,
2019.
35. X. You, Y. Xu, J. Huang et al., “A data mining-based analysis of
medication rules in treating bone marrow suppression by kidney-
tonifying method [J]. Evidence-based complementary and alternative
medicine,” eCAM, no. 1, p. 907848, 2019.
36. A. Atashi, F. Tohidinezhad, S. Dorri et al., “Discovery of hidden
patterns in breast cancer patients, using data mining on a real data
set,” Studies in Health Technology and Informatics, vol. 262, no. 1, pp.
42–45, 2019.
37. Z. Luo, G. Q. Zhang, and R. Xu, “Mining patterns of adverse events
using aggregated clinical trial results [J],” AMIA Joint Summits
on Translational Science proceedings AMIA Joint Summits on
Translational Science, vol. 2013, no. 1, pp. 12–16, 2013.
38. X. Li, G. Liu, W. Chen, Z. Bi, and H. Liang, “Network analysis of
autistic disease comorbidities in Chinese children based on ICD-10
codes,” BMC Medical Informatics and Decision Making, vol. 20, no.
1, p. 268, 2020.
39. M. M. Liu, L. Wen, Y. J. Liu, C. Qiao, T. L. Li, and M. C. Yong,
“Application of data mining methods to improve screening for the
risk of early gastric cancer,” BMC Medical Informatics and Decision
Making, vol. 18, no. Suppl 5, p. 121, 2018.
40. Y.-c. Wu and J.-w. Feng, “Development and application of artificial
neural network,” Wireless Personal Communications, vol. 102, no. 2,
pp. 1645–1656, 2018.
41. A. Ramesh, C. Kambhampati, J. Monson, and P. Drew, “Artificial
intelligence in medicine,” Annals of the Royal College of Surgeons of
England, vol. 86, no. 5, pp. 334–338, 2004.
42. Y. Liang, Q. Li, P. Chen, L. Xu, and J. Li, “Comparative study of back
propagation artificial neural networks and logistic regression model in
predicting poor prognosis after acute ischemic stroke,” Open Medicine,
vol. 14, no. 1, pp. 324–330, 2019.
43. S. Y. Park and S. M. Kim, “Acute appendicitis diagnosis using artificial
neural networks [J]. Technology and health care,” Official Journal of
the European Society for Engineering and Medicine, vol. 23, no. Suppl
2, pp. S559–S565, 2015.
44. R.-J. Kuo, M.-H. Huang, W.-C. Cheng, C.-C. Lin, and Y.-H. Wu,
“Application of a two-stage fuzzy neural network to a prostate cancer
prognosis system,” Artificial Intelligence in Medicine, vol. 63, no. 2,
pp. 119–133, 2015.
45. L. Liu, T. Zhao, M. Ma, and Y. Wang, “A new gene regulatory network
model based on BP algorithm for interrogating differentially expressed
genes of Sea Urchin,” SpringerPlus, vol. 5, no. 1, p. 1911, 2016.
46. T. J. Cleophas and T. F. Cleophas, “Artificial intelligence for diagnostic
purposes: principles, procedures and limitations [J],” Clinical Chemistry
and Laboratory Medicine, vol. 48, no. 2, pp. 159–165, 2010.
47. P. Heckerling, G. Canaris, S. Flach, T. Tape, R. Wigton, and B. Gerber,
“Predictors of urinary tract infection based on artificial neural networks
and genetic algorithms,” International Journal of Medical Informatics,
vol. 76, no. 4, pp. 289–296, 2007.
48. R. Miotto, L. Li, and B. A. Kidd, “Deep patient: an unsupervised
representation to predict the future of patients from the electronic
health,” Records [J]. Scientific reports, vol. 6, no. 2, p. 6094, 2016.
49. A. J. Armstrong, M. S. Marengo, S. Oltean et al., “Circulating t cells
from patients with advanced prostate and breast cancer display both
epithelial and mm,” Molecular Cancer Research, vol. 9, no. 8, pp.
997–1007, 2011.
50. D. V. Lindley, “Fiducial distributions and bayes’ theorem,” Journal of
the Royal Statistical Society: Series B, vol. 20, no. 1, pp. 102–107,
1958.
51. S. Uddin, A. Khan, M. E. Hossain, and M. A. Moni, “Comparing different
supervised machine learning algorithms for disease prediction,” BMC
Medical Informatics and Decision Making, vol. 19, no. 1, p. 281, 2019.
52. H. H. Rashidi, N. K. Tran, E. V. Betts, P. H. Lydia, and G. Ralph,
“Artificial intelligence and machine learning in pathology: the present
landscape of supervised methods,” Academic pathology, vol. 6, 2019.
53. P. Yildirim and D. Birant, “Naive Bayes classifier for continuous variables
using novel method (NBC4D) and distributions,” in Proceedings of
the 2014 IEEE International Symposium on Innovations in Intelligent
Systems and Applications (INISTA), pp. 110–115, IEEE, Alberobello,
Italy, June 2014.
54. B. Ehsani-Moghaddam, J. A. Queenan, J. Mackenzie, and R. V.
Birtwhistle, “Mucopolysaccharidosis type II detection by Naïve
Bayes Classifier: an example of patient classification for a rare disease
using electronic medical records from the Canadian Primary Care
Sentinel Surveillance Network,” PLoS One, vol. 13, no. 12, Article ID
e0209018, 2018.
55. P. Golpour, M. Ghayour-Mobarhan, A. Saki et al., “Comparison
of support vector machine, naïve bayes and logistic regression for
assessing the necessity for coronary angiography,” International
Journal of Environmental Research and Public Health, vol. 17, no.
18, 2020.
56. D. Che, Q. Liu, K. Rasheed, X. Tao, and T. Xiuping, “Decision
tree and ensemble learning algorithms with their applications in
bioinformatics,” Advances in Experimental Medicine & Biology, pp.
191–199, 2011.
57. L. O. Moraes, C. E. Pedreira, S. Barrena, A. Lopez, and A. Orfao,
“A decision-tree approach for the differential diagnosis of chronic
lymphoid leukemias and peripheral B-cell lymphomas,” Computer
Methods and Programs in Biomedicine, vol. 178, pp. 85–90, 2019.
58. W. Oh, M. S. Steinbach, M. R. Castro et al., “Evaluating the impact of
data representation on EHR-based analytic tasks,” Studies in Health
Technology and Informatics, vol. 264, no. 2, pp. 88–92, 2019.
59. C. T. Nakas, N. Schütz, M. Werners, and A. B. Leichtle, “Accuracy and calibration of computational approaches for inpatient mortality predictive modeling,” PLoS One, vol. 11, no. 7, Article ID e0159046, 2016.
60. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
61. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, 1993.
62. G. Franzese and M. Visintin, “Probabilistic ensemble of deep
information networks,” Entropy, vol. 22, no. 1, p. 100, 2020.
63. H. Byeon, “Development of depression prediction models for
caregivers of patients with dementia using decision tree learning
algorithm,” International Journal of Gerontology, vol. 13, no. 4, pp.
314–319, 2019.
64. J.-X. Wei, J. Wang, Y.-X. Zhu, J. Sun, H.-M. Xu, and M. Li, “Traditional
Chinese medicine pharmacovigilance in signal detection: decision tree-
based data classification,” BMC Medical Informatics and Decision
Making, vol. 18, no. 1, p. 19, 2018.
65. T. Zheng, W. Xie, L. Xu et al., “A machine learning-based framework to
identify type 2 diabetes through electronic health records,” International
Journal of Medical Informatics, vol. 97, pp. 120–127, 2017.
66. R. Veroneze, T. Cruz, S. Corbi et al., “Using association rule mining
to jointly detect clinical features and differentially expressed genes
related to chronic inflammatory diseases,” PLoS One, vol. 15, no. 10,
Article ID e0240269, 2020.
67. T. P. Exarchos, C. Papaloukas, D. I. Fotiadis, and L. K. Michalis, “An
association rule mining-based methodology for automated detection of
ischemic ECG beats,” IEEE Transactions on Biomedical Engineering,
vol. 53, no. 8, pp. 1531–1540, 2006.
68. G. Hrovat, G. Stiglic, P. Kokol, and M. Ojsteršek, “Contrasting
temporal trend discovery for large healthcare databases,” Computer
Methods and Programs in Biomedicine, vol. 113, no. 1, pp. 251–257,
2014.
69. W. Chen, J. Yang, H. L. Wang, Y. F. Shi, H. Tang, and G. H. Li, “Discovering associations of adverse events with pharmacotherapy in patients with non-small cell lung cancer using modified Apriori algorithm,” BioMed Research International, vol. 2018, no. 12, Article ID 1245616, 10 pages, 2018.
70. P. H. Lu, J. L. Keng, K. L. Kuo, F. W. Yu, C. T. Yu, and Y. K. Chan,
“An Apriori algorithm-based association rule analysis to identify herb
combinations for treating uremic pruritus using Chinese herbal bath
therapy,” Evidence-based Complementary and Alternative Medicine:
eCAM, vol. 2020, no. 8, Article ID 854772, 9 pages, 2020.
71. M. Mlakar, P. E. Puddu, M. Somrak, S. Bonfiglio, and M. Luštrek,
“Mining telemonitored physiological data and patient-reported
outcomes of congestive heart failure patients,” PLoS One, vol. 13, no.
3, Article ID e0190323, 2018.
72. I. Spasic and G. Nenadic, “Clinical text data in machine learning:
systematic review,” JMIR Medical Informatics, vol. 8, no. 3, Article
ID e17984, 2020.
73. R. R. R. Ikram, M. K. A. Ghani, and N. Abdullah, “An analysis of
application of health informatics in Traditional Medicine: a review of
four Traditional Medicine Systems,” International Journal of Medical
Informatics, vol. 84, no. 11, pp. 988–996, 2015.
74. N. Redvers and B. s. Blondin, “Traditional Indigenous medicine in
North America: a scoping review,” PLoS One, vol. 15, no. 8, Article
ID e0237531, 2020.
Chapter 13

Data Mining in Electronic Commerce: Benefits and Challenges

Mustapha Ismail, Mohammed Mansur Ibrahim, Zayyan Mahmoud Sanusi, Muesser Nat
Management Information Systems Department, Cyprus International University, Haspolat, Lefkoşa via Mersin, Turkey

ABSTRACT
Huge volumes of structured and unstructured data, nowadays called big data, provide opportunities for companies, especially those that use electronic commerce (e-commerce). The data is collected from customers’ internal processes, vendors, markets and the business environment. This paper presents a data mining (DM) process for e-commerce including the three common algorithms: association, clustering and prediction. It also highlights some of the benefits of DM to e-commerce companies in terms of merchandise planning, sales forecasting, basket analysis, customer relationship management and market segmentation, which can be achieved

Citation: Ismail, M., Ibrahim, M., Sanusi, Z. and Nat, M. (2015), “Data Mining in Electronic Commerce: Benefits and Challenges”. International Journal of Communications, Network and System Sciences, 8, 501-509. doi: 10.4236/ijcns.2015.812045.
Copyright: © 2015 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0.
with the three data mining algorithms. The main aim of this paper is to review the application of data mining in e-commerce by focusing on structured and unstructured data collected through various resources and cloud computing services in order to justify the importance of data mining. Moreover, this study evaluates certain challenges of data mining such as spider identification, data transformations and making data models comprehensible to business users. Other challenges, such as supporting the slowly changing dimensions of data and making data transformation and model building accessible to business users, are also evaluated. Finally, the paper provides a clear guide for e-commerce companies sitting on huge volumes of data, so that they can easily exploit the data for business improvement, which in return will make them highly competitive among their competitors.

Keywords: Data Mining, Big Data, E-Commerce, Cloud Computing

INTRODUCTION
Data mining in e-commerce is all about integrating statistics, databases and artificial intelligence together with related subjects to form a new idea or a new integrated technology for the purpose of better decision making. Data mining as a whole is believed to be a good promoter of e-commerce. Presently, applying data mining to e-commerce has become a hot topic among businesses [1] . Data mining in cloud computing is the process of extracting structured information from unstructured or semi-structured web data sources. From a business point of view, the core concept of cloud computing is to render computing resources in the form of services to users, who buy them whenever they are in demand [2] . The end product of data mining creates an avenue for decision makers to track their customers’ purchasing patterns, demand trends and locations, making their strategic decisions more effective for the betterment of their business. This can bring down the cost of inventory together with other expenses and maximize the overall profit of the company.
With the wide availability of the Internet, 21st century companies highly utilize online tools and technologies for various reasons. Today many companies buy and sell through e-commerce, and the need for developing e-commerce applications by experts who take responsibility for running and maintaining the services is increasing. When businesses grow, the resources required for e-commerce maintenance may increase beyond the level the enterprise can handle. In that regard, data mining can be used to handle e-commerce enterprise services and explore patterns of online customers so that companies can boost sales and the general productivity of the business [3] . However, the cost of running such services is a challenge to almost all e-commerce companies. Therefore, cloud computing becomes a game changer in the way and manner companies transact their businesses by offering comprehensive, scalable and flexible services over the Internet.
Cloud computing provides a new breakthrough for enterprises, offering a service model that includes network storage, new information resource sharing, on-demand access to information and processing mechanisms. It is possible to provide data mining software via cloud computing, which gives e-commerce companies the opportunity to centralize their software management and data storage with absolute assurance of reliability, efficiency and protected services to their users, which in turn cuts their cost and increases their profit [4] .
Cloud computing is a technology that has to do with accessing products and services in the cloud without shouldering the burden of hosting or delivering these services. It can also be viewed as a “model that enhances a flexible on-demand network access to a shared pool of configurable computing resources like networks, servers, storage applications and services that can be speedily provisioned and released with minimal management effort or service provider interaction”. In cloud computing, everything is considered as a service. There are three service delivery models of cloud computing, namely: Infrastructure as a Service (IaaS), which is responsible for fundamental computing resources like storage, processing and networks, as well as some standardized services over the networks; Platform as a Service (PaaS), which provides abstractions together with services for developing, testing, hosting and maintaining applications in a complex, developed environment; and Software as a Service (SaaS), where the entire application or service is delivered over the web through a browser or via an application programming interface (API). With this last service model, consumers only need to focus on administering users of the system.
One of the most important applications of cloud computing is its storage capability. Cloud storage can cluster different types of storage equipment by employing cluster systems, grid technology or distributed systems in the network to provide external data storage and access services through software applications. Cloud computing in e-commerce is the idea of paying for bandwidth and storage space on a scale that depends on usage. It is much more of an on-demand utility whereby a user pays per use. Most e-commerce companies welcome the idea as it eliminates the high cost of storage for large volumes of business data by keeping it in cloud data centers. The platform also gives the opportunity to use e-commerce business applications, e.g. B2B and B2C, with a smaller investment. Some other advantages of cloud computing for e-commerce include the following: cost effectiveness, speed of operations, scalability and security of the entire service [3] [4] .
The association between cloud computing and data mining is that the cloud is used to store the data on servers, while data mining is offered as a service in a client-server relationship; however, ethical issues such as privacy and individuality may be violated when information is collected in this way [5] .
Considering the importance of data mining for today’s companies, this paper discusses the benefits and challenges of data mining for e-commerce companies. Furthermore, it reviews the process of data mining in e-commerce together with the common types of databases and cloud computing in the field of e-commerce.

DATA MINING
Data mining is the process of discovering meaningful patterns and correlations by sifting through large amounts of data stored in repositories. There are several tools for this, which include abstractions, aggregations, summarization and characterization of data [6] . In the past decade, data mining has changed the e-commerce business. Data mining is not specific to one type of data: it can be applied to any type of information source; however, algorithms and tactics may differ when applied to different kinds of data, and the challenges presented by different types of data vary. Data mining is used with many forms of databases such as flat files, data warehouses, object-oriented databases, etc.
This paper concentrates on relational databases. A relational database consists of a set of tables containing either values of entity attributes or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key [6] . The most commonly used query language for relational databases is SQL, which allows users to manipulate and retrieve data stored in the tables. Data mining algorithms using relational databases can be more versatile than data mining algorithms specifically written for flat files. Data mining can benefit from SQL for data selection, transformation and consolidation [7] .
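As an illustration of how SQL can support the selection and consolidation step before mining, the following minimal sketch uses Python's built-in sqlite3 module and a hypothetical orders table invented for this example; it consolidates raw transaction rows into per-customer features that a mining algorithm could then consume.

```python
import sqlite3

# Build a small in-memory relational database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("alice", "printer", 120.0), ("alice", "paper", 15.0),
     ("bob", "printer", 110.0), ("bob", "cartridge", 45.0),
     ("carol", "paper", 12.0)],
)

# SQL handles selection and consolidation: one feature row per customer.
rows = conn.execute(
    """
    SELECT customer,
           COUNT(*)    AS num_orders,
           SUM(amount) AS total_spent,
           AVG(amount) AS avg_order_value
    FROM orders
    GROUP BY customer
    ORDER BY total_spent DESC
    """
).fetchall()

for customer, num_orders, total_spent, avg_value in rows:
    print(customer, num_orders, total_spent, round(avg_value, 2))
```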
There are several core techniques in data mining that are used to build data mining applications. The most common techniques are as follows [8] [9] :
1) Association Rules: Association rule mining is among the most important methods of data mining. The essence of this method is extracting interesting correlations and associations among sets of items in transactional databases or other data pools. Association rules are used extensively in various areas. A typical association rule has an implication of the form A→B, where A is an item set and B is an item set that contains only a single atomic condition [10] (a small worked example is given after this list).
2) Clustering: This is the organisation of data into classes; it refers to grouping similar objects to form one or more classes. In clustering, class labels are unknown and it is up to the clustering algorithm to discover acceptable classes, which is why clustering is sometimes called unsupervised classification: the classification is not dictated by given class labels. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects [10] .
3) Prediction: Prediction has attracted substantial attention given the possible consequences of successful forecasting in a business context. There are two types of prediction. The first is predicting unavailable data values; the second is that, as soon as a classification model is formed on a training set, the class label of an object can be predicted based on the attribute values of the object. Prediction most often refers to the forecast of missing numerical values [10] .
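As a minimal sketch of the association-rule idea (not the exact procedure of any tool mentioned in this paper), the snippet below counts item-set support and rule confidence for rules of the form A→B over a few invented shopping baskets; the threshold value is also assumed for illustration.

```python
from itertools import combinations

# Hypothetical transaction data: each basket is a set of purchased items.
baskets = [
    {"printer", "paper", "cartridge"},
    {"printer", "paper"},
    {"paper", "pen"},
    {"printer", "cartridge"},
    {"paper", "pen", "notebook"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

# Candidate rules A -> B with a single-item consequent.
items = set().union(*baskets)
for a, b in combinations(sorted(items), 2):
    for antecedent, consequent in [({a}, b), ({b}, a)]:
        sup_a = support(antecedent)
        sup_ab = support(antecedent | {consequent})
        if sup_a > 0 and sup_ab >= 0.4:          # assumed minimum support threshold
            confidence = sup_ab / sup_a
            print(f"{antecedent} -> {consequent}: "
                  f"support={sup_ab:.2f}, confidence={confidence:.2f}")
```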

SOME COMMON DATA MINING TOOLS


1) Weka: Obtaining accurate data mining results requires the right tool for the dataset being mined. Weka gives users the ability to apply a wide range of learning algorithms in practice. The tool has many benefits, as it includes all the standard data mining procedures such as data pre-processing, clustering, association, classification, regression and attribute selection. It has both Java and non-Java versions together with a visualization application, and the tool is free, so users can customize it to their own specification [11] [12] .
2) NLTK: This is mainly for language processing tasks and comes with a pool of different tools for machine learning, data mining, sentiment analysis, data scraping and other language processing tasks. NLTK requires a user to install the tool on their system to have access to the full package. It is built in Python, and a user can build applications on top of it and adapt the tool to their own specification [11] .
3) Spider Miner: This is a data mining tool, written in the Java programming language, that does not require a user to write code. Part of the tool’s capability is that it provides thorough analytics via template-based frameworks. It is a very flexible and user-friendly tool offered as a service; apart from data mining functions, it also offers visualization, prediction, data pre-processing, deployment, statistical modelling and evaluation functions. The tool includes learning schemes, algorithms and models from WEKA and R scripts, which makes it even more powerful [12] . All three of the tools mentioned above are open source.

DATA MINING IN E-COMMERCE


Data mining in e-commerce is a vital way of repositioning an e-commerce company by supporting the enterprise with the required information concerning the business. Recently, most companies have adopted e-commerce and possess big data in their data repositories. The only way to get the most out of this data is to mine it to improve decision making and enable business intelligence. In e-commerce data mining there are three important processes that data must pass through before turning into knowledge or application. Figure 1 shows the steps for data mining in e-commerce.
Figure 1: Data mining process in e-commerce [16] .
The first and easiest process of data mining is data preprocessing, which is actually a step before the data mining itself, whereby the data is cleaned by removing unwanted data that has no relation to the required analysis. This step boosts the performance of the entire data mining process: the accuracy of the data will be higher and the time needed for the actual mining will be reduced considerably. This is usually the case if the company already has an existing target data warehouse; if not, the preprocessing stage of selecting, cleaning and transforming the data can consume at least 80% of the overall effort [13] .
Mining patterns is the second step, and it refers to the techniques or approaches used to develop recommendation rules or to develop a model out of a large data set. It can also be referred to as the techniques or algorithms of data mining. The most common patterns used in e-commerce are prediction, clustering and association rules.
The purpose of the third step, pattern analysis, is to verify and shed more light on the discovered model in order to give a clear path for applying the data mining results. The analysis lays much emphasis on the statistics and rules of the pattern used, by observing them after multiple users have accessed them [14] .
However, all this depends on how iterative the overall process is and on the interpretation of the visual information obtained at each sub-step. In general, the data mining process therefore iterates over the following five basic steps (a brief sketch of such a pipeline is given after this list):
•	Data selection: This step is all about identifying the kind of data to be mined, the goals for mining it and the necessary tools to enable the process. At the end of it, the right input attributes and output information to represent the task are chosen.
•	Data transformation: This step is all about organising the data based on the requirements by removing noise, converting one type of data to another, normalising the data if there is a need to, and also defining the strategy to handle missing data.
•	Data mining step per se: The transformed data is mined using any of the techniques to extract patterns of interest. The miner can only obtain a sound data mining method by performing the preceding steps correctly.
•	Result interpretation and validation: For a better understanding of the data and its synthesised knowledge, together with its validity span, the robustness is checked by testing the data mining application. The information retrieved can also be evaluated by comparing it with earlier expertise in the application domain.
•	Incorporation of the discovered knowledge: This has to do with presenting the results of the discovered knowledge to decision makers, so that it is possible to compare or check/resolve conflicts with earlier extracted knowledge and so that newly discovered patterns can be applied [15] .
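The following minimal sketch, with invented column names and a simple threshold rule standing in for a real mining algorithm, shows how the five steps above can be strung together in code; it illustrates the flow only and is not the procedure of any specific tool.

```python
import random

random.seed(0)

# 1) Data selection: pick the attributes relevant to the task (invented fields).
raw_records = [{"age": random.randint(18, 70),
                "spend": random.uniform(5, 500),
                "noise_field": "ignore-me"} for _ in range(200)]
selected = [{"age": r["age"], "spend": r["spend"]} for r in raw_records]

# 2) Data transformation: normalise spend to the range [0, 1].
max_spend = max(r["spend"] for r in selected)
for r in selected:
    r["spend_norm"] = r["spend"] / max_spend

# 3) Data mining step per se: a toy rule standing in for a learned model.
def predict_high_value(record):
    return record["age"] > 40 and record["spend_norm"] > 0.5

labels = [predict_high_value(r) for r in selected]

# 4) Result interpretation and validation: inspect simple summary statistics.
rate = sum(labels) / len(labels)
print(f"Predicted high-value customers: {rate:.1%}")

# 5) Incorporation of the discovered knowledge: hand flagged records to
#    decision makers, e.g. for a marketing campaign.
flagged = [r for r, hit in zip(selected, labels) if hit]
print(f"{len(flagged)} records flagged for follow-up")
```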

BENEFITS OF DATA MINING IN E-COMMERCE


Application of data mining in e-commerce refers to the possible areas in the field of e-commerce where data mining can be utilised for the purpose of business enhancement. While visiting an online store for shopping, users normally leave behind certain facts that companies can store in their databases. These facts represent unstructured or structured data that can be mined to provide a competitive advantage to the company. The following are areas where data mining can be applied in the field of e-commerce for the benefit of companies:
1) Customer Profiling: This is also known as a customer-oriented strategy in e-commerce. It allows companies to use business intelligence, through the mining of customer data, to plan their business activities and operations as well as to develop new research on products or services for a prosperous e-commerce. Identifying the customers with great purchasing potential from the visiting data can help companies lessen sales costs [17] . Companies can use users’ browsing data to identify whether they are purposefully shopping, just browsing, or buying something they are familiar with or something new. This helps companies to plan and improve their infrastructure [18] .
2) Personalization of Service: Personalization is the act of providing content and services geared to individuals on the basis of information about their needs and behavior. Data mining research related to personalization has focused mostly on recommender systems and related subjects such as collaborative filtering. Recommender systems have been explored intensively in the data mining community. These systems can be divided into three groups: content-based, social data mining and collaborative filtering. Such systems are trained on explicit or implicit feedback from users, which is usually represented as a user profile. Social data mining, which considers sources of data created by groups of individuals as part of their daily activities, can be an important source of information for companies. Alternatively, personalization can be achieved with the aid of collaborative filtering, where users are matched with others of similar interests and, in the same vein, the preferences of these users are used to make recommendations [19] .
3) Basket Analysis: Every shopper’s basket has a story to tell, and market basket analysis (MBA) is a common retail, analytic and business intelligence tool that helps retailers to know their customers better. There are different ways to get the best out of market basket analysis, and these include:
–	Identification of product affinities: tracking not-so-apparent product affinities and leveraging them is the real challenge in retail. Walmart customers purchasing Barbie dolls show an affinity towards one of three candy bars; obscure connections such as this can be discovered with advanced market basket analytics for planning more effective marketing efforts.
–	Cross-sell and up-sell campaigns: these show the products purchased together, so customers who purchase a printer can be persuaded to pick up high quality paper or premium cartridges.
–	Planograms and product combos: these are used for better inventory control based on product affinities, developing combo offers and designing effective, user-friendly planograms focused on products that sell together.
–	Shopper profiles: analyzing market baskets with the aid of data mining over time gives a glimpse of who the shoppers really are, with insight into their ages, income range, buying habits, likes and dislikes, and purchase preferences; leveraging this improves the customer experience [19] .
4) Sales Forecasting: Sales forecasting involves looking at the time an individual customer takes to buy an item and, in this process, trying to predict whether the customer will buy again. This type of analysis can be used to determine a strategy of planned obsolescence or to figure out complementary products to sell. In sales forecasting, cash flow can be projected in three ways: pessimistic, optimistic and realistic. This helps in planning an adequate amount of capital to endure the worst possible scenario if sales do not go as planned [19] .
5) Merchandise Planning: Merchandise planning is useful for both online and offline retail companies. In the case of online business, merchandise planning will help to determine stocking options and inventory warehousing, while in the case of offline companies, businesses that are looking to grow by adding stores can assess the required amount of merchandise they will need by looking at the exact layout of the current store [20] .
Using the right approach to merchandise planning will lead to answers on what to do about:
•	Pricing: mining the database helps determine the best-suited price for products or services by revealing customer sensitivity.
•	Deciding on products: data mining provides e-commerce businesses with insight into which products customers actually desire, including intelligence on competitors’ merchandise.
•	Balancing of stocks: mining the retail database helps determine the right, specific amount of stock needed, i.e. not too much and not too little, throughout the business year and also during the buying seasons.
6) Market Segmentation: Customer segmentation is one of the best uses of data mining. The large amount of data obtained can be broken down into different, meaningful segments such as income, age, gender and occupation of customers, and these can be used when companies are running email marketing campaigns or SEO strategies. Market segmentation can also help a company identify its own competitors. This information alone can help the retail company realize that the periodic respondents are not the only ones competing for the same customer money as the present company [21] .
Segmenting the database of a retail company will improve conversion rates, as the company can focus its promotion on a close-fitting and highly wanted market. It also helps the retail company understand the competitors involved in each segment, in the process permitting the customization of products that will actually satisfy the target audience (a rough illustration of segment-based analysis is given below) [21] .
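As a rough illustration of segment-based analysis (with invented customer records and invented band boundaries, not data from any study cited here), the snippet below groups customers into simple age and spend bands and reports the size and average spend of each segment, the kind of breakdown that email campaigns or promotions could then target.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customer records: (age, annual_spend).
customers = [(22, 180.0), (27, 320.0), (35, 540.0), (41, 610.0),
             (44, 150.0), (52, 720.0), (58, 90.0), (63, 430.0)]

def segment(age, spend):
    """Assign a coarse segment label from an age band and a spend band."""
    age_band = "young" if age < 35 else "middle" if age < 55 else "senior"
    spend_band = "high" if spend >= 400 else "low"
    return f"{age_band}/{spend_band}"

groups = defaultdict(list)
for age, spend in customers:
    groups[segment(age, spend)].append(spend)

for label, spends in sorted(groups.items()):
    print(f"{label:15s} n={len(spends)} avg_spend={mean(spends):.0f}")
```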

CHALLENGES OF DATA MINING IN E-COMMERCE


Besides the benefits, data mining also presents challenges for e-commerce companies, which are as follows:
1) Spider Identification: As is commonly known, the main aim of data mining is to convert data into useful knowledge. The main source of data for e-commerce companies is web pages. Therefore, it is critical for e-commerce companies to understand how search engines work in order to follow how quickly things happen, how they happen and when changes will show up in the search engines. Spiders are software programs that are sent out by the search engine to find new information; they are also called bots or crawlers. A spider is a software program that a search engine uses to request pages and download them. It may come as a surprise to some people, but what the search engine does is use a link on an existing website to find a new website and request a copy of that page to download it to its server. This is what the search engine runs its ranking algorithm against, and that is what shows up in the search engine result page. Therefore, the challenge here is that the search engines need to download a correct copy of the website: the e-commerce website needs to be readable and seeable before the algorithm is applied to the search engine’s database. Tools need mechanisms that automatically remove unwanted data from what will be transformed into information, so that data mining algorithms can provide reliable and sensible output [22] .
2) Data Transformations: Data transformation poses a challenge for data mining tools. The data needed for transformation can be obtained from two different sources: an active, operational system from which the data warehouse is built, and activities that involve assigning new columns, binning data and aggregating the data. The first source needs to be modified only infrequently, that is, when there is a change in the site, while the set of transformed data presents a significantly greater challenge in the data mining process [22] .
3) Scalability of Data Mining Algorithms: With Yahoo having over 1.2 billion page views in a day, the presence of large amounts of data raises significant scalability issues:
•	Due to the large data sizes gathered from websites, the data mining algorithm must be able to process the data in a reasonable time, especially because many algorithms scale nonlinearly.
•	The models that are generated tend to be too complicated for individuals to understand and interpret [22] .
4) Make Data Mining Models Comprehensible to Business Users: The results of data mining should be clearly understood by business users, from the merchandisers who are in charge of decision making, to the creative designers who design the sites, to the marketers who spend advertising money. The challenge is to design and define extra model types and a strategic way to present them to business users: what regression models can we come up with and how can we present them? (Even linear regression is usually hard for business users to understand.) How can we present nearest-neighbor models, for example? How can we present the results of association rule algorithms without overwhelming users with tens of thousands of rules? [22] .
5) Support Slowly Changing Dimensions: The demographic attributes of visitors change: they may get married, their salaries or incomes may increase, their children grow, and the needs on which the model is based change. Likewise, product attributes also change: new choices may become available, the design and the way the product or service is packaged may change, and quality may improve or degrade. These attributes that change over time are often known as “Slowly Changing Dimensions”. The main challenge here is to keep track of those changes and, in the same vein, to provide support for the identified changes in the analysis [2] .
6) Make Data Transformation and Model Building Accessible to Business Users: Providing definite answers to the questions of individual business users requires data transformations together with a technical understanding of the tools used in the analysis. Many commercial report designers and online analytical processing (OLAP) tools are basically hard for business users to understand. In this case, two preferred solutions are (i) the provision of templates (e.g. online analytical processing cubes and recommended transformations for mining) for the expected questions and (ii) the provision of experts via consultation or even a service organization. The challenge basically is to find a way to enable business users to analyze the information themselves without any hiccups [2] .
SUMMARY AND CONCLUSION


Data mining for e-commerce companies should no longer be a privilege but a requirement in order to survive and remain relevant in the competitive environment. On one hand, data mining offers a number of benefits to e-commerce companies and allows them to do merchandise planning, analyze customers’ purchasing behaviors and forecast their sales, which in turn places them above other companies and generates more revenue. On the other hand, there are certain challenges of data mining in the field of e-commerce, such as spider identification, data transformation, scalability of data mining algorithms, making data mining models comprehensible to business users, supporting slowly changing dimensions, and making data transformation and model building accessible to business users.
The data collected about customers and their transactions, which is the greatest asset of e-commerce companies, needs to be used consciously for the benefit of the companies. For such companies, data mining plays an important role in providing customer-oriented services to increase customer satisfaction. It has become apparent that utilizing data mining tools is a necessity for e-commerce companies in this globally competitive environment. Although the complexity and granularity of the mentioned challenges differ, e-commerce companies can overcome these problems by using and applying the right techniques. For example, developing an e-commerce website in a way that search engines can read and access the latest version of the website helps companies to overcome the search engine spider identification problem.
Another hot topic in e-commerce data mining is cloud computing, which is also covered in this paper. While the need for data mining tools is growing every day, the need to integrate them with cloud computing becomes more pressing. It is obvious that making good use of cloud computing technology in e-commerce helps the effective use of resources and reduces costs for companies, which enables efficient data mining.
REFERENCES
1. Cao, L., Li, Y. and Yu, H. (2011) Research of Data Mining in Electronic
Commerce. IEEE Computer Society, Hebei.   
2. Bhagyashree, A. and Borkar, V. (2012) Data Mining in Cloud
Computing. Multi Conference (MPGINMC-2012). http://reserach.
ijcaonline.org/ncrtc/number6/mpginme1047.pdf   
3. Rao, T.K.R.K., Khan, S.A., Begun, Z. and Divakar, Ch. (2013)
Mining the E-Commerce Cloud: A Survey on Emerging Relationship
between Web Mining, E-Commerce and Cloud Computing.
IEEE International Conference on Computational Intelligence
and Computing Research, Enathi, 26-28 December 2013, 1-4.
http://dx.doi.org/10.1109/iccic.2013.6724234   
4. Wu, M., Zhang, H. and Li, Y. (2013) Data Mining Pattern Valuation
in Apparel Industry E-Commerce Cloud. IEEE 4th International
Conference on Software Engineering and Service Science (ICSESS),
689-690.   
5. Srinniva, A., Srinivas, M.K. and Harsh, A.V.R.K. (2013) A Study on
Cloud Computing Data Mining. International Journal of Innovative
Research in Computer and Communication Engineering, 1, 1232-1237.   
6. Carbone, P.L. (2000) Expanding the Meaning and Application of Data
Mining. International Conference on Systems, Man and Cybernetics,
3, 1872-1873. http://dx.doi.org/10.1109/icsmc.2000.886383   
7. Barry, M.J.A. and Linoff, G.S. (2004) On Data Mining Techniques
for Marketing, Sales and Customer Relationship Management.
Indianapolis Publishing Inc., Indiana.   
8. Pan, Q. (2011) Research of Data Mining Technology in Electronic
Commerce. IEEE Computer Society, Wuhan, 12-14 August 2011, 1-4.
http://dx.doi.org/10.1109/icmss.2011.5999185   
9. Verma, N., Verma, A., Rishma and Madhuri (2012) Efficient and
Enhanced Data Mining Approach for Recommender System.
International Conference on Artificial Intelligence and Embedded
Systems (ICAIES2012), Singapore, 15-16 July 2012.   
10. Han, J. and Kamber, M. (2006) Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco.   
11. News Stack (2015). http://thenewstack.io/six-of-the-best-open-source-
data-mining-tools/   
12. Witten, I.H. and Frank, E. (2014) The Morgan Kaufmann Series
on Data Mining Management Systems: Data Mining. 2nd Edition,
Publisher Morgan Kaufmann, San Francisco, 365-528.   
13. Liu, X.Y. And Wang, P.Z. (2008) Data Mining Technology and Its
Application in Electronic Commerce. IEEE Computer Society, Dalian,
12-14 October 2008, 1-5.   
14. Zeng, D.H. (2012) Advances in Computer Science and Engineering. Springer Heidelberg, New York.   
15. Ralph, K. and Caserta, J. (2011) The Data Warehouse ETL Toolkit:
Practical Techniques for Extraction, Cleaning, Conforming and
Delivering Data. Wiley Publishing Inc., USA.   
16. Michael, L.-W. (1997) Discovering the Hidden Secrets in Your Data—
The Data Mining Approach to Information. Information Research, 3.
http://informationr.net/ir/3-2/   
17. Li, H.J. and Yang, D.X. (2006) Study on Data Mining and Its Application
in E-Business. Journal of Gansu Lianhe University (Natural Science),
No. 2006, 30-33.   
18. Raghavan, S.N.R. (2005) Data Mining in E-Commerce: A Survey.
Sadhana, 30, 275-289. http://dx.doi.org/10.1007/BF02706248   
19. Michael, J.A.B. and Gordon, S.L. (1997) Data Mining Techniques: For
Marketing and Sales, and Customer Relationship Management. 3rd
Edition, Wiley Publishing Inc., Canada.   
20. Wang, J.-C., David, C.Y. and Chris, R. (2002) Data Mining Techniques
for Customer Relationship Management. Technology in Society, 24,
483-502.   
21. Christos, P., Prabhakar, R. and Jon, K. (1998) A Microeconomic View of Data Mining. Data Mining and Knowledge Discovery, 2, 311-324. http://dx.doi.org/10.1023/A:1009726428407   
22. Yahoo (2001) Second Quarter Financial Report. Yahoo Inc., California.  
Chapter 14

Research on Realization of Petrophysical Data Mining Based on Big Data Technology
Yu Ding1,2, Rui Deng2,3, Chao Zhu4
1 School of Computer Science, Yangtze University, Jingzhou, China
2 Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze University), Ministry of Education, Wuhan, China
3 School of Geophysics and Oil Resource, Yangtze University, Wuhan, China
4 The Internet and Information Center, Yangtze University, Jingzhou, China
ABSTRACT
This paper studies an interpretation method for realizing data mining on large-scale petrophysical data, which adopts a distributed architecture, cloud computing technology and the B/S mode with reference to big data technology and data mining methods. Based on the application of K-means clustering analysis to petrophysical data mining, it elaborates the practical significance of applying big data technology in the well logging field, which

Citation: Ding, Y., Deng, R. and Zhu, C. (2018), “Research on Realization of Petrophysical Data Mining Based on Big Data Technology”. Open Journal of Yangtze Oil and Gas, 3, 1-10. doi: 10.4236/ojogas.2018.31001.
Copyright: © 2018 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
also provides a scientific reference for logging interpretation work and broadens the application of data analysis and processing methods.

Keywords: Big Data Technology, Data Mining, Logging Field Method

INTRODUCTION
With the increasing scale of oil exploration and the development of the engineering field, the application of high-tech logging tools is becoming more and more extensive, and structured, semi-structured and unstructured complex types of oil and gas exploration data have exploded. In this paper, petrophysical data is taken as the object; big data technology and data mining methods are used for data analysis and processing, which mines effective and available knowledge to assist routine interpretation work and to broaden the scientific ways of enhancing interpretation precision. The research gives full play to the great potential of logging interpretation for the comparative study of geologic laws and oil and gas prediction.
The rapid development of network and computer technology, as well as the large-scale use of database technology, makes it possible to extract effective information from petrophysical data in more diverse ways for logging interpretation. Relying on the traditional database query mechanism and mathematical statistical analysis methods, it is difficult to process large-scale data effectively. The data often contains a lot of valuable information, but it cannot be used efficiently because the data is in an isolated state and cannot be transformed into useful knowledge applied to logging interpretation work. Too much useless information will inevitably lead to the loss of information distance [1] and useful knowledge, resulting in the “rich information and lack of knowledge” dilemma [2] .

ANALYSIS OF BIG DATA MINING OF PETROPHYSICAL DATA

Processing Methods of Big Data


Big data can be characterized on the basis of data scale: it is difficult to use existing software tools and mathematical methods to analyze and process, in a reasonable time, data that has the features of large scale, complex structure and many types [3] .
At present, the amount of petrophysical data is gradually increasing and its types are becoming more diverse, which is consistent with the basic
characteristics of big data. With the advantages of cloud computing in data processing performance and the good characteristics of a distributed architecture, the existing C/S-mode interpretation method is transformed into a B/S mode on the basis of a distributed architecture. Then, the situation in which the processing capacity of the original single client node is insufficient can be handled by horizontally scaling the individual processing nodes and node servers under rational allocation and use of system resources. Meanwhile, an on-line method is adopted for the analysis and processing of the petrophysical data, which allows the data mining results and the analysis process to be stored on the server. Interpreters can then query the server side for interpretation process documents to make a more reasonable explanation of the logging data in unknown areas or under the same type of geological conditions, which achieves the change of data sharing from the lower stage (data sharing) to the advanced stage (knowledge sharing).
The essence of big data processing methods can be seen as the development and extension of grid computing and earlier distributed computing. The significance of big data processing does not just lie in the amount of data, but in the fact that valuable information can be gained quickly and effectively from these massive available data resources, available patterns can be mined, and the purpose of acquiring new knowledge can be achieved.

Overview of Big Data Technology

Distributed System Architecture


A distributed file system is mainly used to achieve data access between the local underlying file system and the upper-level file system. It is a software system built on the network with a high degree of cohesion and transparency. The distributed system architecture can be considered as a software architecture design that operates on multiple processors. This paper chooses the HDFS open-source distributed file system to build the software operating environment [4] .
The HDFS system architecture shown in Figure 1 adopts a master/slave architecture, and an HDFS cluster is composed of a Namenode and a number of Datanodes. The Namenode is used to manage the namespace of the file system and to handle client access to files. The Datanode nodes are used to manage the storage on their nodes and to process the read and write requests of the file system clients.

Figure 1: HDFS system architecture.
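As a small hedged illustration of how an application might stage petrophysical data files into such an HDFS cluster before mining (assuming the standard `hdfs dfs` command-line client is installed and on the PATH; the local file path and HDFS directory below are invented for this sketch and are not from the authors' setup), one option is to drive the client from Python:

```python
import subprocess

# Invented local file and HDFS target directory for this sketch.
local_file = "well_logs/sz_well_01.csv"
hdfs_dir = "/data/petrophysics/raw"

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and raise if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Create the target directory (with parents) and upload the file.
hdfs("-mkdir", "-p", hdfs_dir)
hdfs("-put", "-f", local_file, hdfs_dir)

# List the directory to confirm the upload.
hdfs("-ls", hdfs_dir)
```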

Cloud Computing Technology


Cloud computing is Internet-based computing, which was put forward in the context of the development bottlenecks of traditional computer storage technology and computing capacity (Figure 2) [5] [6] . By sharing hardware resources and information among the cluster network nodes, large-scale parallel and distributed computing can be achieved to enhance the overall computing power of the system. Combined with the content of this paper, cloud computing is applied to the mining of petrophysical data, which can meet the computing requirements of the mining algorithm and solve the problem of insufficient processing capacity of the client nodes in the traditional C/S mode; this is the basis for conversion to the B/S distributed online processing mode.
Figure 2: Cloud computing architecture.

The Combination and Application of Data Mining Methods

Clustering Mining Method


Data clustering is one of the important tasks of data mining. Through clustering, it is possible to clearly identify the inter-class and intra-class regions of a data set, which helps in understanding the global distribution pattern and in discovering correlations between data attributes [7] .
In the pattern space S, given N samples X1, X2, ∙∙∙, XN, clustering is defined as finding corresponding regions R1, R2, ∙∙∙, Rm according to the degree of mutual similarity, such that any Xi (i = 1, 2, ∙∙∙, N) is classified into only one region rather than two classes at the same time, to wit, $R_i \cap R_j = \varnothing$ for $i \neq j$ and $\bigcup_{i=1}^{m} R_i = S$ [8] . Clustering analysis is mainly based on some features of the data set to achieve a division according to specific requirements or rules, which normally satisfies the following two characteristics: intra-class similarity, namely that data items in the same cluster should be as similar as possible; and inter-class dissimilarity, namely that data items in different clusters should be as different as possible [9] .

Petrophysical Data Clustering Mining Analysis


At present, the analysis and accurate description of sedimentary facies, subfacies and microfacies for favorable reservoir facies zones is an important task in current oilfield exploration and development. The study of sedimentary facies is carried out on the basis of composition, structure and sedimentary parameters under the guidance of phase patterns and phase sequences. Petrophysical data contains much potential stratigraphic information, and the lithology of the strata often leads to certain differences in the sampling values of the logging curves. These differences can be seen as the combined effect of many factors, such as the lithological mineral composition, its structure and the fluid properties contained in the pores. Because of this, a particular logging physical value also implies a particular lithology of the corresponding strata. Coupled with differences in formation period and background, the combination of the inherent physical characteristics of rock strata in different geological periods and some random noise is then used to achieve the purpose of lithological and stratigraphic division.

MINING BASED ON K-MEANS CLUSTERING ANALYSIS

K-Means Algorithm Principle


Assuming that there is a set of elements, the goal of K-means is to divide the elements of the set into K clusters or classes so that the elements within each cluster have a high degree of similarity while the similarity between elements of different clusters is low; that is, similar elements are clustered into a collection, eventually forming multiple clusters of feature-similar elements [10] .
K-means first randomly selects K objects from the n data objects as the initial clustering centers, while the remaining data objects are clustered by calculating the similarity (distance) of each object to the clustering centers (minimum distance between the two points) and assigning each object to the nearest class; it then recalculates the new cluster center of each class (the mean of all data objects in the cluster) and uses the updated centers as the class centers for the next iteration. The clustering process repeats until the criterion function begins to converge.
In this paper, the Euclidean distance is taken as the discriminant condition of the similarity measure, and the criterion function Er is defined as the sum of squared errors of all the data objects with respect to their class centers. Obviously, the purpose of the K-means algorithm is to find the K-partition of the data set that is optimal with respect to the criterion function.
$$E_r = \sum_{i=1}^{K} \sum_{X \in C_i} \left\| X - \bar{C}_i \right\|^2 \qquad (1)$$

Here, $X$ represents a data object in the data set; $C_i$ represents the ith cluster, and $\bar{C}_i$ represents the mean of cluster $C_i$.
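The following minimal sketch (plain Python, with small invented two-dimensional samples standing in for pairs of logging-curve values; it is not the authors' production code) implements the iteration described above: assign each sample to the nearest center by Euclidean distance, recompute each center as the mean of its cluster, and stop when the centers no longer move.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on a list of (x, y) tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # random initial cluster centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:              # converged: centers unchanged
            break
        centers = new_centers
    return centers, labels

# Invented two-curve samples (e.g. natural gamma and another normalised log value).
samples = [(20, 0.10), (22, 0.15), (25, 0.12),
           (80, 0.60), (85, 0.62), (78, 0.58),
           (140, 0.90), (150, 0.95), (145, 0.93)]
centers, labels = kmeans(samples, k=3)
print("centers:", centers)
print("labels:", labels)
```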

Lithological Division Based on K-Means


The logging physical values of the same lithology within the same layer are relatively stable and generally do not exceed an allowable error. The mean value of the samples in the same layer can be used to represent the overall true value of the similar parts of the surrounding lithology. When the difference between the value of an adjacent sampling point and the mean is within the given error range, the lithological type of the point can be replaced by the lithology corresponding to the mean. Otherwise, the search for the home class continues until the division of all sampling points is completed. In order to facilitate the study, this paper selects the natural gamma logging curve, with its strong longitudinal resolution, for the division of lithology, while the other passive curves are used to adjust the division results so as to improve the accuracy of the decisions once the lithological division is completed.
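A rough sketch of this mean-based assignment rule (with an invented natural gamma sample sequence and an assumed error tolerance; not the authors' implementation) might look as follows: a new layer is started whenever a sample deviates from the running layer mean by more than the allowed error.

```python
# Invented natural gamma samples along depth and an assumed tolerance.
gr_samples = [21.0, 22.5, 20.8, 79.5, 81.0, 80.2, 143.0, 141.5]
tolerance = 10.0

layers = []                      # each layer is a list of sample values
current = [gr_samples[0]]
for value in gr_samples[1:]:
    layer_mean = sum(current) / len(current)
    if abs(value - layer_mean) <= tolerance:
        current.append(value)    # same layer: within the allowed error
    else:
        layers.append(current)   # close the layer and start a new one
        current = [value]
layers.append(current)

for i, layer in enumerate(layers, 1):
    print(f"layer {i}: n={len(layer)}, mean={sum(layer)/len(layer):.1f}")
```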
For any two points in the plane, (X1, Y1) and (X2, Y2), the Euclidean distance is as follows:

$$d = \sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2} \qquad (2)$$
Here, Figure 3 is taken as an example to show the clustering process of K-means on petrophysical data. In Figure 3(a), the black triangles are plotted in a two-dimensional space with two-dimensional feature vectors as coordinates. They can be regarded as instances described by two-dimensional data (composed of the data of two logging curves), that is, the primitive petrophysical data set in need of clustering. Three boxes of different colors represent the clustering center points (analogous to certain lithologies) given by random initialization. Figure 3(b) shows the result after clustering is completed, that is, the goal of lithological division is achieved. Figure 3(c) shows the trajectory of the centroids during the iterative process.
Figure 3: Clustering process of petrophysical data.


The program flow chart is shown in Figure 4 as follows.

Figure 4: Program flow chart.


Software Implementation

Distributed Architecture and Cloud Computing Environment


Hadoop operates in three modes: stand-alone, pseudo-distributed and fully distributed. Taking into account the test environment required for the simulation software operation and the main content of this study, combined with the methods applied, the test environment adopts Hadoop’s fully distributed mode, in which VMware vSphere 5.5 is used to build another two virtual machines with the CentOS 6 Linux system on a high-performance server equipped with CentOS 6, and the distributed computing is done by three nodes in the cluster (Table 1). Different from physical nodes, the cluster nodes are composed using software virtualization, and their actual performance during operation differs.

Table 1: Description of the hosts and terminals in the cluster

Host type            Host name   OS          IP address     Node type
terminal             localhost   Windows 7   10.102.10.35   -
host machine         test.com    CentOS 6    10.211.6.1     -
virtual machine_1    master      CentOS 6    10.211.40.7    master
virtual machine_2    slave_1     CentOS 6    10.211.40.8    slave
virtual machine_3    slave_2     CentOS 6    10.211.40.9    slave

Application and Analysis


A total of three production wells in the SZ development zone of an oilfield are selected. Conventional logging interpretation pretreatment is completed by using the collected core material from the core sections of the well walls, relatively complete logging data, and geological and drilling data combined with the actual geological conditions. Samples possibly affected by borehole diameter, too large a proportion of mud or too high a viscosity, which lead to distortion of the logging instrument's measurement curves, are screened so that the selected sample data truly reflect the strata information. Then, according to the description of reservoir performance and the actual division of the corresponding oil and gas standards, the lithology of the working area is divided into four distinct classes, namely sandstone, argillaceous sandstone, sandy mudstone and mudstone, in combination with the core material.
After the K-means algorithm and the lithological judgment conditions are programmed, the B/S mode and cloud computing technology are used to run the petrophysical data mining program that divides the well-section lithology on the cluster in the fully distributed Hadoop simulation environment. The overall accuracy of the lithology division is about 78%, and the accuracy for sandstone and mudstone is relatively high, at more than 85%; the results are shown in Figure 5.

Figure 5: Data mining result.


The data in Figure 5 show that identical SPLI values indicate that the points belong to the same layer, i.e., lithological consistency or similarity as seen from the results of artificial stratification. Compared with the data mining results, owing to differences in the value of the empirical coefficient across stratigraphic ages, the division results based on the discriminant conditions differ at some logging sampling points. According to the correction process using the core data, the result is related to the value of the empirical coefficient. For a certain section of stratum, the value of 2 may give a relatively high degree of coincidence; similarly, the value of 3.7 gives a relatively high degree of coincidence for some layers. This also shows that the general selection of a single empirical coefficient may affect the accuracy of the interpretation results. On the one hand, viewed from the upper and lower adjacent types, the results are identified correctly as belonging to the same kind of lithology. On the other hand, comparing the results of the left and right divisions, the lithological division has changed while the corresponding SPLI values are not exactly the same, indicating that the data are worth further fine study.
Therefore, in-depth study of inconsistent results of lithological division can help to find valuable, unexpected patterns in petrophysical data. Compared with experimental methods and empirical methods, this method of “letting the data talk”, from which potential correlations are explored and knowledge is extracted, leads to more objective and scientific induction and summary of regional empirical parameter values and empirical formulas.
Figure 6 shows the time consumed by executing the same program on a stand-alone node and in the distributed environment, which is 616,273 ms and 282,697 ms respectively. Here, factors such as algorithm optimization, compiler selection and hardware performance differences are considered comprehensively and described only from a qualitative perspective: the use of distributed computing can significantly reduce the time spent on large-scale data processing and improve the overall performance of the system to a certain extent; thus, the feasibility of using big data technology to realize petrophysical data mining is verified.

Figure 6: Running time of program in Windows and HDFS.

CONCLUSIONS
1) The advantages of distributed architecture and cloud computing are used to improve the overall processing capacity of the system, and in the process of large-scale petrophysical data processing, the B/S mode is integrated to achieve data mining, combining the big data analysis and processing mechanism with conventional interpretation. An exploratory research idea for a new method of logging interpretation is put forward, with the discovery of novel knowledge as its starting point, to provide a scientific reference for widening the routine data analysis methods applied in interpretation work.
2) The combination of multidisciplinary knowledge and the rational application of cross-disciplinary technology can remedy the deficiencies in existing logging interpretation to a certain extent, making the qualitative analysis and quantitative computation of logging data more scientific, with favorable theoretical and practical guidance significance.

ACKNOWLEDGEMENTS
This work is supported by the Yangtze University Open Fund Project of the Key Laboratory of Exploration Technologies for Oil and Gas Resources, Ministry of Education (K2016-14).
SECTION 4:
INFORMATION PROCESSING METHODS
Chapter 15

Application of Spatial Digital Information Fusion Technology in Information Processing of National Traditional Sports

Xiang Fu, Ye Zhang, and Ling Qin

School of Physical Education, Guangdong Polytechnic Normal University, Guangzhou 510000, China

ABSTRACT
The rapid development of digital informatization has led to an increasing
degree of reliance on informatization in various industries. Similarly, the
development of national traditional sports is also inseparable from the
support of information technology. In order to improve the informatization
development of national traditional sports, this paper studies the fusion
process of multisource vector image data and proposes an adjustment and
merging algorithm based on topological relationship and shape correction
for the mismatched points that constitute entities with the same name. The
algorithm is based on the topological relationship and corrects the shape of the entities during adjustment and merging. Finally, a national traditional sports information processing system is constructed by combining the digital information fusion technology. Experiments prove that the method of national traditional sports information processing based on digital information fusion technology proposed in this paper is effective and can play a good role in the digital development of national traditional sports.

Citation: Xiang Fu, Ye Zhang, Ling Qin, “Application of Spatial Digital Information Fusion Technology in Information Processing of National Traditional Sports”, Mobile Information Systems, vol. 2022, Article ID 4386985, 10 pages, 2022. https://doi.org/10.1155/2022/4386985.

Copyright: © 2022 by Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION
National traditional sports, as the carrier of China’s excellent traditional
culture, has been preserved in the state of a living fossil after changes
in the times and social development. The 16th National Congress of the
Communist Party of China regards the construction of socialist politics,
economy, and culture with Chinese characteristics as the basic program
of the primary stage of socialism and cultural prosperity as an important
symbol of comprehensive national strength. Therefore, cultural construction
will also be a focus of today’s new urbanization process. Achieving cultural development in urbanization requires a medium, and traditional national sports are a good one. Traditional national sports are multifunctional, encompassing fitness, entertainment, and education. Moreover, their content is rich, their forms are diversified and eclectic, and they suit people of all ages, men, women, and children. It can be said that traditional national sports activities are an
indispensable part of people’s lives.
An information platform is an environment created for the construction,
application, and development of information technology, including the
development and utilization of information resources, the construction of
information networks, the promotion of information technology applications,
the development of information technology and industries, the cultivation of
information technology talents, and the formulation and improvement of
information technology policy systems [1]. The network platform has the
greatest impact and the best effect in the construction of the information
platform. Therefore, it is necessary to make full use of network technology,
communication technology, control technology, and information security
technology to build a comprehensive and nonprofit information platform for
the traditional Mongolian sports culture. The information platform includes
organizations, regulatory documents, categories, inheritors, news updates,
columns, performance videos, and protection forums. Moreover, it uses text,
pictures, videos, etc., to clearly promote and display the unique ethnicity of
Mongolian traditional sports culture in terms of clothing, event ceremonies,
etiquette, techniques, customs, and historical inheritance. In addition, its
display has been circulated and fused with ethnic imprints of different
historical periods, regions, nations, and classes. The information platform
can popularize relevant knowledge, push and forward messages, publish hot
topics and comments, and enhance the interaction between users. Therefore,
the platform has become a carrier for the public to obtain knowledge of
Mongolian national traditional sports culture and a shortcut for consultation
and communication [2].
The emergence and application of computer technology mark the
beginning of a new digital era for human beings, which has greatly changed
the existing way of life and the form of information circulation [3].
In-depth study of the main problems and obstacles in the development of digital sports in China and active exploration of development paths and strategies will not only promote the rapid development of the sports performance market, the sports fitness market, and the sports goods market but also expand the space for sports development, and will have a certain impact on improving the physical fitness of the whole population, cultivating reserve sports talent, and enriching the sports and cultural life of the masses. At the same time, advanced sports culture, healthy sports lifestyles, and scientific exercise methods will be integrated into the daily life of the public, attracting more people to participate in sports activities, experience the vitality of sports, and enjoy the joy of sports. Eventually, this will realize the great goal of transforming a major sports country into a sports power and complete the leap-forward development of the nationwide fitness industry and the sports industry.
This article combines digital information fusion technology to construct
the national traditional sports information processing system, so as to improve
the development effect of national traditional sports in the information age.

RELATED WORK
Integrating “digitalization” into sports can understand the concept of “digital
sports” from both narrow and broad directions [4]. In a broad sense, digital
sports is a new physical exercise method that combines computer information
technology with scientific physical exercise content and methods. It can help
exercisers improve their sports skills, enhance physical fitness, enrich social
leisure life, and promote the purpose of spiritual civilization construction. In
a narrow sense, digital sports is a related activity that combines traditional
sports with modern digital means. Through advanced digital technology,
traditional sports exercises are reformed and sublimated, so as to achieve
the purpose of scientifically disseminating sports knowledge and effectively
improving physical skills [5]. Digital sports is a brand-new concept. It
realizes the perfect combination of digital game form with competitive
fitness, physical exercise, and interactive entertainment through technical
means such as the Internet, communications, and computers [6]. It comes
from the combination of traditional sports and digital technology [7]. At the
same time, digital sports also involves cross fields such as cultural content,
computer information, and sports.
The emergence of digital sports has freed the public from the venue limitations of traditional physical exercise. The general public is no longer restricted to large, purpose-built sports venues and can make the fullest use of existing spaces such as community open space, small squares, street roads, and parks. For example, the Wii, a digital home game console sold by Nintendo of Japan, features an unprecedented stick-shaped motion controller, the “Wii Remote”, together with classic sports games. It uses the Wii Remote’s motion sensing and pointing positioning to detect flips and movements in three-dimensional space and complete “somatosensory operation.” The Wii Remote serves as a fishing rod, baton, tennis racket, drum stick, and other tools in different games, helping players complete exercise through somatosensory operations such as shooting, chopping, swinging, and spinning [8].
constraints of the geographical environment can not only increase the
enthusiasm of all people to participate in sports but also expand the existing
sports population. What is more important is that digital sports can break free
from geographical constraints and no longer be restricted by past stadiums
[9].
Digital sports are conducive to meeting the sports participation needs
of different groups of people. For companies that develop digital sports,
whoever first captures the digital sports market of special groups such as
middle-aged and elderly, children, and women will occupy the commanding
heights on the battlefield of digital sports [10]. The emergence of digital sports
brings advanced and scientific sports training methods and exercise content
to ordinary sports enthusiasts, changing the disadvantages of the past biased
research on young people and more satisfying the needs of different groups
such as children, the elderly, and women. At the same time, its appearance can
also help different sports hobby groups set multilevel exercise goals, find the
best exercise plan, and form a serialized and intelligent digital sports service
system [11]. Regardless of the age of the participants, high or moderate
weight, and female or male, digital sports methods will provide them
with the most suitable activity method to help different sports enthusiasts
complete exercise and demonstrate the charm of sports [12]. Digital sports
deeply analyzes the activity habits or exercise methods of the elderly,
women, children, and other special groups and provides more suitable sports
services for every sports enthusiast [13]. Through local computing, digital sports accurately locates and perceives the personalized and unstructured data of different audience groups and comprehensively analyzes and processes this information in a short period of time, forming a portable mobile device for each sports group, in order to identify the real needs of more sports audiences and put forward effective exercise suggestions that help different exercise groups reach their best exercise state [14]. Through
the connection of the bracelet and the digital sports terminal, the public can
also see the comparison chart of the comprehensive sports data of different
participating groups more intuitively, assist the public to set personalized
sports goals, and urge each athlete to complete their own exercise volume.
In the end, every exerciser’s exercise method and exercise effect will be
improved scientifically and reasonably over time [15].

SPACE DIGITAL FUSION TECHNOLOGY


This article applies spatial digital fusion technology to the national sports
information processing. Combining the reality and needs of national sports,
this article analyzes the spatial digital integration technology. First, the
digital coordinate system is established.
The mathematical formula for transforming digital coordinates to spatial
rectangular coordinates is shown in formula (1) [16].

X = (N + H) cos B cos L
Y = (N + H) cos B sin L        (1)
Z = [N(1 − e²) + H] sin B

Among them, B is the latitude of the earth, L is the longitude of the earth, H is the height of the earth, (X, Y, Z) are the rectangular coordinates of the space, N = a/√(1 − e² sin²B) is the radius of curvature of the ellipsoid in the prime vertical, and e = √(a² − b²)/a is the eccentricity of the ellipse (a and b represent the long and short radii of the ellipse, respectively) [17].
When converting from spatial rectangular coordinates to digital
coordinates, the geodetic longitude can be obtained directly. However, the
calculation of the geodetic latitude B and the geodetic height H is more
complicated, and it is often necessary to use an iterative method to solve the
problem. From formula (1), the iterative formula can be solved as

B = arctan[(Z + N e² sin B) / √(X² + Y²)]        (2)

In the iterative process, the initial value is B0 = arctan[Z / √(X² + Y²)]. According to formula (2), B can be obtained after approximately four iterations, and then H can be obtained from H = √(X² + Y²)/cos B − N.
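The conversions of formulas (1) and (2) can be sketched as follows; the WGS-84 ellipsoid constants are an assumption made here purely for illustration, and the initial value and iteration count follow the description above.

import math

# WGS-84 ellipsoid constants (assumed only for this example).
A = 6378137.0                      # semi-major axis a (m)
B_AXIS = 6356752.3142              # semi-minor axis b (m)
E2 = (A**2 - B_AXIS**2) / A**2     # first eccentricity squared e^2

def blh_to_xyz(B, L, H):
    """Formula (1): geodetic (B, L, H) in radians/metres -> (X, Y, Z)."""
    N = A / math.sqrt(1.0 - E2 * math.sin(B) ** 2)   # prime-vertical radius
    X = (N + H) * math.cos(B) * math.cos(L)
    Y = (N + H) * math.cos(B) * math.sin(L)
    Z = (N * (1.0 - E2) + H) * math.sin(B)
    return X, Y, Z

def xyz_to_blh(X, Y, Z, iterations=4):
    """Formula (2): iterative inverse; about four iterations suffice."""
    L = math.atan2(Y, X)
    p = math.hypot(X, Y)
    B = math.atan2(Z, p)                             # initial value (e^2 taken as 0)
    for _ in range(iterations):
        N = A / math.sqrt(1.0 - E2 * math.sin(B) ** 2)
        B = math.atan2(Z + N * E2 * math.sin(B), p)
    N = A / math.sqrt(1.0 - E2 * math.sin(B) ** 2)
    H = p / math.cos(B) - N
    return B, L, H

# Round-trip check with an arbitrary point.
x, y, z = blh_to_xyz(math.radians(23.13), math.radians(113.26), 50.0)
print(xyz_to_blh(x, y, z))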
Figure 1 shows two spatial rectangular coordinate systems O1-X1Y1Z1 and O2-X2Y2Z2. Among them, the same point in the two rectangular coordinate systems has the following correspondence [18]:

[X2 Y2 Z2]ᵀ = [ΔX0 ΔY0 ΔZ0]ᵀ + (1 + k) R(εx, εy, εz) [X1 Y1 Z1]ᵀ        (3)

Figure 1: Conversion between two spatial rectangular coordinate systems.


Among them, there are

(4)

where formula (4) gives the rotation matrix R(εx, εy, εz); (ΔX0, ΔY0, ΔZ0) is the coordinate translation parameter, (εx, εy, εz) is the coordinate rotation parameter, and k is the coordinate scale coefficient. In practical applications, the conversion parameters in the above Cartesian coordinate conversion relation can be determined by the least squares method using the coordinates of common points.
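As an illustration of how the conversion parameters could be determined by least squares from common points, the sketch below uses one common linearized (small-angle) formulation; the sign convention of the skew matrix and the synthetic coordinates are assumptions made for the example, not the authors' procedure.

import numpy as np

def estimate_seven_params(src, dst):
    """
    Least squares estimate of translation (dX, dY, dZ), small rotation angles
    (ex, ey, ez) and scale k from common points given in both systems.
    Linearized model (one common sign convention, for illustration only):
        dst - src ~= T + k*src + S*src, with S the small-angle skew matrix.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    A = np.zeros((3 * n, 7))
    l = (dst - src).reshape(-1)
    for i, (x, y, z) in enumerate(src):
        r = 3 * i
        # columns: dX, dY, dZ, ex, ey, ez, k
        A[r]     = [1, 0, 0, 0.0, -z,   y,  x]
        A[r + 1] = [0, 1, 0, z,   0.0, -x,  y]
        A[r + 2] = [0, 0, 1, -y,  x,   0.0, z]
    params, *_ = np.linalg.lstsq(A, l, rcond=None)
    return params

def apply_seven_params(pts, params):
    """Apply the same linearized transformation to a set of points."""
    dX, dY, dZ, ex, ey, ez, k = params
    pts = np.asarray(pts, dtype=float)
    S = np.array([[0.0,  ez, -ey],
                  [-ez, 0.0,  ex],
                  [ ey, -ex, 0.0]])
    return pts + np.array([dX, dY, dZ]) + k * pts + pts @ S.T

# Tiny synthetic check: recover invented parameters from four common points.
src = np.array([[4.0e6, 3.0e6, 2.0e6],
                [4.1e6, 2.9e6, 2.2e6],
                [3.9e6, 3.1e6, 1.9e6],
                [4.2e6, 3.2e6, 2.1e6]])
true = np.array([15.0, -10.0, 5.0, 1e-6, -2e-6, 3e-6, 2e-6])
dst = apply_seven_params(src, true)       # generated with the same linear model
print(np.round(estimate_seven_params(src, dst), 8))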
In mathematics, projection refers to the establishment of a one-to-
one mapping relationship between two point sets. Image projection is to
express the graticule on the sphere of the earth onto a plane in accordance
with a certain mathematical law. A one-to-one correspondence function is
established between the digital coordinates (B,L) of a point on the ellipsoid
and the rectangular coordinates (x,y) of the corresponding point on the
image. The general projection formula can be expressed as [19]

x = F1(B, L),  y = F2(B, L)        (5)

In the formula, (B, L) are the digitized coordinates (latitude, longitude) of a point on the ellipsoid, and (x, y) are the rectangular coordinates of the point projected on the plane.
The transformation of the positive solution of Gaussian projection is
as follows: given the digitized coordinates (B,L), the plane rectangular
coordinates (x,y) under the Gaussian projection are solved. The formula is
shown in

(6)

In the formula, X represents the arc length of the meridian from the equator to latitude B, N represents the radius of curvature of the prime vertical circle, L0 represents the longitude of the origin (central meridian), l = L − L0 represents the difference between the longitude of the ellipsoid point and the corresponding central meridian, and the auxiliary variables a, b, and e′, respectively, represent the long radius, short radius, and second eccentricity of the reference ellipsoid.
The inverse solution transformation of the Gaussian projection is
as follows: the plane rectangular coordinates (x,y) under the Gaussian
projection are known, and the digitized coordinates (B,L) are solved. The
calculation formula is shown in [20]

(7)

Among them, the variable Bf represents the footpoint latitude, that is, the latitude value corresponding to the meridian arc length x calculated from the equator. The longitude and latitude values obtained by
the inverse Gaussian transformation method are actually a relative quantity,
which is the difference between the longitude and latitude relative to the
lower left corner of the figure. Therefore, to get the final correct digitized
coordinates, the longitude and latitude values of the lower left corner point
need to be added.
The transformation of the positive solution of the Mercator projection
is as follows: given the digitized coordinates (B,L), the plane rectangular
coordinates (x,y) under the Mercator projection are calculated, and the
formula is as shown in [21]

(8)
In the formula, L0 is the longitude of the origin, and B0 is called the reference latitude. When B0 = 0, the cylinder is tangent to the earth ellipsoid, and the radius of the tangent cylinder is a.
The inverse solution transformation of the Mercator projection is as
follows: given the plane rectangular coordinates (x,y) under the Mercator
projection, the digitized coordinates (B,L) are calculated, and the formula
is shown in

(9)
In the formula, exp denotes the exponential function (with the natural base e), and the latitude B converges quickly through iterative calculation.
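In practice, the Gaussian (transverse Mercator) and Mercator projection formulas above need not be coded by hand; a library such as pyproj can evaluate the forward and inverse solutions. The following sketch is illustrative only: the central meridian of 114°E, the false easting, and the GRS80 ellipsoid are assumptions chosen for the example rather than parameters taken from the paper.

from pyproj import Transformer

# Geodetic (longitude, latitude) -> transverse Mercator (Gauss-type) plane
# coordinates; projection parameters below are assumed for illustration.
gauss = Transformer.from_crs(
    "EPSG:4326",
    "+proj=tmerc +lat_0=0 +lon_0=114 +k=1 +x_0=500000 +y_0=0 +ellps=GRS80",
    always_xy=True,
)
x, y = gauss.transform(113.26, 23.13)          # roughly Guangzhou
print("Gauss projection:", x, y)

# The same point under an ellipsoidal Mercator projection.
mercator = Transformer.from_crs(
    "EPSG:4326", "+proj=merc +lon_0=114 +ellps=GRS80", always_xy=True
)
print("Mercator projection:", mercator.transform(113.26, 23.13))

# Inverse solution: plane coordinates back to (longitude, latitude).
inverse = Transformer.from_crs(
    "+proj=tmerc +lat_0=0 +lon_0=114 +k=1 +x_0=500000 +y_0=0 +ellps=GRS80",
    "EPSG:4326",
    always_xy=True,
)
print("Inverse:", inverse.transform(x, y))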
For the geometric matching method of point entities, the commonly used
matching similarity index is Euclidean distance. The algorithm compares
the calculated Euclidean distance between the two with a threshold, and the
one within the threshold is determined to be an entity with the same name
or may be an entity with the same name. If multiple entities with the same
name are obtained by matching, then repeated matching can be performed
by reducing the threshold or reverse matching. If the entity with the same
name cannot be matched, it can be adjusted by appropriately increasing the
threshold until the entity with the same name is matched. The calculation
formula of Euclidean distance is shown in

D = √[(x1 − x2)² + (y1 − y2)²]        (10)

where D is the Euclidean distance between two point entities P1(x1, y1) and P2(x2, y2).
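The threshold-based point matching just described can be sketched as follows; the threshold values and sample coordinates are invented for illustration, and a k-d tree is used only as a convenient way to obtain nearest neighbours.

import numpy as np
from scipy.spatial import cKDTree

def match_points(points_a, points_b, threshold):
    """
    Match each point in points_a to its nearest neighbour in points_b using
    the Euclidean distance of formula (10); pairs whose distance exceeds the
    threshold are treated as unmatched.
    """
    points_a = np.asarray(points_a, float)
    points_b = np.asarray(points_b, float)
    tree = cKDTree(points_b)
    dist, idx = tree.query(points_a)      # nearest neighbour for every a-point
    matches = {}
    for i, (d, j) in enumerate(zip(dist, idx)):
        if d <= threshold:
            matches[i] = int(j)
    return matches

# As the text suggests, ambiguous matches can be re-run with a smaller
# threshold (or matched in the reverse direction), and missing matches can be
# retried with a progressively enlarged threshold.
a = [(0.0, 0.0), (10.0, 10.0), (25.0, 5.0)]
b = [(0.3, -0.2), (10.4, 9.8), (60.0, 60.0)]
for t in (0.25, 0.5, 1.0):                # progressively relaxed thresholds
    print(t, match_points(a, b, t))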
For the geometric matching method of line entities, the total length L
of the line, the direction θ of the line, the maximum chord Lmax of the line,
etc., are usually used as the matching similarity index. Its definition and
calculation formula are as follows:

The Total Length of the Line


The total length of the line is defined as the sum of the lengths of the subline segments that make up the line segment. We assume that the points that make up the line are P1(x1, y1), P2(x2, y2), …, Pn(xn, yn), as shown in Figure 2. The total length of the line is calculated as

L = Σ_{i=1}^{n−1} √[(x_{i+1} − x_i)² + (y_{i+1} − y_i)²]        (11)

Figure 2: The direction of the line.

The Direction of the Line


The direction of the line is defined as the angle between the line between the
first end point and the end point and the x-axis. It is specified that clockwise
is positive and counterclockwise is negative, as shown in Figure 3. We
assume that the first end point of the line is P1(x1, y1) and the end point is Pn(xn, yn); then, the direction of the line is calculated as θ = arctan[(yn − y1)/(xn − x1)].
Figure 3: The direction of the line.

The Maximum Chord of the Line


The maximum chord of a line is defined as the distance between the two
furthest points that make up the line, as shown in Figure 4, and the calculation
formula is shown in

Lmax = max_{1≤i<j≤n} √[(x_i − x_j)² + (y_i − y_j)²]        (12)

Figure 4: Maximum chord of line.
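The three line-similarity indices defined above (total length, direction, and maximum chord) can be computed directly from the vertex list, for example as in the sketch below; the sample polyline is invented for illustration.

import math
from itertools import combinations

def total_length(points):
    """Formula (11): sum of the sub-segment lengths."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def direction(points):
    """Angle between the first-to-last chord and the x-axis, in degrees.
    Note: atan2 is counterclockwise-positive; the paper defines clockwise as
    positive, so the sign may need to be flipped to match that convention."""
    (x1, y1), (xn, yn) = points[0], points[-1]
    return math.degrees(math.atan2(yn - y1, xn - x1))

def max_chord(points):
    """Formula (12): largest distance between any two vertices of the line."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))

line = [(0, 0), (3, 1), (5, 4), (9, 2)]
print(total_length(line), direction(line), max_chord(line))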


The main idea of the graph data adjustment and merging algorithm based
on the same-name point triangulation is as follows: in the “reference map”
and “adjustment map,” a topologically isomorphic Delaunay triangulation is
constructed with the matched points of the same name as the starting point.
After the corresponding regions in the two figures are divided into small
triangles, the coordinate conversion relationship is established through the
three vertices of each small triangle, and the other points falling into the
triangle undergo coordinate conversion according to this relationship.
Figure 5 is a part of the triangulation network constructed according to
the above method in the two vector diagrams, where ΔABC and ΔA′B′C′
are, respectively, triangles formed by pairs of points with the same name in
the two vector diagrams.

Figure 5: Example of partial division of triangulation.


The linear transformation relationship between △ABC and △A′B′C′ is shown in

x′ = a1x + b1y + c1,  y′ = a2x + b2y + c2        (13)

Among them, (x, y) and (x′, y′) are the coordinates of corresponding points in △ABC and △A′B′C′, respectively. Substituting the coordinates of the three pairs of triangle vertices into formula (13), the six coefficients of the transformation F can be obtained, and thus the transformation formula is determined.
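A compact sketch of the triangulation-based adjustment is given below: a Delaunay triangulation is built on the matched points, each remaining point is located in its triangle, and the affine transformation of formula (13) determined by that triangle's vertex pairs is applied to it. The direction of the conversion (adjustment map to reference map) and the sample coordinates are assumptions made for the example.

import numpy as np
from scipy.spatial import Delaunay

def triangle_affine(src_tri, dst_tri):
    """Solve x' = a1*x + b1*y + c1, y' = a2*x + b2*y + c2 from 3 vertex pairs."""
    A = np.hstack([src_tri, np.ones((3, 1))])      # rows: [x, y, 1]
    coeff_x = np.linalg.solve(A, dst_tri[:, 0])    # a1, b1, c1
    coeff_y = np.linalg.solve(A, dst_tri[:, 1])    # a2, b2, c2
    return coeff_x, coeff_y

def warp_points(ref_pts, adj_pts, other_pts):
    """Transform 'other_pts' using the triangle of matched points they fall in."""
    ref_pts, adj_pts = np.asarray(ref_pts, float), np.asarray(adj_pts, float)
    tri = Delaunay(adj_pts)                        # triangulation on the adjustment map
    out = []
    for p in np.asarray(other_pts, float):
        s = int(tri.find_simplex(p[None, :])[0])
        if s < 0:                                  # outside the triangulation: keep as is
            out.append(p)
            continue
        verts = tri.simplices[s]
        cx, cy = triangle_affine(adj_pts[verts], ref_pts[verts])
        out.append([cx @ [p[0], p[1], 1.0], cy @ [p[0], p[1], 1.0]])
    return np.array(out)

ref = [(0, 0), (10, 0), (10, 10), (0, 10), (5, 5)]   # matched points, reference map
adj = [(0.4, 0.2), (10.3, 0.1), (9.8, 10.2), (0.1, 9.9), (5.2, 5.1)]
print(warp_points(ref, adj, [(2.0, 2.0), (7.5, 8.0)]))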
The basic idea of the graph adjustment merging algorithm based on the
principle of adjustment is as follows: The algorithm takes the coordinate
adjustment values of the points that constitute the entity as the parameter
to be solved, that is, the correction to the adjusted values. Various error formulas, such as the displacement formula, shape formula, relative displacement formula, area formula, parallel line formula, line segment length formula, and distance formula of adjacent entities, are established according to actual application needs. Finally, the calculation is carried out according to the least squares principle of adjustment, and the calculation formulas are shown in

(14)

v = Ax − l        (15)

In the formula, constraint_k is the limit value of the k-th factor, Δi is the adjustment of the i-th entity coordinate point, and n is the total number of entity coordinate points. A is the coefficient matrix of the adjustment model, and v, x, and l are the corresponding residual vector, parameter vector, and constant vector, respectively.
The adjustment and merging algorithm based on topological relations
is mainly used to adjust the geometric positions of unmatched points in
entities with the same name. The basic idea is as follows: first, the algorithm
determines that the unmatched points that need to be adjusted are affected
by the matched points with the same name. Secondly, the algorithm analyzes
and calculates the geometric position adjustment of each matched point with
the same name. Finally, the algorithm uses the weighted average method to
calculate the total geometric position adjustment of the unmatched points.
We assume that the position adjustment of the unmatched point P is affected by N matched points with the same name Q1, Q2, …, QN, and that the distance from P to each matched point Qi with the same name is di. We assume that the coordinate adjustment amount contributed by the matched point Qi to the point P is ΔQi; then, the total adjustment amount of the coordinates of point P is calculated as

ΔP = Σ_{i=1}^{N} wi ΔQi        (16)

Among them, the weight wi = (1/di) / Σ_{j=1}^{N} (1/dj).
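Formula (16) amounts to an inverse-distance weighted average of the displacements of the nearby matched points, which can be written as the short helper below; the example coordinates are invented.

import math

def adjust_unmatched(p, matched_pairs):
    """
    Formula (16): displace an unmatched point P by the weighted average of the
    adjustments of the matched points Qi that influence it.
    matched_pairs: list of (qi_before, qi_after) coordinate tuples.
    """
    weights, shifts = [], []
    for (qx0, qy0), (qx1, qy1) in matched_pairs:
        d = math.dist(p, (qx0, qy0))
        w = 1.0 / max(d, 1e-12)          # closer matched points weigh more
        weights.append(w)
        shifts.append((qx1 - qx0, qy1 - qy0))
    total = sum(weights)
    dx = sum(w * sx for w, (sx, _) in zip(weights, shifts)) / total
    dy = sum(w * sy for w, (_, sy) in zip(weights, shifts)) / total
    return (p[0] + dx, p[1] + dy)

# Two matched points moved slightly during merging; the unmatched point
# P = (5, 5) is dragged along, weighted by its inverse distance to each.
print(adjust_unmatched((5.0, 5.0), [((0, 0), (0.4, 0.2)), ((10, 0), (9.7, 0.3))]))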

Adjust and Merge Algorithm Based on Multiple Evaluation Factors
This article divides the points that constitute entities with the same name
into two categories: points with the same name that are successfully matched
and points that are not successfully matched. The point with the same name
refers to the description of the same point on the entity with the same name
in different vector images. The point of the same name that is successfully
matched means that the point of the same name that constitutes the entity
of the same name on one of the vector images can find the matching point
of the same name on the corresponding entity of the same name in the other
vector image. The unmatched point means that the point that constitutes
the entity with the same name on one of the vector images cannot find a
matching point on the corresponding entity with the same name on the other
vector image. Because the points that constitute the entities with the same
name inevitably have positioning errors, it is inevitable that there will be
missing matches during the matching process of the points with the same
name. Therefore, there are two situations for the unmatched points: one is a point that does have a counterpart with the same name but whose match was missed; the other is a point that has no counterpart with the same name. The classification of the points constituting the entity
with the same name is shown in Figure 6.
Figure 6: Classification of points constituting entities with the same name.


The average angle difference refers to the absolute average value of
the change of each turning angle of the entity before and after the entity is
adjusted, and its calculation formula is shown in formula (17). The average
angle difference can quantitatively describe the degree of change in the
shape of the entity before and after adjustment: the larger the average angle difference, the greater the change in the shape of the graph before and after the adjustment, and vice versa.

Δθ = (1/r) Σ_{i=1}^{r} |θi,after − θi,front|        (17)

In the formula, θi,front and θi,after respectively represent the value of the i-th turning angle of the entity before and after the adjustment, and r represents the total number of turning angles that constitute the entity.
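A direct implementation of formula (17) is shown below; the angle values are invented for illustration.

def average_angle_difference(angles_before, angles_after):
    """Formula (17): mean absolute change of the turning angles (degrees)."""
    r = len(angles_before)
    return sum(abs(a - b) for a, b in zip(angles_after, angles_before)) / r

# Turning angles of an entity before and after adjustment (degrees).
print(average_angle_difference([90.0, 120.0, 150.0], [92.0, 118.5, 150.5]))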
In order to enable the entity adjustment and merging algorithm based on topological relations to maintain the consistency of the shape of irregular entities before and after adjustment and merging, this paper uses the average angle difference as an indicator of the magnitude of the shape change; that is, starting from the average angle difference, an adjustment and merging algorithm based on topological relations and shape correction is proposed for the points on entities with the same name that are not successfully matched. The detailed steps of the algorithm are as follows:
(1) The algorithm first processes the point that is not matched successfully according to the adjustment and merging algorithm based on the topological relationship; that is, its adjusted position coordinates (x1′, y1′) are calculated by formula (16).
(2) Based on the adjustment and merging algorithm of the topological relationship, the shape correction is performed. According to the principle that, before and after the adjustment, the unmatched point on the entity with the same name should maintain the same angle with the two nearest matched points with the same name, the shape-corrected position coordinates (x2′, y2′) are calculated. As in Figure 7, we assume that A1, B1, A2, and B2 are the point pairs with the same name that are successfully matched on the entities with the same name in vector image 1 and vector image 2, where A1 matches A2 and B1 matches B2. They are adjusted to A′ and B′ after being processed by the entity adjustment and merging algorithm. In the figure, X is the unmatched point in vector image 1, and the two matched points closest to X in this figure are A1 and B1, respectively. Now, the algorithm adjusts and merges the point X that is not successfully matched and finds its adjusted position X′(x2′, y2′). Before the adjustment and merger, the angle between X, A1, and B1 is ∠A1XB1. In order to ensure that the included angle remains unchanged before and after the adjustment and merger, the adjusted and merged included angle ∠A′X′B′ should be equal to ∠A1XB1. Therefore, the parallel lines of l1 and l2 are drawn through A′ and B′, respectively, and the intersection point of the two straight lines obtained is the desired X′(x2′, y2′).
Figure 7: Schematic diagram of shape correction.


It should be noted that, before the entity is adjusted and merged, if A1, B1, and X are on the same straight line, the shape correction of this step and the next step can be omitted, and the result of step (1), obtained from formula (16), can be taken directly as the merged result.
(3) Using the weighted average method, the algorithm calculates the final adjusted and merged position coordinates (x′, y′), as shown in formula (18). In this way, the adjustment and merging algorithm based on the topological relationship realizes the correction of the entity shape.

x′ = a1·x1′ + a2·x2′,  y′ = a1·y1′ + a2·y2′        (18)

In the formula, a1 and a2 are the weights, and their values are determined according to the specific data, application, and experience.
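Putting the three steps together, the sketch below computes a shape-corrected merge. It assumes, following the description of Figure 7, that l1 and l2 are the lines XA1 and XB1, so the shape-corrected position is obtained by intersecting the line through A′ parallel to XA1 with the line through B′ parallel to XB1; the weights a1 and a2 and all coordinates are illustrative assumptions.

import numpy as np

def line_intersection(p1, d1, p2, d2):
    """Intersect the 2-D lines p1 + t*d1 and p2 + s*d2."""
    A = np.array([np.asarray(d1, float), -np.asarray(d2, float)]).T
    t, _ = np.linalg.solve(A, np.asarray(p2, float) - np.asarray(p1, float))
    return np.asarray(p1, float) + t * np.asarray(d1, float)

def shape_corrected_merge(x_topo, X, A1, B1, A_adj, B_adj, a1=0.5, a2=0.5):
    """
    Step (1) result x_topo comes from formula (16); step (2) builds the
    shape-corrected position as the intersection of the parallels to XA1 and
    XB1 drawn through A' and B'; step (3) blends both with weights a1, a2.
    """
    X, A1, B1 = map(np.asarray, (X, A1, B1))
    d1, d2 = A1 - X, B1 - X                     # directions of l1 = XA1, l2 = XB1
    x_shape = line_intersection(A_adj, d1, B_adj, d2)
    return a1 * np.asarray(x_topo, float) + a2 * x_shape

X = np.array([4.0, 3.0])                        # unmatched point in vector image 1
A1, B1 = np.array([0.0, 0.0]), np.array([10.0, 0.0])
A_adj, B_adj = np.array([0.3, 0.2]), np.array([10.2, 0.4])
x_topo = np.array([4.25, 3.3])                  # pretend output of formula (16)
print(shape_corrected_merge(x_topo, X, A1, B1, A_adj, B_adj))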

INFORMATION PROCESSING OF NATIONAL TRADITIONAL SPORTS BASED ON SPATIAL DIGITAL INFORMATION FUSION
The software function of the national sports digital training system is to
collect the athlete’s action video, upload it to the computer, and process the
video through the software, as shown in Figure 8.

Figure 8: National traditional sports training system based on spatial digital information fusion (CCD cameras 1 to n, digital video repository, main processor, touch display, and graphic image analysis software).

The hardware part of the national sports digital information system includes video capture cards, industrial computers, industrial cameras, touch screens, and racks, as shown in Figure 9.

Figure 9: System hardware structure diagram (objective lens, industrial cameras, video collection card, industrial computer, and touchable display).


The main development method of the system in this paper is shown in Figure 10.

Figure 10: System development method (the diagram lists the structured SDLC method, the prototyping method, the process-oriented (structured) method, the data-oriented (information engineering) method, and the object-oriented (OO) method, together with supporting technologies such as visualization, computer-aided software engineering, software reuse, and an integrated project/program support environment with a central resource database).


After constructing the system of this paper, the model is tested and verified through simulation design. Through simulation research, the national traditional sports information processing system based on digital information fusion proposed in this paper is studied, the effectiveness of the proposed method is compared with that of the traditional method, and the results are shown in Figure 11.

Figure 11: Verification of the effectiveness of the information processing of traditional national sports based on digital information fusion (x-axis: sample number; y-axis: information processing effect).
It can be seen from the above that the effect of the traditional national
sports information processing method based on digital information fusion
proposed in this article is relatively significant. On this basis, the spatial
digital processing of this method is evaluated, and the results shown in
Table 1 and Figure 12 are obtained.

Table 1: Evaluation of the spatial digital processing effect of the national tradi-
tional sports information processing method based on digital information fusion

Number Digital effect Number Digital effect Number Digital effect

1 87.54 23 86.55 45 86.56


2 88.11 24 88.81 46 91.47
3 93.22 25 89.53 47 88.05
4 92.99 26 88.00 48 87.30
5 89.74 27 86.75 49 93.32
6 87.11 28 88.87 50 88.40
7 86.24 29 88.66 51 92.70
8 91.30 30 86.26 52 91.14
9 88.40 31 86.80 53 89.77
10 86.92 32 86.03 54 89.12
11 92.18 33 89.42 55 88.36
12 88.95 34 87.65 56 93.39
13 92.52 35 88.80 57 91.97
14 90.00 36 90.60 58 87.88
15 90.23 37 88.12 59 86.76
16 86.97 38 89.69 60 91.24
17 88.55 39 91.62 61 93.62
18 90.56 40 89.35 62 90.12
19 87.73 41 88.46 63 93.59
20 86.26 42 91.01 64 92.30
21 89.78 43 87.52 65 92.50
22 93.69 44 89.18 66 88.78

Figure 12: Statistical diagram of the spatial digital processing effect of the national traditional sports information processing method based on digital information fusion (x-axis: sample number; y-axis: information processing effect).
From the above research, it can be seen that the national traditional
sports information processing method based on digital information fusion
proposed in this article also has a good effect in the digital processing of the
national traditional sports space.

CONCLUSION
Information technology has emerged in the field of sports, and brand-
new sports activities such as sports digitalization and sports resource
informationization have emerged. Unlike traditional online games and e-sports, which involve relatively static finger and eye movements, digital sports put more emphasis on “sweating” body movements.
Moreover, it uses digital technologies such as motion capture devices and
motion sensors to transform and upgrade traditional sports to achieve
interaction and entertainment among humans, machines, and the Internet.
Digital sports will also play a particularly important role in social criticism
and cultural value orientation. This article combines digital information
fusion technology to construct the national traditional sports information
processing system and improve the development effect of national
traditional sports in the information age. The research results show that the
national traditional sports information processing method based on digital
information fusion proposed in this paper has a good effect in the digital
processing of the national traditional sports space.
REFERENCES
1. K. Aso, D. H. Hwang, and H. Koike, “Portable 3D human pose
estimation for human-human interaction using a chest-mounted
fisheye camera,” in Augmented Humans Conference 2021, pp. 116–
120, Finland, February 2021.
2. A. Bakshi, D. Sheikh, Y. Ansari, C. Sharma, and H. Naik, “Pose estimate
based yoga instructor,” International Journal of Recent Advances in
Multidisciplinary Topics, vol. 2, no. 2, pp. 70–73, 2021.
3. S. L. Colyer, M. Evans, D. P. Cosker, and A. I. Salo, “A review of
the evolution of vision-based motion analysis and the integration of
advanced computer vision methods towards developing a markerless
system,” Sports Medicine-Open, vol. 4, no. 1, pp. 1–15, 2018.
4. Q. Dang, J. Yin, B. Wang, and W. Zheng, “Deep learning based 2d
human pose estimation: a survey,” Tsinghua Science and Technology,
vol. 24, no. 6, pp. 663–676, 2019.
5. R. G. Díaz, F. Laamarti, and A. El Saddik, “DTCoach: your digital
twin coach on the edge during COVID-19 and beyond,” IEEE
Instrumentation & Measurement Magazine, vol. 24, no. 6, pp. 22–28,
2021.
6. S. Ershadi-Nasab, E. Noury, S. Kasaei, and E. Sanaei, “Multiple human
3d pose estimation from multiview images,” Multimedia Tools and
Applications, vol. 77, no. 12, pp. 15573–15601, 2018.
7. R. Gu, G. Wang, Z. Jiang, and J. N. Hwang, “Multi-person hierarchical
3d pose estimation in natural videos,” IEEE Transactions on Circuits
and Systems for Video Technology, vol. 30, no. 11, pp. 4245–4257,
2019.
8. G. Hua, L. Li, and S. Liu, “Multipath affinage stacked—hourglass
networks for human pose estimation,” Frontiers of Computer Science,
vol. 14, no. 4, pp. 1–12, 2020.
9. M. Li, Z. Zhou, and X. Liu, “Multi-person pose estimation using
bounding box constraint and LSTM,” IEEE Transactions on
Multimedia, vol. 21, no. 10, pp. 2653–2663, 2019.
10. S. Liu, Y. Li, and G. Hua, “Human pose estimation in video via
structured space learning and halfway temporal evaluation,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 29,
no. 7, pp. 2029–2038, 2019.
11. A. Martínez-González, M. Villamizar, O. Canévet, and J. M. Odobez,
“Efficient convolutional neural networks for depth-based multi-person
pose estimation,” IEEE Transactions on Circuits and Systems for Video
Technology, vol. 30, no. 11, pp. 4207–4221, 2019.
12. W. McNally, A. Wong, and J. McPhee, “Action recognition using deep
convolutional neural networks and compressed spatio-temporal pose
encodings,” Journal of Computational Vision and Imaging Systems,
vol. 4, no. 1, pp. 3–3, 2018.
13. D. Mehta, S. Sridhar, O. Sotnychenko et al., “VNect,” ACM Transactions
on Graphics (TOG), vol. 36, no. 4, pp. 1–14, 2017.
14. M. Nasr, H. Ayman, N. Ebrahim, R. Osama, N. Mosaad, and A. Mounir,
“Realtime multi-person 2D pose estimation,” International Journal of
Advanced Networking and Applications, vol. 11, no. 6, pp. 4501–4508,
2020.
15. X. Nie, J. Feng, J. Xing, S. Xiao, and S. Yan, “Hierarchical contextual
refinement networks for human pose estimation,” IEEE Transactions
on Image Processing, vol. 28, no. 2, pp. 924–936, 2019.
16. Y. Nie, J. Lee, S. Yoon, and D. S. Park, “A multi-stage convolution
machine with scaling and dilation for human pose estimation,” KSII
Transactions on Internet and Information Systems (TIIS), vol. 13, no.
6, pp. 3182–3198, 2019.
17. I. Petrov, V. Shakhuro, and A. Konushin, “Deep probabilistic human
pose estimation,” IET Computer Vision, vol. 12, no. 5, pp. 578–585,
2018.
18. G. Szűcs and B. Tamás, “Body part extraction and pose estimation
method in rowing videos,” Journal of Computing and Information
Technology, vol. 26, no. 1, pp. 29–43, 2018.
19. N. T. Thành and P. T. Công, “An evaluation of pose estimation in
video of traditional martial arts presentation,” Journal of Research and
Development on Information and Communication Technology, vol.
2019, no. 2, pp. 114–126, 2019.
20. J. Xu, K. Tasaka, and M. Yamaguchi, “Fast and accurate whole-body
pose estimation in the wild and its applications,” ITE Transactions on
Media Technology and Applications, vol. 9, no. 1, pp. 63–70, 2021.
21. A. Zarkeshev and C. Csiszár, “Rescue method based on V2X
communication and human pose estimation,” Periodica Polytechnica
Civil Engineering, vol. 63, no. 4, pp. 1139–1146, 2015.
Chapter 16

Effects of Quality and Quantity of Information Processing on Design Coordination Performance

R. Zhang1, A. M. M. Liu2, I. Y. S. Chan2

1 Department of Quantity Survey, School of Construction Management and Real Estate, Chongqing University, Chongqing, China.
2 Department of Real Estate and Construction, Faculty of Architecture, The University of Hong Kong, Hong Kong, China

ABSTRACT
It is acknowledged that a lack of interdisciplinary communication amongst designers can result in poor coordination performance in building design. Viewing communication as an information processing activity, this paper aims to explore the relationship between interdisciplinary information processing (IP) and design coordination performance. Both the amount and the quality of information processing are considered. 698 project-based samples are collected by a questionnaire survey from design institutes in mainland China.

Citation: Zhang, R., Liu, A.M.M. and Chan, I.Y.S. (2018), “Effects of Quality and Quantity of Information Processing on Design Coordination Performance”. World Journal of Engineering and Technology, 6, 41-49. doi: 10.4236/wjet.2018.62B005.
Copyright: © 2018 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
Statistical data analysis shows that the relationship between information processing amount and design coordination performance follows a nonlinear exponential expression, performance = 3.691(1 − 0.235^(IP amount)), rather than an inverted U curve. This implies that the design period is too short to allow information overload, and indicates that the main problem in interdisciplinary communication in design institutes in China is insufficient information. In addition, it is found that the correlation between IP quality and coordination process performance is much stronger than that between IP amount and coordination process performance. For practitioners, this reminds design managers to pay more attention to information processing quality rather than amount.

Keywords: Inter-Disciplinary, Information Processing, Design Coordination Performance

INTRODUCTION
Changes in construction projects are very common and could lead to project
delays and cost overruns. Lu and Issa believe that the most frequent and
most costly changes are often related to design, such as design changes
and design errors [1]. Hence, design stage is of primary importance in
construction project life-cycle [2]. Common types of design deficiencies
include design information inconsistency (e.g. location of a specific wall
differing on the architectural and structural drawings), mismatches/physical
interference between connected components (e.g. duct dimensions in
building service drawings not matching related pass/hole dimensions in
structural drawings), and component malfunctions (e.g. designing a room’s
electrical supply to suit classroom activities, while architectural drawings
designate the room as a computer lab) [3] [4]. Based on a questionnaire
survey of 12 leading Canadian design firms, Hegazy, Khalifa and Zaneldin
report eight common problems―all of which are due to insufficient and
inadequate communication and information exchange (e.g., delay in
obtaining information, not everyone on the team getting design change
information) [5]. Communication and information exchange is termed
information processing in this paper. Information processing includes the
collection, processing and distribution of information [6], and can be either
personal or impersonal (e.g. accomplished using a program) [7] [8].
Building design is a multi-disciplinary task. The process of designing a
building is the process of integrating information from multiple disciplinary
professionals (e.g. architects, structure engineers, building service engineers,
surveyors). Compared to intra-disciplinary coordination, inter-disciplinary
coordination is much more challenging. Information processing in the latter
situation might come across knowledge boundary, geographical remoteness,
goal heterogeneity, as well as organization boundary (in most of western
project design team). In Mainland China, most of the time, all disciplinary teams are employed in the same design institute. Hence, this paper does not consider the effect of organization boundary in the context of Mainland China. It is acknowledged that a lack of interdisciplinary information processing amongst designers can result in poor design coordination performance (e.g. suboptimal solutions, design changes, construction delays).

Information Processing Amount and Design Coordination Performance
Although there is a number of studies showing that an increase in
communication or a shift in the nature of information communicated is
related to good performance in high workload situations [9], it is incorrect
to posit a linear positive relationship between information processing
amount and performance, as too much information processing leads to
information overload. Unrestricted communication can also detract from
project efficiency and effectiveness [10]. It is well-acknowledged that too
little information processing will result in poor performance (e.g. problems
in new project development, project failures), as it cannot supply necessary
information [11]. However, too much information exchange may allow good
performance, but low effectiveness and, even worse, may tax performance
due to information overload [12] [13]. Redundant information processing
overloads people’s cognitive capacity, which impedes the normal processing
of necessary information. Processing more information than necessary may
help to ensure good quality, but it does so at the cost of reduced effectiveness.
Coordination and information processing impose additional task loads on
project team actors, and should be kept to the minimum necessary to achieve
integration. In theory, the relationship between information processing
amount and design coordination performance follows a reverted U curve.
Due to tight design schedules, the situation in most design institutes is a lack of interdisciplinary communication, and overloaded communication is quite rare.
Hence, it is hypothesized that:

H1: The relationship between information processing amount and design coordination performance follows a nonlinear exponential expression of:

performance = b1(1 − b2^(IP amount))        (1)
What is the relationship between information processing amount and information processing quality? According to Chinese philosophy, the accumulation of quantitative change brings about an improvement in quality. In the context of interdisciplinary information processing in Chinese design institutes, it is hypothesized that:

H2: The relationship between information processing amount and perceived information quality follows a nonlinear exponential expression of PIQ = b1(1 − b2^(IP amount)).
Literature in the field of communication studies is reviewed here to
investigate the concept of information processing quality, for two reasons.
The first is that information process quality should be constructed as a multi-
dimensional construct to properly investigate its rich content; however,
little research within the information processing theory literature discusses
the multiple dimensions of information processing quality, perhaps
due to the short history of information processing theory. Fortunately,
in the communication study community, researchers have deeply
discussed the content of information quality in communication [14] [15].
Usually, communication refers to communication between people; here,
communication is not limited to people talking to other people directly, but
also includes people getting information from media sources, such as online
management systems on which other people have posted information.
Information processing in design coordination includes both personal
communication, and communication through programming; in this sense,
communication and information processing are the same issue, which is the
second reason why research findings from the communication studies field
can be used.
Perceived information quality (PIQ) is a concept applied, in
communication literature, to measure information processing quality, and
refers to the extent to which an individual perceives information received
from a sender as being valuable. At the cognitive level, people choose sources
that are perceived to have a greater probability of providing information that
will be relevant, reliable, and helpful to the problem at hand―attributes
that may be summarized under the label perceived source quality [16].
A substantial body of literature suggests that a receiver’s perceptions of
information quality influences the degree to which he or she is willing to
act on it. Six critical communication variables are identified by Thomas et
al. [15], four of which are highly related to information processing quality:
clarity (how clear the information received is, as indicated by the frequency
of conflicting instructions, poor communications, and lack of coordination);
understanding (shared with supervisors and other groups) of information
expectations; timeliness (of the information received, including design and
schedule changes); and, completeness (the amount of relevant information
received). Four similar variables are used by Maltz [14], in discussing
perceived information quality: credibility (the degree to which information
is perceived by the receiver to be a reliable refection of the truth);
comprehensibility (perceived clarity of the information received); relevance
(the degree to which information is deemed appropriate for the user’s task
or application); and timeliness (whether information is transmitted quickly
enough to be utilized). In a study of coordination quality in construction
management, Chang and Shen [17] used two dimensions: perceived utility
(i.e., the relevance, credibility and completeness of information) and clarity
of communication (i.e., comprehensibility, conciseness, and consistency of
representation). The two concepts seem to have some overlap.
In this study, accuracy, relevance, understanding and timeliness have
been selected to represent the multi-dimensional construct, PIQ, as shown
in Table 1. Accuracy herein refers to the degree to which information is
perceived by the receiver to be a reliable refection of the real situation.
Relevance denotes the degree to which the information is appropriate for the
user’s task. Understanding refers to the perceived clarity of the information
received. Timeliness represents whether the information is transmitted
quickly enough to allow the receiver to complete the task on time.

METHODS

Data Collection Method


Web-based questionnaire survey is applied to collect data for this
investigation.
Table 1: Measurement scale of perceived information quality

Accuracy        The information sent by them is accurate.
                They sent me conflicting information. (R)
Relevance       They communicated important details of design information.
                They provided information necessary in design decision making.
Understanding   It is easy to follow their logic.
                Their terminology and concepts are easy to understand.
                They presented their ideas clearly.
Timeliness      They provided information in a timely manner.
                Their information on design change is too late.
                They gave me information that are “old hat”.
The target respondents in this survey are participants in a building
design project team from a design institute in Mainland China. Respondents
are chosen based on three criteria; specifically, respondents should: 1) have
participated in a project that had been completed within the past year (as
they would be asked to recall design coordination activity); 2) have been
either a project design manager, discipline leader, or designer/engineer (top managers are excluded); and 3) have been in one of the following disciplines during the project: project management, architecture, structure engineering, mechanical engineering, electrical engineering, plumbing engineering, or BIM engineering. 1174 questionnaire responses are received, of which 219 are completely answered, yielding a completion rate of 18.7%. 10 of the completed questionnaires are dropped in data analysis as obvious data
outliers. Each respondent reported data on his/her dyadic interdisciplinary
design coordination with from two to seven disciplines (see Table 2). As
the level of analysis is dyadic interdisciplinary design coordination, each
questionnaire is split into two to seven samples. Data on both intra- and
inter-discipline coordination are collected, although the study’s focus is on
inter-discipline coordination. The total sample size in the inter-disciplinary coordination data set is 698 (the sum of the off-diagonal figures in Table 2).

Measurement

Interdisciplinary Communication Amount


Interdisciplinary communication frequency is applied to measure
Interdisciplinary communication amount, using a five-point scale.
Respondents are asked to indicate the frequency with which they
communicated with designers from other discipline teams (1 = zero, 2 = less
than once monthly, 3 = several times monthly, 4 = several times weekly, 5 =
several times daily). Generally, building design has three stages: conceptual
design, preliminary design, and detailed design. In each of the different
stages, information exchange frequency differs. The conceptual design stage
is dominated by architects, with the exception of limited advice-seeking
from other disciplines. The most frequent information exchange happens
in the detailed design stage, where all disciplines are heavily involved, with
each producing detailed designs to ensure the final product can function
well. Hence, frequency of interdisciplinary communication in the detailed
design stage is used to test hypotheses in this study.

Table 2: Matrix of dyadic coordination samples

GD Archi. SE ME EE PE BIM
GD 2 5 5 2 2 2 2
Archi. 15 65 65 44 44 44 44
SE 12 66 66 38 38 38 38
ME 2 15 15 6 6 6 6
EE 1 12 12 2 2 2 2
PE 2 11 11 7 7 7 7
BIM 3 19 19 9 9 9 9

Notes: GD: General Drawing; Archi.: Architecture; SE: Structure Engineering; ME: Mechanical Engineering; EE: Electrical Engineering; PE: Plumbing Engineering; BIM: Building Information Modelling.

Coordination Performance
Coordination process performance refers to the extent to which the
respondent (focal unit a) has effective information processing with another
person in the design team (unit j). It is a dyadic concept, and the five-item
dyadic coordination performance scale used by Sherman and Keller [18]
is applied. The scale includes items examining: 1) the extent to which the
focal unit a had an effective working relationship with unit j; 2) the extent to
which unit j fulfilled its responsibilities to unit a; 3) the extent to which unit
a fulfilled its responsibilities to unit j; 4) the extent to which the coordination
is satisfactory; and, 5) the positive or negative effect on productivity, as a
result of the coordination.

DATA ANALYSIS

Information Processing Amount and Design Coordination Performance
For coordination process performance (Table 3), b1 and b2 are quite
significant in Model 1. This suggests that the relationship between
frequency of interdisciplinary communication in the detailed design stage
and coordination process performance can be expressed as:
Performance = 3.691(1 − 0.235^x)
H1 is thus strongly supported. In Models 2, 3 and 4, b2 is not significant.
One possible reason is that, many other factors influence design project
performance, besides coordination process performance.

Table 3: Information processing amount and design coordination performance

                      Model 1                   Model 2             Model 3                   Model 4
Dependent variable    Coordination process      Design quality      Design schedule control   Design cost
                      performance
Independent variable  Frequency of interdisciplinary communication (all models)
b1                    3.691*** (0.0571)         3.745*** (0.0472)   3.626*** (0.0468)         3.539*** (0.0472)
b2                    0.235*** (0.0448)         0.0133 (0.0603)     −0.00138 (0.0618)         −0.0509 (0.0630)
N                     642                       445                 445                       444
Adjusted R-squared    0.904                     0.937               0.934                     0.928

Standard errors in parentheses; *p < 0.05, **p < 0.01, ***p < 0.001. Differences in sample size are due to missing data.

Information Processing Amount and Information Processing Quality

As for the relationship between information processing (IP) amount and information processing quality, the nonlinear exponential expression Performance = b1(1 − b2^x) is fitted, as shown in Table 4. Both b1 and b2 are significant: IP quality = 3.353(1 − 0.311^{IP amount}). H2 is strongly supported.
With the assumption that both information processing amount and information processing quality are positively related to design coordination process performance, an exploratory study is conducted to compare the two correlations using a regression function. The results are shown in Table 5. As it is an exploratory study, a p value of less than 0.1 is accepted as significant.
The results show that 1) both IP amount and IP quality are positively related to coordination process performance; and 2) the correlation between IP quality and coordination process performance is much stronger than that between IP amount and coordination process performance.

DISCUSSION
On one hand, insufficient interdisciplinary communication will lead to
coordination failure. On the other hand, too much information processing
will lead to information overload as well as coordination cost overrun. The
challenge for cross-functional teams is to ensure the level of information
exchange amongst team members allows them to optimize their performance
[11]. In this study, it is found that information processing amount is positively
related to coordination process performance; specifically, it is found that the
relationship between the frequency of interdisciplinary communication in
the detailed design stage and coordination process performance follows the nonlinear exponential expression performance = 3.691(1 − 0.235^{IP amount}). Whether this finding generalizes to regions other than Mainland China requires further study.

Table 4: Information processing amount and information processing quality

Dependent variable      Perceived information quality
Independent variable    Frequency of interdisciplinary communication
b1                      3.353*** (0.0530)
b2                      0.311*** (0.0371)
N                       852
Adjusted R-squared      0.892

Standard errors in parentheses; *p < 0.05, **p < 0.01, ***p < 0.001. Differences in sample size are due to missing data.

Table 5: Information processing amount, information processing quality, and coordination process performance

Path                              Beta    Std.Err.   z        P>z     90% Conf. Interval
designer cp <- iefd               0.136   0.071      1.930    0.054   −0.002   0.275
disciplinary leader cp <- iefd    0.128   0.042      3.070    0.002    0.046   0.210
designer cp <- PIQ                0.614   0.046     13.23     0.000    0.523   0.705
disciplinary leader cp <- PIQ     0.748   0.027     27.89     0.000    0.695   0.800

Although both IP amount and IP quality are positively related to coordination process performance, the correlation between IP quality and coordination process performance is much stronger than that between IP amount and coordination process performance. This result is consistent with earlier research on decision effectiveness, in which the impact of information quality was also stronger [19]. It suggests that more attention should be paid to improving information processing quality. To improve information processing quality, effort can be directed at improving information accuracy, relevance, understanding, and timeliness. The role of building information modelling in improving interdisciplinary communication could be investigated in the future.

CONCLUSION
This paper explores the relationship between interdisciplinary communication
and design coordination performance in design institutes in Mainland China.

From an information processing perspective, interdisciplinary communication is viewed as an information processing activity, and both the amount and the quality of information processing are considered. Information processing quality is measured by four dimensions: perceived information accuracy, relevance, understanding, and timeliness. Based on 698 samples of quantitative survey data at the project level, it is found that the relationship between information processing amount and design coordination process performance follows the nonlinear exponential expression performance = 3.691(1 − 0.235^{IP amount}) rather than an inverted U curve. This implies that the design period is too short for information overload to occur and indicates that the main problem in interdisciplinary communication in design institutes in China is insufficient information. In addition, it is found that the correlation between IP quality and coordination process performance is much stronger than that between IP amount and coordination process performance. For practitioners, this reminds design managers to pay more attention to the quality of information processing rather than the amount.

REFERENCES
1. Lu, H. and Issa, R.R. (2005) Extended Production Integration
for Construction: A Loosely Coupled Project Model for Building
Construction. Journal of Computing in Civil Engineering, 19, 58-68.
https://doi.org/10.1061/(ASCE)0887-3801(2005)19:1(58)
2. Harpum, P. (Ed.) (2004) Design Management. John Wiley and Sons
Ltd., USA. https://doi.org/10.1002/9780470172391.ch18
3. Korman, T., Fischer, M. and Tatum, C. (2003) Knowledge and Reasoning
for MEP Coordination. Journal of Construction Engineering and
Management, 129, 627-634. https://doi.org/10.1061/(ASCE)0733-
9364(2003)129:6(627)
4. Mokhtar, A.H. (2002) Coordinating and Customizing Design
Information through the Internet. Engineering Construction and
Architectural Management, 9, 222-231. https://doi.org/10.1108/
eb021217
5. Hegazy, T., Khalifa, J. and Zaneldin, E. (1998) Towards Effective
Design Coordination: A Questionnaire Survey 1. Canadian Journal of
Civil Engineering, 25, 595-603. https://doi.org/10.1139/l97-115
6. Tushman, M.L. and Nadler, D.A. (1978) Information Processing
as an Integrating Concept in Organizational Design. Academy of
Management Review, 613-624.
7. Dietrich, P., Kujala, J. and Artto, K. (2013) Inter-Team Coordination
Patterns and Outcomes in Multi-Team Projects. Project Management
Journal, 44, 6-19. https://doi.org/10.1002/pmj.21377
8. Van de Ven, A.H., Delbecq, A.L. and Koenig Jr., R. (1976) Determinants
of Coordination Modes within Organizations. American Sociological
Review, 322-338. https://doi.org/10.2307/2094477
9. Mathieu, J.E., Heffner, T.S., Goodwin, G.F., Salas, E. and Cannon-
Bowers, J.A. (2000) The Influence of Shared Mental Models on Team
Process and Performance. Journal of Applied Psychology, 85, 273.
https://doi.org/10.1037/0021-9010.85.2.273
10. Katz, D. and Kahn, R.L. (1978) The Social Psychology of Organizations.
11. Patrashkova, R.R. and McComb, S.A. (2004) Exploring Why More
Communication Is Not Better: Insights from a Computational
Model of Cross-Functional Teams. Journal of Engineering and
Technology Management, 21, 83-114. https://doi.org/10.1016/j.
jengtecman.2003.12.005

12. Goodman, P.S. and Leyden, D.P. (1991) Familiarity and Group
Productivity. Journal of Applied Psychology, 76, 578. https://doi.
org/10.1037/0021-9010.76.4.578
13. Boisot, M.H. (1995) Information Space. Int. Thomson Business Press.
14. Maltz, E. (2000) Is All Communication Created Equal? An Investigation
into the Effects of Communication Mode on Perceived Information
Quality. Journal of Product Innovation Management, 17, 110-127.
https://doi.org/10.1016/S0737-6782(99)00030-2
15. Thomas, S.R., Tucker, R.L. and Kelly, W.R. (1998) Critical
Communications Variables. Journal of Construction Engineering
and Management, 124, 58-66. https://doi.org/10.1061/(ASCE)0733-
9364(1998)124:1(58)
16. Choo, C.W. (2005) The Knowing Organization. Oxford University
Press. https://doi.org/10.1093/acprof:oso/9780195176780.001.0001
17. Chang, A.S. and Shen, F.-Y. (2014) Effectiveness of Coordination
Methods in Construction Projects. Journal of Management in
Engineering. https://doi.org/10.1061/(ASCE)ME.1943-5479.0000222
18. Sherman, J.D. and Keller, R.T. (2011) Suboptimal Assessment of
Interunit Task Interdependence: Modes of Integration and Information
Processing for Coordination Performance. Organization Science, 22,
245-261. https://doi.org/10.1287/orsc.1090.0506
19. Keller, K.L. and Staelin, R. (1987) Effects of Quality and Quantity of
Information on Decision Effectiveness. Journal of Consumer Research,
14, 200-213. https://doi.org/10.1086/209106
Chapter 17

Neural Network Optimization Method and Its Application in Information Processing

Pin Wang1, Peng Wang2, and En Fan3

1 School of Mechanical and Electrical Engineering, Shenzhen Polytechnic, Shenzhen 518055, Guangdong, China
2 Garden Center, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, Guangdong, China
3 Department of Computer Science and Engineering, Shaoxing University, Shaoxing 312000, Zhejiang, China

ABSTRACT
Neural network theory is the basis of massive information parallel processing
and large-scale parallel computing. Neural network is not only a highly
nonlinear dynamic system but also an adaptive organization system, which
can be used to describe the intelligent behavior of cognition, decision-
making, and control. The purpose of this paper is to explore the optimization

Citation: Pin Wang, Peng Wang, En Fan, “Neural Network Optimization Method and
Its Application in Information Processing”, Mathematical Problems in Engineering,
vol. 2021, Article ID 6665703, 10 pages, 2021. https://doi.org/10.1155/2021/6665703.
Copyright: © 2021 by Authors. This is an open access article distributed under the
Creative Commons Attribution License, which permits unrestricted use, distribution,
and reproduction in any medium, provided the original work is properly cited.

method of neural network and its application in information processing. This


paper uses the topology-preserving property of the SOM (self-organizing feature map) neural network to estimate the direction of arrival (DOA) of array signals. For the estimation of the direction of arrival of single-source signals in array signal processing, this paper establishes uniform linear array and arbitrary array models based on the distance difference of arrival (DDOA) vector to detect the DOA. The
relationship between the DDOA vector and the direction of arrival angle is
regarded as a mapping from the DDOA space to the AOA space. For this
mapping, through derivation and analysis, it is found that there is a similar
topological distribution between the two variables of the sampled signal. In
this paper, the network is trained by uniformly distributed simulated source
signals, and then the trained network is used to perform AOA estimation
effect tests on simulated noiseless signals, simulated Gaussian noise signals,
and measured signals of sound sources in the lake. Neural network and
multisignal classification algorithms are compared. This paper proposes
a DOA estimation method using two-layer SOM neural network and
theoretically verifies the reliability of the method. Experimental research
shows that when the signal-to-noise ratio drops from 20 dB to 1 dB in the
experiment with Gaussian noise, the absolute error of the AOA prediction is small and the fluctuation is not large, indicating that the prediction performance of the SOM network optimization method established in this paper does not degrade as the signal-to-noise ratio decreases, and that the method has a strong ability to adapt to noise.

INTRODUCTION
In the information society, the amount of information being generated grows ever larger [1]. To make information available in a timely manner to serve the development of the national economy, science and technology, and the defense industry, it is necessary to collect, process, transmit, and store information data and to make decisions on it, with theoretical innovation and implementation carried out to meet the needs of social development. Therefore, neural
networks have extremely extensive research significance and application
value in information science fields such as communications, radar, sonar,
electronic measuring instruments, biomedical engineering, vibration
engineering, seismic prospecting, and image processing. This article focuses
on the study of neural network optimization methods and their applications
in intelligent information processing.

Many scholars abroad have studied neural network optimization methods and their use in information processing and have achieved good results. For example, Al Mamun MA developed a new method of image restoration using neural network technology, which overcomes, to a certain extent, some shortcomings of traditional methods; in addition, neural networks have been widely used in image edge detection, image segmentation, and image compression [2]. Hamza MF proposed a BP algorithm to train RBF weights; the BP algorithm with an additional momentum factor can improve the training coefficient of the network and avoid oscillations, which improves the training rate of the network [3]. Tom B proposed an RBF-PLSR model based on genetic clustering. This model uses genetic-algorithm clustering analysis to determine the number of hidden layer nodes and the centers of the hidden nodes in the RBF network, and the PLSR method is used to determine the network's connection weights [4, 5].
In China, an adaptive linear element (Adaline) model has been proposed; Adaline was also implemented in hardware and successfully applied to cancel echo and noise in communications. Quan proposed the error back-propagation (BP) algorithm, which in principle solves the training problem of multilayer neural networks, gives neural networks strong computing power, and greatly increases the vitality of artificial neural networks [6]. Cheng used mathematical theory to prove the fundamental computational limitation of the single-layer perceptron; however, an effective learning algorithm for multilayer neural networks with hidden layers had not yet been found at that point [7].
In this paper, the problem of single-signal source azimuth detection
under uniform linear sensor array and arbitrary array is studied, and the
direction-of-arrival detection array model is established, respectively. In the
case of a uniform linear array, this paper establishes a two-layer SOM neural
network. We first explain the theoretical basis of this neural network, that is, the shared topological structure between the input vector and the output result.
For this reason, we separately analyzed the topological structure of the
DDOA vector and the predicted value of the AOA in the case of a uniform
linear array. Through derivation and simulation data, we can see that the two
do have similar topological structures, which led us to establish the SOM
neural network system. It can be applied to AOA prediction problems based
on DDOA. Finally, simulation experiments and lake water experiments
verify the practical feasibility of this method.

NEURAL NETWORK OPTIMIZATION METHOD AND ITS RESEARCH IN INFORMATION PROCESSING

Array Optimization and Orientation Based on DDOA and SOM Neural Network
Signal and information processing mainly includes three main processes:
information acquisition, information processing, and information
transmission [8, 9]. Array signal processing can be regarded as an important branch of modern signal processing. Its main research object is the signal transmitted in the form of a spatially propagating wave: the wave signal is received through a sensor array with a certain spatial distribution, and information is extracted from the received signal. This paper mainly studies algorithms by which a sensor array detects the azimuth of a sound wave, namely, the direction of arrival (DOA).

Array Signal Model


Array signal processing is often based on a strict mathematical theoretical
model based on a series of assumptions about the observed signal. The
objects explored in this article are all two-dimensional spatial signal
problems. These assumptions stem from the abstraction and generalization
of the observed signal and noise.
(1) Narrowband signal: when the bandwidth of the spatial source signal is much smaller than its center frequency, we call this spatial source signal a narrowband signal; that is, the general requirement

WB / f0 ≪ 1    (1)

is met,
where WB is the signal bandwidth and fo is the signal center frequency. A
single-frequency signal with a center frequency of fo can be used to simulate
a narrowband signal. The sine signal as we know it is a typical narrowband
signal. The analog signals used in this article are all single-frequency sine
signals.
(2) Array signal processing model: suppose that there is a sensor
array in the plane, in which M sensor array elements with
arbitrary directivity are arranged, and K narrowband plane waves
are distributed in this plane. The center frequencies of these plane
Neural Network Optimization Method and Its Application in Information ... 331

waves are all ω0 and the wavelength is λ, and suppose that M > K
(that is, the number of array elements is greater than the number of
incident signals). The signal output received by the k-th element
at time t is the sum of the K plane waves; namely,

xk(t) = Σ_{i=1}^{K} ak(θi) si(t − τk(θi))    (2)
where ak(θi) is the sound pressure response coefficient of element k to source
i, si(t − τk(θi)) is the signal wavefront of source i, and τk(θi) is the time delay of element k relative to the reference element. According to the
assumption of narrowband waves, the time delay only affects the wavefront
by the phase change,

si(t − τk(θi)) ≈ si(t) e^{−jω0 τk(θi)}    (3)
Therefore, formula (2) can be rewritten as

xk(t) = Σ_{i=1}^{K} ak(θi) e^{−jω0 τk(θi)} si(t)    (4)
Write the output of the M sensors in vector form; the model becomes

x(t) = [x1(t), x2(t), ..., xM(t)]^T = Σ_{i=1}^{K} a(θi) si(t)    (5)

where

a(θi) = [a1(θi) e^{−jω0 τ1(θi)}, ..., aM(θi) e^{−jω0 τM(θi)}]^T    (6)

is called the direction (steering) vector of the incoming wave direction θi. Let s(t) = [s1(t), ..., sK(t)]^T and let n(t) denote the measurement noise; then the above array model can be expressed as

x(t) = A(θ) s(t) + n(t)    (7)

where A(θ) = [a(θ1), a(θ2), ..., a(θK)] is the direction matrix of the array model.
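To make the model concrete, the sketch below simulates the narrowband output x(t) = A(θ)s(t) + n(t) of a uniform linear array in Python. The element count, spacing, sampling rate, source angle, and noise level are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

M = 4                 # number of array elements
K = 1                 # number of narrowband sources (M > K)
f0 = 2000.0           # center frequency [Hz]
c = 1500.0            # propagation speed in water [m/s]
lam = c / f0          # wavelength
d = lam / 2.0         # element spacing (half wavelength)
theta = np.deg2rad([20.0])          # assumed source direction(s) of arrival

def steering_vector(th):
    # a(theta) for a uniform linear array: the phase delay grows linearly along the array
    m = np.arange(M)
    return np.exp(-1j * 2 * np.pi * d * m * np.sin(th) / lam)

A = np.column_stack([steering_vector(th) for th in theta])   # M x K direction matrix

T = 200                                        # number of snapshots
t = np.arange(T) / 8000.0                      # sampling instants (assumed 8 kHz rate)
s = np.exp(1j * 2 * np.pi * f0 * t)[None, :]   # K x T single-frequency (narrowband) signal
noise = (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))) / np.sqrt(2)
x = A @ s + 0.1 * noise                        # array output x(t) = A(theta) s(t) + n(t)
print(x.shape)                                 # (M, T)
```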

Subspace Decomposition Based on Eigendecomposition of Array Covariance Matrix
The DOA estimation problem is the estimation of the direction-of-arrival angles θi (i = 1, 2, ..., K) in space, which requires the covariance information between the different elements of the array for
analysis. For this, first calculate the spatial covariance matrix output by the
array:

R = E{x(t) x(t)^H}    (8)

where E{·} denotes statistical expectation. Let

Rs = E{s(t) s(t)^H}    (9)

be the covariance matrix of the signals and

Rn = E{n(t) n(t)^H} = σ² I    (10)

be the covariance matrix of the noise, where it is assumed that the noise received by all elements has a common variance σ², which is also the noise power. From equations (9) and (10), we can get

R = A(θ) Rs A(θ)^H + σ² I    (11)
It can be proved that R is a nonsingular, positive definite Hermitian matrix; that is, R^H = R. Therefore, the eigendecomposition of R can be performed to achieve diagonalization and can be written as follows:

R = U Λ U^H    (12)

where U is a unitary transformation matrix, so that the matrix R is diagonalized into a real-valued matrix Λ = diag(λ1, λ2, ..., λM), and the eigenvalues are ordered as follows:

λ1 ≥ λ2 ≥ ... ≥ λK > λ_{K+1} = ... = λM = σ²    (13)

From equation (13), it can be seen that any vector orthogonal to A(θ) is an eigenvector of the matrix R belonging to the eigenvalue σ².
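Continuing in code, the sample covariance matrix and its eigendecomposition split the space into signal and noise subspaces, which is what subspace methods such as MUSIC exploit. The sketch below is self-contained with a toy array output; the array size, source angle, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, T = 4, 1, 200
# Toy array output: one half-wavelength ULA steering vector times a unit-power tone plus noise
a = np.exp(-1j * np.pi * np.arange(M) * np.sin(np.deg2rad(20.0)))
s = np.exp(1j * 2 * np.pi * 0.1 * np.arange(T))
x = np.outer(a, s) + 0.1 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))) / np.sqrt(2)

R = (x @ x.conj().T) / T               # sample estimate of R = E{x x^H}
eigvals, U = np.linalg.eigh(R)         # eigh: real ascending eigenvalues of a Hermitian matrix
eigvals, U = eigvals[::-1], U[:, ::-1] # reorder so lambda_1 >= ... >= lambda_M
Us, Un = U[:, :K], U[:, K:]            # signal subspace / noise subspace (eigenvalues ~ sigma^2)
print(np.round(eigvals, 4))
```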

RBF Neural Network Estimates the Direction of Arrival


RBF neural network is a method that can perform curve fitting or interpolation in high-dimensional space. If the relationship between the input space and the output space is regarded as a mapping, this mapping can be regarded as a hypersurface defined over the input data in the high-dimensional space, and a designed RBF neural network is then equivalent to fitting the height of this hypersurface: it establishes an approximate hypersurface by interpolating the input data points [10, 11].
The sensor array is equivalent to a mapping from the DOA space Θ = {θ1, ..., θK} to the sensor array output space X = {x1(t), ..., xM(t)}, that is, a mapping G: Θ → X with

xm(t) = Σ_{k=1}^{K} ak e^{j(ω0 t + α)} e^{−jω0 (m−1) d sin θk / c},  m = 1, ..., M    (14)
where K is the number of source signals, M is the number of elements of the
uniform linear array, ak is the complex amplitude of the k-th signal, α is the
initial phase, ω0 is the signal center frequency, d is the element spacing, and
c is the propagation speed of the source signal [12, 13].
When the number of information sources has been estimated as K, the
function of the neural network on this problem is equivalent to the inverse problem of the above mapping, that is, the inverse mapping G^{-1}: X → Θ.
To obtain this mapping, it is necessary to establish a neural network structure
in which the preprocessed data based on the incident signal is used as the
network input, and the corresponding DOA is used as the network output
after the hidden layer activation function is applied. The whole process is
a targeted training process, and the process of fitting the mapping with the
RBF neural network is equivalent to an interpolation process.
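As a rough illustration of fitting such an inverse mapping with radial basis functions, the sketch below trains a Gaussian RBF model by linear least squares. The training features, target, centers, and width are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: 2-D feature vectors (stand-ins for preprocessed array data) -> AOA in degrees
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.rad2deg(np.arctan2(X[:, 1], X[:, 0]))   # toy target standing in for the true AOA

centers = X[rng.choice(len(X), size=20, replace=False)]   # RBF centers picked from the data
sigma = 0.4                                               # common Gaussian width (assumed)

def design_matrix(Xin):
    # Phi[i, j] = exp(-||x_i - c_j||^2 / (2 sigma^2))
    d2 = ((Xin[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Output-layer weights by linear least squares (the interpolation/fitting step)
W, *_ = np.linalg.lstsq(design_matrix(X), y, rcond=None)

X_test = rng.uniform(-1.0, 1.0, size=(5, 2))
print(design_matrix(X_test) @ W)     # predicted AOA for new feature vectors
```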

Estimation of Direction of Arrival of Uniform Linear Array SOM Neural Network

Kohonen Self-Organizing Neural Network
A SOM neural network consists of two layers: the input layer and the competition layer (also called the output layer). The number of nodes in the input layer is equal to the dimension of the input vector, and the neurons in the competition layer are usually arranged in a rectangle or hexagon on a two-dimensional plane. The output node j and the input nodes are connected by the weight vector

wj = [wj1, wj2, ..., wjm]^T,  j = 1, 2, ..., K    (15)
The training steps of the Kohonen SOM neural network used in this
article are as follows: the first step is network initialization [14, 15].

Normalize the input vector x to e′ such that ‖e′‖ = 1:

e′ = x / ‖x‖    (16)

where x = [x1, x2, ..., xm]^T is the training sample vector of the network. Initialize the network weights wj (j = 1, 2, ..., K) and normalize them in the same way as the input vector e′.
The second step is to calculate the Euclidean distance between the input vector and the weight vector ωj of each competition-layer neuron to obtain the winning neuron ωc [16, 17]. The selection principle of the winning neuron is as follows:

‖e′ − ωc‖ = min_j ‖e′ − ωj‖    (17)

The third step is to adjust the weights of the winning neuron ωc and of its neighborhood ωj. The adjustment method is as follows:

ωj(t + 1) = ωj(t) + η(t) Uc(t) (e′ − ωj(t))    (18)

Among them, η(t) is the learning rate function, which decreases with the number of iteration steps t [18, 19]. The function Uc(t) is the neighborhood function; here it is the Gaussian function:

Uc(t) = exp(−‖rj − rc‖² / (2σ²))    (19)

where r is the position of a neuron in the competition layer on the two-dimensional plane and σ is the smoothing factor, which is a positive constant.
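The three training steps above can be written compactly in code. The sketch below is a minimal NumPy implementation of Kohonen SOM training on normalized inputs; the grid size, learning-rate schedule, and smoothing factor are chosen arbitrarily for illustration and are not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

m, grid = 3, (10, 10)                       # input dimension, competition-layer grid
K = grid[0] * grid[1]
W = rng.normal(size=(K, m))
W /= np.linalg.norm(W, axis=1, keepdims=True)          # weights normalized like the inputs
pos = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]))).reshape(2, -1).T  # node positions r_j

X = rng.normal(size=(500, m))
X /= np.linalg.norm(X, axis=1, keepdims=True)          # step 1: normalize inputs (eq. (16))

n_iter, sigma = 2000, 2.0
for t in range(n_iter):
    e = X[rng.integers(len(X))]
    c = np.argmin(np.linalg.norm(e - W, axis=1))       # step 2: winning neuron (eq. (17))
    eta = 0.5 * (1.0 - t / n_iter)                     # decreasing learning rate eta(t)
    U = np.exp(-np.sum((pos - pos[c]) ** 2, axis=1) / (2 * sigma ** 2))  # Gaussian neighborhood (eq. (19))
    W += eta * U[:, None] * (e - W)                    # step 3: weight update (eq. (18))
```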

DOA Estimation Model Based on SOM Neural Network


Build a two-layer SOM neural network. The first layer of SOM neural
network is the sorting layer, which maps the input training data into a two-
dimensional space. According to the activation of neuron nodes on the first
two-dimensional grid, the output of the corresponding neuron node in the
second grid is defined by the following rules:

(1) If the neuron node j is activated by only one training sample vector and the direction angle of the signal corresponding to this sample is θj, then the output of the corresponding node of the second-layer grid is the direction angle of this signal [20, 21], namely,

outj = θj    (20)

(2) If the neuron node j is activated by more than one training sample vector, that is, nj > 1, and the direction angles of the signals corresponding to these samples are θ_{j,1}, ..., θ_{j,nj}, then the output of the corresponding node of the second-layer grid is the average value of the direction angles of these signals [22, 23], namely,

outj = (1/nj) Σ_{i=1}^{nj} θ_{j,i}    (21)
(3) If the neuron node j has never been activated by any training
sample vector, the corresponding output neuron node is regarded
as an invalid node. When this node is activated by a new input
vector, the output value is defined as the output direction angle of
the valid node closest to this node.
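A minimal sketch of how rules (1)-(3) above could be implemented is given below: after the first-layer SOM has been trained, each node's output angle is the mean AOA of the training samples that activate it, and never-activated nodes inherit the angle of the nearest valid node. The function and argument names are hypothetical, not the authors' code.

```python
import numpy as np

def build_output_layer(train_vectors, train_aoa, weights, node_positions):
    """Assign an output AOA to every SOM node according to rules (1)-(3).

    train_vectors  : (n, m) DDOA training vectors
    train_aoa      : (n,) AOA of each training signal, in degrees
    weights        : (K, m) trained first-layer SOM weights
    node_positions : (K, 2) grid coordinates of the K nodes
    """
    K = len(weights)
    sums, counts = np.zeros(K), np.zeros(K)
    for x, aoa in zip(train_vectors, train_aoa):
        j = np.argmin(np.linalg.norm(x - weights, axis=1))   # winning node for this sample
        sums[j] += aoa
        counts[j] += 1

    out = np.full(K, np.nan)
    valid = counts > 0
    out[valid] = sums[valid] / counts[valid]        # rules (1) and (2): single value or average

    for j in np.flatnonzero(~valid):                # rule (3): copy the nearest valid node's angle
        dist = np.linalg.norm(node_positions[j] - node_positions[valid], axis=1)
        out[j] = out[valid][np.argmin(dist)]
    return out
```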

Method Reliability Analysis


The establishment process of the two-layer SOM neural network we
proposed above shows that the topological order of AOA is similar to
the topological distribution of DDOA vectors. In other words, when the
Euclidean distance between two DDOA vectors is small, the Euclidean
distance of the corresponding AOA value must also be small. This is the
theoretical basis for our proposed method, and we will conduct a detailed
analysis on this nature.
Suppose that the DDOA vectors of two adjacent source signals are d and
d1 = d + Δ d, and the corresponding AOAs are θ and θ1 = θ + Δθ, respectively.
The DDOA increment and AOA increment are

(22)
336 Data Analysis and Information Processing

Obviously the function d_{i,j+1} is differentiable at every point (x, y) ∈ R², which shows that the DDOA vector d and the AOA value θ have a consistent trend [24, 25]. In other words, when
the DDOA vectors of two source signals are similar, their arrival direction
angles AOA must also be similar. Therefore, the topological orders and
distributions of DDOA vector and AOA are basically the same.

Genetic Clustering Method


In cluster analysis, the K-means clustering method is a clustering method
that is often used. Generally, when determining the structure of the RBF
network, this method is used to determine the number of hidden layer nodes
of the network and the center of the node.

Chromosome Coding and Population Initialization


In order to accelerate convergence, we use real-number coding [26]. For m-dimensional samples, if the number of classes to be formed is n, the centers of the n classes are encoded and each center is m-dimensional; the length of the chromosome is then n × m.
In this way, a chromosome represents a complete classification strategy.
Initialize the preset number of chromosomes to get the initial population.

Determination of Fitness Function and Selection of Fitness


For each chromosome, according to the classification information it carries and the idea of minimum-distance classification, the class of each sample in the original data can be determined, and so can the distance between each sample and its class center (here the Euclidean distance) [27, 28]. After determining the classification of the samples, the sum of the distances within classes can be calculated:

F = Σ_{i=1}^{k} Σ_{j=1}^{ni} ‖xj − Ci‖    (23)

At the same time, the sum of the distances between classes can also be found:

Q = Σ_{i=1}^{k} Σ_{l=i+1}^{k} ‖Ci − Cl‖    (24)

where F is the sum of distances within classes, Q is the sum of distances


between classes, k is the number of classes in the classification, ni is the
number of samples belonging to the i-th class, xj is the j-th sample of the i-th
class, and Ci is the center of the i-th class.
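For illustration, the within-class and between-class distance sums, and the resulting fitness ratio, can be computed from a chromosome (a flat vector of n class centers) as in the sketch below. The between-class form used here (pairwise distances between centers) and the ratio Q/F are assumptions about the exact definitions, which the text does not show in full.

```python
import numpy as np

def fitness(chromosome, data, n_classes):
    """Fitness of one chromosome = between-class / within-class distance ratio (assumed form)."""
    m = data.shape[1]
    centers = chromosome.reshape(n_classes, m)          # chromosome encodes n_classes centers

    # Assign each sample to its nearest center (minimum-distance classification)
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    F = dists[np.arange(len(data)), labels].sum()       # eq. (23): within-class distance sum
    # Assumed eq. (24): sum of pairwise distances between class centers
    Q = sum(np.linalg.norm(centers[i] - centers[j])
            for i in range(n_classes) for j in range(i + 1, n_classes))
    return Q / F if F > 0 else 0.0

# Hypothetical usage
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))
chrom = rng.normal(size=(3 * 2,))    # 3 classes, 2-D centers
print(fitness(chrom, data, 3))
```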

NEURAL NETWORK OPTIMIZATION METHOD AND ITS EXPERIMENTAL RESEARCH IN INFORMATION PROCESSING

Underwater Experimental Research in the Lake


The underwater experiment is carried out in the lake. The average depth of
the lake water is 50 meters to 60 meters. The area of open water is more than
300 m × 1200 m, and the water body is relatively stable and suitable
for DOA estimation experiments. The experimental equipment used this
time is a uniform linear array composed of 4 acoustic pressure hydrophones
with an array spacing of 0.472 meters.

Experimental Methods and Data Collection

No Noise
In order to verify the effectiveness of the two-layer SOM neural network
established in this paper for arbitrary array conditions, we conducted a
simulation experiment of detecting the direction of acoustic signals with
arbitrary sensor arrays underwater. Assuming that the sensor array contains
4 sensors, the frequency of a single sound source signal is f = 2 kHz, the
propagation speed of the sound signal in water is c = 1500 m/s, and the
distance between two adjacent sensors is Δi = 0.375, which is the wavelength
half. The positions of the four sensor array elements are (x1 = 0.y1 = 0), (x2 =
0.3, y2 = 0.225), (x3 = 0.5, y3 = −0.0922), and (x4 = 0.6, y4 = 0.2692). In order
to obtain the training vector, we uniformly collect 60 × 30 points from the
rectangular area [−20, 20] × [0, 20] ∈ R2 as the emission positions of 1800
simulated sound source signals, which can calculate 1800 DDOA vectors r,
and input them into the network as training vectors of the network.
Calculate the value of Rmax(x, y):

(25)

Except for the few points near the origin (0, 0), the function Rmax(x, y)
at most of the remaining points has a common upper bound, which belongs
to the second case.
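A minimal sketch of generating the training vectors described above is shown below: source positions are sampled on a grid, and for each position the DDOA vector is formed from the differences of the distances to the sensors (here referenced to the first sensor, which is an assumption about the exact DDOA convention). The sensor coordinates follow the values quoted above; the AOA convention is likewise an assumption.

```python
import numpy as np

# Sensor element positions (x_i, y_i) as given in the text
sensors = np.array([[0.0, 0.0], [0.3, 0.225], [0.5, -0.0922], [0.6, 0.2692]])

# 60 x 30 source positions sampled uniformly from [-20, 20] x [0, 20]
xs = np.linspace(-20.0, 20.0, 60)
ys = np.linspace(0.0, 20.0, 30)
grid = np.array([[x, y] for x in xs for y in ys])            # 1800 emission positions

def ddoa(src):
    # Distances from the source to every sensor; differences w.r.t. sensor 1 form the DDOA vector
    d = np.linalg.norm(sensors - src, axis=1)
    return d[1:] - d[0]

train_vectors = np.array([ddoa(p) for p in grid])            # 1800 x 3 training vectors r
train_aoa = np.degrees(np.arctan2(grid[:, 1], grid[:, 0]))   # AOA of each simulated source (assumed convention)
print(train_vectors.shape, train_aoa.shape)
```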

Noise
In practice, the signal data collected by the sensor array is often noisy, and the energy of the noise is generally large. The signal-to-noise ratio often reaches very low values, even below 0 dB; that is, the signal is overwhelmed by environmental noise that is much stronger than the signal itself. When the signal-to-noise ratio is particularly small, people usually
perform a denoising filtering process artificially in advance to make the
filtered signal-to-noise ratio at least above 0 dB. Therefore, a good model
that can be applied to practice must be applicable to noisy environments.
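The sketch below shows one common way to add white Gaussian noise to a clean signal at a prescribed signal-to-noise ratio, which is how such noisy test signals are typically produced; the function and parameter names are illustrative, not taken from the paper.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Return the signal plus white Gaussian noise scaled to the requested SNR (in dB)."""
    rng = rng or np.random.default_rng()
    p_signal = np.mean(np.abs(signal) ** 2)            # average signal power
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))     # noise power for the target SNR
    noise = rng.normal(scale=np.sqrt(p_noise), size=signal.shape)
    return signal + noise

t = np.arange(0, 0.01, 1.0 / 48000.0)
clean = np.sin(2 * np.pi * 2000.0 * t)                 # 2 kHz tone, as in the experiments
noisy = add_awgn(clean, snr_db=1.0)                    # SNR of 1 dB, the lowest value tested
```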

Performing Genetic Clustering on Standardized Training Sample Data
The number of preselected clustering categories lies in the interval between 1/7 and 1/4 of the total number of samples (to facilitate the training of the network, since too few or too many categories result in poor training); the population size is taken as 30, the crossover rate is 75%, and the mutation rate is 5%. The fitness function is selected so that the ratio of the between-class distance to the within-class distance increases as the fitness function increases, and a convergent solution can be obtained in about 50 generations. Within this interval, the number of classes is changed one by one until the fitness function reaches its optimum. The number of categories at this point is the number of hidden nodes in the RBF network, and the center of each category is the center of the corresponding node.

NEURAL NETWORK OPTIMIZATION METHOD AND ITS EXPERIMENTAL RESEARCH ANALYSIS IN INFORMATION PROCESSING

Noise-Free Simulation Experiment


To test the performance of the network, we select six sets of source signals
with different distances from the origin, that is, six sets of points as the
test. The distances to the origin of the coordinates are 8 meters, 16 meters,
20 meters, 30 meters, 50 meters, and 100 meters. Each group contains 21

simulated signals with different AOA values. Calculate the DDOA vectors
corresponding to these simulated signal emission points, and then input
these vectors as test vectors into the trained two-layer SOM neural network.
The output of the network is the corresponding AOA predicted value. The
experimental results are shown in Figure 1.

Figure 1: Absolute error of the AOA predicted values of source signals at different distances using the SOM neural network.
The absolute error of the AOA prediction results is shown in Figure 1. It can be seen that the SOM network trained with near-field simulation signals (signal positions within the area [0, 21] × [0, 21]) not only gives good AOA predictions for the near-field test signals (4 m, 8 m, and 12 m) but, except for individual points, also achieves very high AOA prediction accuracy for the far-field test signals (16 m, 32 m, and 64 m); for the far field the error is basically confined to a narrow interval, is smaller than in the near field, and fluctuates less.
As shown in Table 1 and Figure 2, the average of the absolute value of the AOA prediction error for the noise-free signals in the simulation experiment is approximately 0.1° to 0.4°, with a minimum of 0.122° and a maximum of only 0.242°, and for most of the test signals (70%-80% of them) the absolute value of the prediction error is less than 0.1°.

Table 1: SOM neural network prediction results of noise-free signal AOA

Distance (m)      4      8      12     16     32     64
Average error     0.215  0.124  0.147  0.105  0.109  0.152
Pr(err < 0.3°)    0.763  0.862  0.901  0.986  0.853  0.901
Pr(err < 0.2°)    0.782  0.816  0.792  0.827  0.879  0.815

Figure 2: SOM neural network prediction results of noise-free signal AOA (average error and Pr(err < 0.3°) versus distance).


To illustrate the effectiveness and scalability of this method in predicting
AOA, we set up an RBF neural network for comparison. The RBF neural
network established here uses the DDOA vector of the same simulation
signal (within the area [0, 20] × [0, 20]) as the input vector for network
training, and the corresponding AOA value is used as the target output of the
network training. Experimental results are shown in Table 2.

Table 2: Comparison of AOA errors predicted by the SOM neural network and the RBF neural network

x       0      5      10     15     20    25     30     35     40
RBF     0.43  −0.05   0.08  −0.16   0.09  6.25  15.64  16.28  17.89
SOM    −0.08   0.04  −0.11   0.03   0.08  0.13   0.01   0.04   0.02
As shown in Figure 3, the two networks both use the same 20 simulated
signals as test signals. The transmission positions of the tested signals are
evenly distributed between 2 meters and 40 meters from the origin, including
the training area, that is, within 20 meters. It can be seen from Figure 3 that the prediction performance of the RBF neural network within the training area is similar to that of the SOM neural network, but its prediction outside the training area is poor, while the SOM neural network shows strong adaptability to distance changes. This shows that the RBF neural network is affected by the distance factor, because its training principle is to fit the mapping relationship by interpolation, which makes the error larger when the test data exceeds the training range.

Figure 3: Comparison of AOA errors predicted by the SOM neural network and the RBF neural network.

Figure 4: SOM neural network AOA prediction results in the case of additional Gaussian noise.

Simulation Experiment with Gaussian Noise


In practical applications, the signal received by the sensor array is often
noisy. So here we use the signal containing Gaussian white noise to test
the abovementioned SOM neural network, and the test results are shown in
Table 3.

Table 3: SOM neural network AOA prediction results in the case of additional Gaussian noise

AOA (degrees)    0      5      10     15     20
10             −0.42   0.68  −0.83   0.11  −0.29
20             −6.24  −0.16   0.03  −2.57   2.62
30              3.89   0.08  −0.18   2.18  −1.38
40              1.28   2.48   1.36   1.98   1.23
Neural Network Optimization Method and Its Application in Information ... 343

It can be seen from Figure 4 that when the signal-to-noise ratio drops from 20 dB to 1 dB, the absolute error of the AOA prediction is small and its fluctuations are not large; that is, as long as the signal-to-noise ratio is greater than 1 dB, the prediction performance of the SOM network established here does not degrade as the signal-to-noise ratio decreases, and the network has strong adaptability to noise.

Experimental Analysis of Underwater Experiment Results in the Lake
During the experiment, the hydrophone array was placed at a depth of 3.7
meters below the surface of the water to be fixed. The sound source to be
measured is produced by a transducer. Here, the transducer is placed at a
depth consistent with the depth of the hydrophone, and the sound wave
frequency is 2 KHz. The position and direction of the sound source are
changed by changing the position of the transducer. When the transducer is
at a certain position, the hydrophone array receives the signal, considering
the speed of the sound signal in the water and the time difference of the
signal after noise reduction, and we can get the DDOA vector. Input the
DDOA vector into the pretrained SOM neural network to get the estimated
value of AOA. The estimated results are shown in Table 4.

Table 4: AOA prediction results of the experiment in the lake

Actual AOA (°)   Sound distance (m)   Forecast AOA (°)   Absolute error (°)
85               900                  78.5               7.4
55               500                  47.3               3.8
25               200                  26.2               5.2
As shown in Figure 5, the results of the noise-free signal test show
that this method is ideal for both near-field signals and far-field signals;
the test results for signals containing Gaussian white noise reflect the high
prediction accuracy of this method under low signal-to-noise ratios. Further,
the applicability of this method in actual experiments can be seen from the
results of lake water field experiments.

Figure 5: AOA prediction results of the experiment in the lake.


This paper uses the two-layer SOM neural network proposed earlier and
inputs the DDOA vector consistent with the previous simulation experiment
as training data into the network to train the network. Assuming that similar
DDOA vectors also correspond to similar signal emission positions, the
signal emission positions are also estimated according to the aforementioned
three neuron activation situations and principles. The test signal emission
points are selected as 20 points in the area where the simulation signal used
for training the network is located, and they are evenly distributed along the
curve. The test results are shown in Table 5.

Table 5: Using SOM neural network to predict signal position results

X            0      5      10     15      20
Real         3.68   5.18   6.63   8.46    17.36
Estimation   4.95   6.32   7.13   10.47   15.36
The test results are shown in Figure 6; the actual position point number is consistent with the abscissa, and the corresponding network-estimated position is marked with a number. It can be seen that, except for a few points at which the estimation deviation is small, the other deviations are still large. Therefore, the hypothesis does not hold, and this two-layer SOM neural network is not suitable for estimating the signal transmission position.

Figure 6: Using the SOM neural network to predict signal position results (real, estimated, and expected values).

Comparison and Analysis of Results


The category centers of the K-means clustering method are generated from a limited number of iterations, while the category centers of genetic clustering are generated by a global search. Therefore, in terms of clustering effect and the robustness of the category centers, the genetic clustering algorithm is better than the K-means clustering method on both points. Table 6 lists the ratio of the sum of the between-class distances to the sum of the within-class distances obtained by the two methods, that is, the fitness function.

Table 6: Comparison of genetic clustering and K-means clustering methods

Number of categories   Fitness (K-means clustering)   Fitness (genetic clustering)
4                      3.26                           4.61
5                      3.52                           4.78
6                      3.78                           4.98
7                      3.62                           5.18
8                      3.29                           4.72

It can be seen from Figure 7 that, compared to the K-means clustering method, the genetic clustering method has obvious advantages in the "cohesion effect" of the cluster centers. For RBF networks in particular, the "cohesion effect" of the centers often has a great impact on the performance of the network. So, although from the perspective of pure clustering analysis genetic clustering is not necessarily superior to the K-means method, in terms of RBF network learning genetic clustering has clear advantages over K-means clustering.

Figure 7: Comparison of genetic clustering and K-means clustering methods.

Influence of the Number of Neuron Nodes on the Prediction Effect
In order to study the influence of the number of neuron nodes in the network
on the prediction effect, simulation experiments were carried out on 6
different neuron node distribution modes. The absolute value of the absolute
error of the experimental results is then averaged, as shown in Table 7.

Table 7: Average absolute value of AOA prediction error under different neuron
node arrangements

Node arrangement     20 × 20  25 × 25  30 × 30  35 × 35  40 × 40  45 × 45
Rectangular domain   0.72     0.61     0.57     0.48     0.55     0.62
Circle               0.51     0.56     0.42     0.37     0.33     0.41

As shown in Figure 8, the prediction performance of the network trained with signals on the circle is better than that of the network trained with signals in the rectangular area. The prediction accuracy of the network with a 40 × 40 neuron node distribution is better than that of the other node distributions; this distribution is an n × n square arrangement whose size is slightly smaller than the number of training samples. It can also be seen that the prediction performance of the network does not keep improving as the number of neuron nodes increases.

Figure 8: Average absolute value of AOA prediction error under different neuron node arrangements (rectangular domain and circle).

CONCLUSIONS
In this paper, a two-layer SOM neural network is used to study the AOA
prediction problem based on DDOA vectors under arbitrary arrays in theory
and simulation experiments. This network is equivalent to a classifier: by classifying DDOA vectors it classifies the corresponding AOA values, thereby achieving the purpose of predicting the AOA. The established two-layer SOM neural network is further discussed, and the conditions under which the network can feasibly be applied for prediction are given: first, the features used for prediction are clarified and form the input vector, and the predicted quantity is used as the output of the network.
This method is verified through simulation experiments and actual lake
experiments. From the experimental results, it can be seen that the neural

network trained in advance on simulation data can detect the direction of arrival of the source signal in noise-free, Gaussian-white-noise, and real-noise environments, and the angle estimation performance is good. Finally, we further compare the prediction performance of this method with the classic MUSIC algorithm and the RBF neural network method. The experimental results show that the performance of this network is excellent and that it can be considered for practical use.
This paper applies SOM neural network to the estimation of the direction
of arrival of array signals. It is found through research that the DDOA vector
and AOA in the array signal have similar topological distributions. Based on this, the topology-preserving property of the SOM neural network is exploited to establish a two-layer SOM neural network to estimate the direction of
arrival of the array signal. While the method has a theoretical basis, it also
shows high estimation accuracy in both simulation experiments and lake
water experiments.

ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of
China under Grant 61703280.

REFERENCES
1. X. Li, Y. Wang, and G. Liu, “Structured medical pathology data
hiding information association mining algorithm based on optimized
convolutional neural network,” IEEE access, vol. 8, no. 1, pp. 1443–
1452, 2020.
2. M. A. A. Mamun, M. A. Hannan, A. Hussain, and H. Basri,
“Theoretical model and implementation of a real time intelligent bin
status monitoring system using rule based decision algorithms,” Expert
Systems with Applications, vol. 48, pp. 76–88, 2016.
3. M. F. Hamza, H. J. Yap, and I. A. Choudhury, “Recent advances on
the use of meta-heuristic optimization algorithms to optimize the type-
2 fuzzy logic systems in intelligent control,” Neural Computing and
Applications, vol. 28, no. 5, pp. 1–21, 2015.
4. B. Tom and S. Alexei, “Conditional random fields for pattern
recognition applied to structured data,” Algorithms, vol. 8, no. 3, pp.
466–483, 2015.
5. Y. Chen, W. Zheng, W. Li, and Y. Huang, “The robustness and
sustainability of port logistics systems for emergency supplies from
overseas,” Journal of Advanced Transportation, vol. 2020, Article ID
8868533, 10 pages, 2020.
6. W. Quan, “Intelligent information processing,” Computing in Science
& Engineering, vol. 21, no. 6, pp. 4-5, 2019.
7. X. Q. Cheng, X. W. Liu, J. H. Li et al., “Data optimization of traffic video
vehicle detector based on cloud platform,” Jiaotong Yunshu Xitong
Gongcheng Yu Xinxi/Journal of Transportation Systems Engineering
and Information Technology, vol. 15, no. 2, pp. 76–80, 2015.
8. S. Wei, Z. Xiaorui, P. Srinivas et al., “A self-adaptive dynamic
recognition model for fatigue driving based on multi-source information
and two levels of fusion,” Sensors, vol. 15, no. 9, pp. 24191–24213,
2015.
9. M. Niu, S. Sun, J. Wu, and Y. Zhang, “Short-term wind speed hybrid
forecasting model based on bias correcting study and its application,”
Mathematical Problems in Engineering, vol. 2015, no. 10, 13 pages,
2015.
10. X. Song, X. Li, and W. Zhang, “Key parameters estimation and adaptive
warning strategy for rear-end collision of vehicle,” Mathematical

Problems in Engineering, vol. 2015, no. 20, Article ID 328029, 20


pages, 2015.
11. M. El-Banna, “A novel approach for classifying imbalance welding
data: mahalanobis genetic algorithm (MGA),” International Journal of
Advanced Manufacturing Technology, vol. 77, no. 1–4, pp. 407–425,
2015.
12. J. P. Amezquita-Sanchez and H. Adeli, “Signal processing techniques
for vibration-based health monitoring of smart structures,” Archives of
Computational Methods in Engineering, vol. 23, no. 1, pp. 1–15, 2016.
13. C. Li, X. An, and R. Li, “A chaos embedded GSA-SVM hybrid system
for classification,” Neural Computing and Applications, vol. 26, no. 3,
pp. 713–721, 2014.
14. A. Abboud, F. Iutzeler, R. Couillet, M. Debbah, and H. Siguerdidjane,
“Distributed production-sharing optimization and application to
power grid networks,” IEEE Transactions on Signal and Information
Processing Over Networks, vol. 2, no. 1, pp. 16–28, 2016.
15. X. Xu, D. Cao, Y. Zhou et al., “Application of neural network algorithm
in fault diagnosis of mechanical intelligence,” Mechanical Systems and
Signal Processing, vol. 141, no. Jul, pp. 106625.1–106625.13, 2020.
16. H. Xiao, B. Biggio, B. Nelson, H. Xiao, C. Eckert, and F. Roli,
“Support vector machines under adversarial label contamination,”
Neurocomputing, vol. 160, no. jul.21, pp. 53–62, 2015.
17. G. Kan, C. Yao, Q. Li et al., “Improving event-based rainfall-runoff
simulation using an ensemble artificial neural network based hybrid
data-driven model,” Stochastic Environmental Research & Risk
Assessment, vol. 29, no. 5, pp. 1345–1370, 2015.
18. S. Cuomo, G. De Pietro, R. Farina, A. Galletti, and G. Sannino, “A
novel O ( n ) numerical scheme for ECG signal denoising,” Procedia
Computer Science, vol. 51, no. 1, pp. 775–784, 2015.
19. H. Guo, G. Dai, J. Fan, Y. Wu, F. Shen, and Y. Hu, “A mobile sensing
system for urban PM2.5 monitoring with adaptive resolution,” Journal
of Sensors, vol. 2016, no. 9, Article ID 7901245, 15 pages, 2016.
20. Q. Liu, J. Liu, R. Sang et al., “Fast neural network training on FPGA
using quasi-Newton optimization method,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 26, no. 99, pp. 1575–
1579, 2018.

21. R. M. S. F. Almeida and V. P. De Freitas, “An insulation thickness


optimization methodology for school buildings rehabilitation
combining artificial neural networks and life cycle cost,” Journal of
Civil Engineering and Management, vol. 22, no. 7, pp. 915–923, 2016.
22. A. S. Soma, T. Kubota, and H. Mizuno, “Optimization of causative
factors using logistic regression and artificial neural network models
for landslide susceptibility assessment in Ujung Loe Watershed, South
Sulawesi Indonesia,” Journal of Mountain Science, vol. 16, no. 2, pp.
383–401, 2019.
23. S. C. Miao, J. H. Yang, X. H. Wang et al., “Blade pattern optimization of
the hydraulic turbine based on neural network and genetic algorithm,”
Hangkong Dongli Xuebao/Journal of Aerospace Power, vol. 30, no. 8,
pp. 1918–1925, 2015.
24. Z. Junsheng, Y. Gu, and Z. Feng, “Optimization of processing
parameters of power spinning for bushing based on neural network
and genetic algorithms,” Journal of Beijing Institute of Technology, vol.
28, no. 3, pp. 228–238, 2019.
25. N. Melzi, L. Khaouane, S. Hanini, M. Laidi, Y. Ammi, and H. Zentou,
“Optimization methodology of artificial neural network models for
predicting molecular diffusion coefficients for polar and non-polar
binary gases,” Journal of Applied Mechanics and Technical Physics,
vol. 61, no. 2, pp. 207–216, 2020.
26. N. Chen, B. Rong, X. Zhang, and M. Kadoch, “Scalable and
flexible massive MIMO precoding for 5G H-cran,” IEEE Wireless
Communications, vol. 24, no. 1, pp. 46–52, 2017.
27. R. Noorossana, A. Zadbood, F. Zandi et al., “An interactive artificial
neural networks approach to multiresponse optimization,” International
Journal of Advanced Manufacturing Technology, vol. 76, no. 5–8, pp.
765–777, 2015.
28. D. Sánchez, P. Melin, and O. Castillo, “Optimization of modular
granular neural networks using a firefly algorithm for human
recognition,” Engineering Applications of Artificial Intelligence, vol.
64, no. sep, pp. 172–186, 2017.
Chapter 18

Information Processing Features Can Detect Behavioral Regimes of Dynamical Systems

Rick Quax1, Gregor Chliamovitch2, Alexandre Dupuis2, Jean-Luc Falcone2, Bastien Chopard2, Alfons G. Hoekstra1,3, and Peter M. A. Sloot1,3,4

1 Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands
2 Department of Computer Science, University of Geneva, Geneva, Switzerland
3 ITMO University, Saint Petersburg, Russia
4 Complexity Institute, Nanyang Technological University, Singapore

ABSTRACT
In dynamical systems, local interactions between dynamical units generate
correlations which are stored and transmitted throughout the system,
generating the macroscopic behavior. However a framework to quantify
exactly how these correlations are stored, transmitted, and combined
at the microscopic scale is missing. Here we propose to characterize the
notion of “information processing” based on all possible Shannon mutual

Citation: Rick Quax, Gregor Chliamovitch, Alexandre Dupuis, Jean-Luc Falcone, Bas-
tien Chopard, Alfons G. Hoekstra, Peter M. A. Sloot, “Information Processing Features
Can Detect Behavioral Regimes of Dynamical Systems”, Complexity, vol. 2018, Ar-
ticle ID 6047846, 16 pages, 2018. https://doi.org/10.1155/2018/6047846.
Copyright: © 2018 by Authors. This is an open access article distributed under the
Creative Commons Attribution License, which permits unrestricted use, distribution,
and reproduction in any medium, provided the original work is properly cited.

information quantities between a future state and all possible sets of initial
states. We apply it to the 256 elementary cellular automata (ECA), which
are the simplest possible dynamical systems exhibiting behaviors ranging
from simple to complex. Our main finding is that only a few information
features are needed for full predictability of the systemic behavior and that
the “information synergy” feature is always most predictive. Finally we
apply the idea to foreign exchange (FX) and interest-rate swap (IRS) time-
series data. We find an effective “slowing down” leading indicator in all
three markets for the 2008 financial crisis when applied to the information
features, as opposed to using the data itself directly. Our work suggests that
the proposed characterization of the local information processing of units
may be a promising direction for predicting emergent systemic behaviors.

INTRODUCTION
Emergent, complex behavior can arise from the interactions among (simple)
dynamical units. An example is the brain whose complex behavior as a
whole cannot be explained by the dynamics of a single neuron. In such a
system, each dynamical unit receives input from other (upstream) units
and then decides its next state, reflecting these correlated interactions.
This new state is then used by (downstream) neighboring units to decide
their new states and so on, eventually generating a macroscopic behavior
with systemic correlations. A quantitative framework is missing to fully
trace how correlations are stored, transmitted, and integrated, let alone to
predict whether a given system of local interactions will eventually generate
complex systemic behavior or not.
Our hypothesis is that Shannon’s information theory [1] can be used to
construct, eventually, such a framework. In this viewpoint, a unit’s new state
reflects its past interactions in the sense that it stores mutual information
about the past states of upstream neighboring units. In the next time instant a
downstream neighboring unit interacts with this state, implicitly transferring
this information and integrating it together with other information into its
new state and so on. In effect, each interaction among dynamical units is
interpreted as a Shannon communication channel and we aim to trace the
onward transmission and integration of information (synergy) through this
network of “communication channels.”
In this paper we characterize the information in a single unit’s state at
time t by enumerating its mutual information quantities with all possible sets
of initial unit states (t=0). We generate initial unit states independently for

the elementary cellular automata (ECA) application. Then we characterize


“information processing” as the progression of a unit’s vector of information
quantities over time (see Methods). The rationale behind this is as follows.
The information in each initial unit state will be unique by construction,
that is, have zero redundancy with all other initial unit states. Future unit
states depend only on previous unit states and ultimately on the initial unit
states (there are no outside forces). “Processing” refers, by our definition,
to the fact that the initial (unique) pieces of information can be considered
to disperse through the system in different directions and at different levels
(synergy), while some of it dissipates and is lost. We can exactly trace all
these directions and levels or every bit of information in the ECA due to the
uniqueness of the initial information by construction. Therefore we would
argue that we can then fully quantify the “information processing” of a
system, implicitly, without knowing exactly which (physical) mechanism
is actually responsible for this. We anticipate that this is a useful abstraction
which will aid in distinguishing different emergent behaviors without being
distracted by physical or mechanistic details. We first test whether this
notion of information processing could be used to predict complex emergent
behavior in the theoretical framework of ECA, under ideal conditions by
construction. Next we also test if information processing could be used to
detect a difference of systemic behavior in real financial time-series data,
namely, the regimes before and after the 2008 crisis, despite the fact that
obviously this data does not obey the strict ideal conditions.
The study of “information processing” in complex dynamical systems is
a recently growing research topic. Although information theory has already
been applied to dynamical systems such as elementary cellular automata,
including, for instance, important work by Langton and Grassberger [2,
3], here we mean by “information processing” a more holistic perspective
of capturing all forms of information simultaneously present in a system.
As illustrative examples, Lizier et al. propose a framework to formulate
dynamical systems in terms of distributed “local” computation: information
storage, transfer, and modification [4] defined by individual terms of the
Shannon mutual information sum (see (3)). For cellular automata they
provide evidence for the long-held conjecture that so-called particle
collisions are the primary mechanism for locally modifying information, and
for a networked variant they show that a phase transition is characterized by
the shifting balance of local information storage over transfer [5]. A crucial
difference with our work is that we operate in the ensemble setting, as is
usual for Shannon information theory, whereas Lizier et al. study a single
realization of a dynamical system, for a particular initial state. (Although
time-series data is strictly speaking a single realization, ensemble estimates
are routinely made from such data by using sliding windows; see Methods.)
Beer and Williams trace how task-relevant information flows through a
minimally cognitive agent’s neurons and environment to ultimately be
combined into a categorization decision [6] or sensorimotor behavior [7],
using ensemble methods. Studying how local interactions lead to multiscale
systemic behavior is also a domain which benefits from information-
theoretic approaches, such as those by Bar-Yam et al. [8, 9], Quax et al.
[10, 11], and Lindgren [12]. Finally, extending information theory itself to
deal with complexity, multiple authors are concerned with decomposing a
single information quantity into multiple constituents, such as synergistic
information, including James et al. [13], Williams and Beer [14], Olbrich et
al. [15], Quax et al. [16], Chliamovitch et al. [17], and Griffith et al. [18, 19].
Although a general consensus on the definition of “information synergy” is
thus still elusive, in this paper we circumvent this problem by focusing on
the special case of independent input variables, in which case a closed-form
formula (“whole-minus-sum”) is well-known and used.

METHODS

Notational Conventions
Constants and functions are denoted by lower-case Roman letters. Stochastic
variables are denoted by capital Roman letters. Feature vectors are denoted
by Greek letters.

Model of Dynamical Systems


In general we consider discrete-time, discrete-state Markov dynamics.
Let X^t = (X_1^t, …, X_N^t) denote the stochastic variable of the system
state, defined as the sequence of N unit states at time t. Each unit chooses
its new state locally according to the conditional probability distribution
Pr(X_i^{t+1} | X^t), encoding the microscopic system mechanics, where i
identifies the unit. The state space of each unit is equal and denoted by
the set Σ. We assume that the number of units, the system mechanics,
and the state space remain unchanged over time. Finally we assume that
all unit states are initialized identically and independently (i.i.d.); that is,
Pr(X^0) = Π_i Pr(X_i^0) with all marginals equal. The latter ensures that all correlations in future
system states are generated by the interacting units and not an artifact of the
initial conditions.

Elementary Cellular Automata


Specifically we focus on the set of 256 elementary cellular automata
(ECA) which are the simplest discrete spatiotemporal dynamical systems
possible [20]. Each unit has two possible states and chooses its next state
deterministically using the same transition rule as all other cells. The next
state of a cell deterministically depends only on its own previous state and
that of its two nearest neighbors, forming a line network of interactions.
That is,

Pr(X_i^{t+1} | X^t) = Pr(X_i^{t+1} | X_{i−1}^t, X_i^t, X_{i+1}^t).   (1)
There are 256 possible transition rules and they are numbered 0 through
255, denoted r ∈ {0, …, 255}. As initial state we take the fully random state so that
no correlations exist already at t=0; that is, Pr(X_i^0 = 0) = Pr(X_i^0 = 1) = 1/2 for all
r and all i. The evolution of each cellular automaton is fully deterministic
for a given rule, implying that the conditional probabilities in (1) can only
be either 0 or 1. (This is nevertheless not a necessary condition in general.)
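As a concrete illustration of the rule numbering and the local update (1), the following minimal Python sketch (our own illustration, not code from the original study; the helper names rule_table and step are ours) builds a rule's transition table from its number and applies one synchronous update to an i.i.d. random configuration on a ring of N=15 cells:

import numpy as np

def rule_table(rule_number):
    """Map each of the 8 neighborhood patterns (left, self, right) to the next state.

    Bit k of the rule number gives the output for the neighborhood whose
    binary encoding equals k, following the standard Wolfram numbering.
    """
    return {((k >> 2) & 1, (k >> 1) & 1, k & 1): (rule_number >> k) & 1 for k in range(8)}

def step(state, table):
    """One synchronous update of a periodic (ring) ECA configuration."""
    left = np.roll(state, 1)
    right = np.roll(state, -1)
    return np.array([table[(l, c, r)] for l, c, r in zip(left, state, right)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.integers(0, 2, size=15)      # i.i.d. uniform initial state, N = 15
    table = rule_table(110)               # rule 110, a Wolfram class-4 rule
    x1 = step(x0, table)
    print(x0, x1, sep="\n")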

Quantifying the Information Processing in a Dynamical Model

Basics of Information Theory


We characterize each new unit state, determined probabilistically by
Pr(X_i^{t+1} | X^t), by a sequence of Shannon communication channels,
where each channel communicates information from a subset of the unit
states. In general, a communication channel between two stochastic
variables A and B is defined by the one-way interaction A → B and is
characterized by the amount of information about the state A which transfers
to the state B due to this interaction. The average amount of information
stored in the sender’s state A is determined by its marginal probability
distribution Pr(A), which is known as its Shannon entropy:

H(A) = −Σ_a Pr(A = a) log2 Pr(A = a).   (2)
After a perfect, noiseless transmission, the information at the receiver B
would share exactly H(A) bits with the information stored at the sender A.
After a failed transmission the receiver would share zero information with
the sender, and for noisy transmission their mutual information is somewhere
in between. This is quantified by the so-called mutual information:

I(A : B) = Σ_{a,b} Pr(a, b) log2 [ Pr(a, b) / (Pr(a) Pr(b)) ].   (3)
The conditional variant obeys the chain rule H(A, B) = H(B) + H(A | B)
and is written explicitly as

H(A | B) = −Σ_{a,b} Pr(a, b) log2 Pr(a | b).   (4)
This denotes the remaining entropy (uncertainty) of A given that the value
for B is observed. For intuition it is easily verified that in the case of statistical
independence, that is, Pr(A, B) = Pr(A) Pr(B), we have H(A | B) = H(A), which
makes I(A : B) = 0, meaning that B contains zero information about A. At
the other extreme, B = A would make H(A | B) = 0 so that I(A : B) = H(A),
meaning that B contains the maximal amount of information needed to
determine a unique value of A.
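For intuition, the short sketch below (a minimal illustration of ours, not part of the original analysis) evaluates (2)-(4) numerically for a small joint probability table of two binary variables, using the identity I(A:B) = H(A) + H(B) − H(A,B):

import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability table (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_ab):
    """I(A:B) = H(A) + H(B) - H(A,B) for a joint probability table p_ab[a, b]."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1)
    p_b = p_ab.sum(axis=0)
    return entropy(p_a) + entropy(p_b) - entropy(p_ab)

if __name__ == "__main__":
    # A noisy binary channel: A is uniform, B copies A with probability 0.9.
    p_ab = np.array([[0.45, 0.05],
                     [0.05, 0.45]])
    h_a = entropy(p_ab.sum(axis=1))
    i_ab = mutual_information(p_ab)
    print(f"H(A) = {h_a:.3f} bits, I(A:B) = {i_ab:.3f} bits, H(A|B) = {h_a - i_ab:.3f} bits")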

Characterizing the Information Stored in a Unit’s State


First we characterize the information stored in a unit’s state X_i^t at time step t
as the ordered sequence of mutual information quantities
with all possible sets of unit states at time t=0; that is,

( I(X_i^t : S) )_{S ∈ P(X^0)}.   (5)

Here P(X^0) denotes the (ordered) power set notation for all subsets of
stochastic variables of initial cell states. (Note though that in practice not
infinitely many initial cell states are needed; for instance, for an ECA at time
t only the nearest 1+2t initial cell states are relevant.) We will refer to (5)
as the sequence of information features of unit i at time t. The subscript
notation implies that the rule-specific (conditional) probabilities Pr_r(X_i^{t+1} | X^t)
are used to compute the mutual information. We
use the subscript i for generality to emphasize that this feature vector pertains
to each single unit (cell) in the system, even though in the specific case of
ECA this subscript could be dropped as all cells are indistinguishable.
In particular we highlight the following three types of information
features. The “memory” of unit i at time t is defined as the feature
I(X_i^t : X_i^0), that is, the amount of information that the unit retains
about its own initial state. The “transfer” of information is defined as
nonlocal mutual information such as I(X_i^t : X_{i+1}^0). Nonlocal
mutual information must be due to interactions because the initial states are
independent (all pairs of units have zero mutual information). Finally we
define the integration of information as “information synergy,” an active
research topic in information theory [4, 14, 16, 19, 21–23]. The information
synergy in X_i^t about X^0 is calculated here by the well-known whole-minus-
sum (WMS) formula I(X_i^t : X^0) − Σ_j I(X_i^t : X_j^0). The WMS measure directly
implements the intuition of subtracting the information carried by individual
variables from the total information. However the presence of correlations
among the X_j^0 would be problematic for this measure, in which case it can
become negative. In this paper we prevent this by ensuring that the X_j^0 are
uncorrelated. In this case it fulfills various proposed axiomatizations for
synergistic information known thus far, particularly PID [14, 15] and SRV
[16].
Information synergy (or “synergy” for short) is not itself a member of the
feature sequence (5), but it is fully redundant given (5) since each of its terms is in
(5). Therefore we will treat synergy features as separate single features in our
results analysis while we do not add them to (5).
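Because the initial states are i.i.d. and the ECA update is deterministic, these three feature types can be computed exactly at t=1 by enumerating the 8 equally likely neighborhood configurations of a cell. The sketch below is our own illustration of that calculation (the function names are ours); it returns the memory, the total transfer, and the whole-minus-sum synergy of a single cell for a given rule:

import itertools
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mi_from_pairs(pairs):
    """Mutual information (bits) between two discrete variables given equally
    likely (a, b) samples; here the 8 neighborhood configurations."""
    pairs = list(pairs)
    n = len(pairs)
    joint = {}
    for ab in pairs:
        joint[ab] = joint.get(ab, 0) + 1.0 / n
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0) + p
        p_b[b] = p_b.get(b, 0) + p
    return entropy(list(p_a.values())) + entropy(list(p_b.values())) - entropy(list(joint.values()))

def t1_features(rule_number):
    """Memory, total transfer, and WMS synergy of one cell at t = 1."""
    out = {(l, c, r): (rule_number >> (l * 4 + c * 2 + r)) & 1
           for l, c, r in itertools.product((0, 1), repeat=3)}
    configs = list(out)                              # 8 equally likely initial neighborhoods
    x1 = [out[cfg] for cfg in configs]
    memory = mi_from_pairs(zip((c for _, c, _ in configs), x1))
    t_left = mi_from_pairs(zip((l for l, _, _ in configs), x1))
    t_right = mi_from_pairs(zip((r for _, _, r in configs), x1))
    total = mi_from_pairs(zip(configs, x1))          # I(X_i^1 : all three initial states)
    synergy = total - (memory + t_left + t_right)    # whole-minus-sum formula
    return memory, t_left + t_right, synergy

if __name__ == "__main__":
    for rule in (110, 30, 90, 204):
        m, t, s = t1_features(rule)
        print(f"rule {rule:3d}: memory={m:.3f}  transfer={t:.3f}  synergy={s:.3f}")

For example, rule 204 (the identity rule) yields memory 1 bit and zero synergy, while rule 90 (XOR of the two neighbors) yields zero memory and transfer but 1 bit of synergy.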

Predicting the Class of Dynamical Behavior Using Information Processing Features
We require a classification of (long-term) dynamical behavior of the systems
under scrutiny which is to be predicted by the information features. In this
paper we choose the popular Wolfram classification for the special case of
ECA.

Behavioral Class of a Rule


Wolfram observed empirically that each rule tends to
evolve from a random initial state to one of only four different classes of
dynamical behavior [20]. These de facto established behavioral classes
are (1) homogeneous (all cells end up in the same state); (2) periodic (a small
cycle of repeating patterns); (3) chaotic (pseudorandom patterns); (4) complex
(locally stable behavior and long-range interactions among patterns).
These classes are conventionally numbered 1 through 4, respectively.
We obtained the class number for all 256 rules from Wolfram Alpha [24]
and denote it by c(r). When the rule number is treated as a
stochastic variable it will be denoted R; similarly, if the class number is
treated as a stochastic variable it will be denoted C_R.

Predictive Power of the Information Processing Features


We are interested in the inference problem of predicting the class number
C_R based on the observed information features. Here the rule number
R is considered a uniformly random stochastic variable, making in turn
functions of R, such as the feature values, also stochastic variables. We formalize the prediction
problem by the conditional probabilities of C_R given the observed features. That is, given
only the sequence of information features of a specific (but unknown) rule
at time t, what is the probability that the ECA will eventually exhibit
behavior of class c? We can interpret this problem as a communication
channel and quantify the predictive power of the feature sequence using
the mutual information between it and C_R. The predictive power is thus zero
in case the information features do not reduce the uncertainty about
C_R, whereas it achieves its maximum value H(C_R) in case a sequence of
information features always uniquely identifies the behavioral class.
We will normalize the predictive power by dividing by H(C_R).
For the Wolfram classification, H(C_R) is lower than the
maximum possible (2.0 bits) since there are relatively many rules with class 2
behavior and not many complex rules.

Note that a normalized predictive power of, say, 0.75 does not necessarily
mean that 75% of the rules can be correctly classified. Our definition yields
merely a relative measure where 0 means zero predictive power, 1 means
perfect prediction, and intermediate values are ordered such that a higher
value implies that a more accurate classification algorithm could in principle
be constructed. The benefit of our definition based on mutual information is
that it does not depend on a specific classifier algorithm; that is, it is model-
free. Indeed, the use of mutual information as a predictor of classification
accuracy has become the de facto standard in machine learning applications
[25, 26].
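In practice the normalized predictive power can be computed directly from the empirical joint distribution of (discretized) feature vectors and class labels; the sketch below is our own minimal illustration of that calculation on toy data, not code from the original study:

from collections import Counter
import math

def entropy(counts):
    """Shannon entropy in bits of a frequency table."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def normalized_predictive_power(features, classes):
    """I(features : class) / H(class) for discrete feature vectors.

    `features` is a list of hashable feature vectors (e.g. tuples of rounded
    information features), `classes` the corresponding class labels.
    """
    h_c = entropy(Counter(classes))
    h_f = entropy(Counter(features))
    h_fc = entropy(Counter(zip(features, classes)))
    return (h_f + h_c - h_fc) / h_c

if __name__ == "__main__":
    # Toy example: a binary feature that partially identifies the class.
    feats = [(0,), (0,), (1,), (1,), (1,), (0,)]
    labels = [2, 2, 3, 4, 3, 2]
    print(f"normalized predictive power = {normalized_predictive_power(feats, labels):.3f}")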

Selecting the Principal Features


Some information features are more predictive than others for determining
the behavioral class of a rule. Therefore we perform a feature selection
process at each time t to find these “principal features” as follows. First we
extend the set of information features (5) by the following set of synergy
features:

( I(X_i^t : S) − Σ_{j ∈ S} I(X_i^t : X_j^0) )_{S ∈ P(X^0), |S| ≥ 2}.   (6)
Their concatenation makes the extended ordered feature set, which we will
refer to as (7), now written in the form of stochastic variables when the rule
number is treated as random. The extended feature set (7) has no additional predictive power
compared to (5), so for any inference task (5) and (7) are equivalent.
That is, the synergy features are completely redundant given (5),
since each of their terms is a member of (5). The reason for adding them
separately to form (7) is that they have a clear meaning as information
which is stored in a collection of variables while not being stored in any
individual variable. We are interested to see whether this phenomenon plays
a significant role in generating dynamical behaviors.
We define the first principal feature at time t as maximizing its individual
predictive power, quantified by a mutual information term as explained
before, as

arg max_F I(F : C_R),  where F ranges over the extended feature set (7).   (8)

Here, again, rule number R is treated as a uniformly random stochastic
variable with Pr(R = r) = 1/256, which in turn makes the feature values
stochastic variables. In words, (8) is the single most predictive information
feature about the behavioral class that will eventually be generated. More
generally, the principal set of n features is identified in similar spirit; namely,

arg max_{F_1, …, F_n} I(F_1, …, F_n : C_R),  where the F_k range over the extended feature set (7).   (9)
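The searches in (8) and (9) amount to scanning candidate feature subsets and keeping the one with the highest mutual information with the class label; for the small ECA feature sets an exhaustive scan is feasible. The sketch below is our own brute-force illustration on toy data (function names are ours):

from collections import Counter
from itertools import combinations
import math

def entropy(counts):
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """I(X:Y) in bits for two aligned lists of discrete (hashable) values."""
    return entropy(Counter(xs)) + entropy(Counter(ys)) - entropy(Counter(zip(xs, ys)))

def principal_features(feature_matrix, classes, n):
    """Exhaustively find the n-feature subset maximizing I(subset : class).

    `feature_matrix[r]` is the tuple of (discretized) feature values of rule r.
    """
    n_features = len(feature_matrix[0])
    best, best_mi = None, -1.0
    for subset in combinations(range(n_features), n):
        projected = [tuple(row[j] for j in subset) for row in feature_matrix]
        mi = mutual_information(projected, classes)
        if mi > best_mi:
            best, best_mi = subset, mi
    return best, best_mi

if __name__ == "__main__":
    # Toy data: 3 candidate features for 6 "rules"; the class follows feature 2.
    rows = [(0, 1, 0), (0, 0, 0), (1, 1, 1), (1, 0, 1), (0, 1, 1), (1, 0, 0)]
    labels = [2, 2, 4, 4, 4, 2]
    subset, mi = principal_features(rows, labels, 1)
    print(f"best single feature: {subset}, I = {mi:.3f} bits")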

Information-Based Classification of Rules


The fact that Wolfram’s classification relies on the behavior exhibited by a
particular initial configuration makes the complexity class of an automaton
dependent on the initial condition. Moreover, there is no universal agreement
regarding how “complexity” should be defined and various alternatives to
Wolfram’s classification have been proposed, although Wolfram’s remains
by far the most popular. Our hypothesis is that the complexity of a system
has very much to do with the way it processes information. Therefore we
also attempt to classify ECA rules using only their informational features.
We use a classification algorithm which takes as input the 256 vectors
of information features and computes the Euclidean distance between
these vectors. The two vectors nearest to each other are clustered together.
Then the remaining nearest elements or clusters are clustered together. The
distance between two clusters is defined as the distance between the most
distant elements in each cluster. The result is a hierarchy of clusters with
different distances which we visualize as a dendrogram.
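The procedure just described is standard complete-linkage (farthest-point) agglomerative clustering on Euclidean distances, for which off-the-shelf routines exist. The sketch below (our own, using random placeholder vectors in place of the actual 256 information-feature vectors) shows the mechanics with SciPy:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Placeholder: in the paper these would be the 256 information-feature vectors,
# one per ECA rule; here random vectors are used just to show the mechanics.
rng = np.random.default_rng(1)
features = rng.random((256, 11))          # 11 features at t = 1 (7 MI + 4 synergy)

# 'complete' linkage: the distance between clusters is the largest pairwise
# distance, matching the "most distant elements" rule described in the text.
Z = linkage(features, method="complete", metric="euclidean")

plt.figure(figsize=(10, 4))
dendrogram(Z, no_labels=True)
plt.ylabel("cluster distance")
plt.tight_layout()
plt.show()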

Computing Information Processing Features in Foreign Exchange Time-Series
In the previous section we define information processing features for the
simplest (one-dimensional) model of discrete dynamical systems. In the
second part of this paper we aim to investigate if information features can
distinguish “critical” regimes in the real complex dynamical system of
the foreign exchange market. Most importantly, we are interested in the
behavior of the information features before, at, and after the start of the 2008
financial crisis, which is commonly taken to coincide with the bankruptcy
of Lehman Brothers on September 15, 2008. We consider two types of time-
series datasets in which the dynamical variables can be interpreted to form
a one-dimensional system in order to stay as close as possible to the ECA
modeling approach.
The information features can then be computed as discussed above,
except that each mutual information term is now estimated directly from the
data. This estimation is performed within a sliding window of length w up
to time point t which enables us to see how the information measures evolve
over time t. For instance, the memory of variable X at time t′ will be measured
as I(X_{t′−1} : X_{t′}), where the joint probability distribution Pr(X_{t′−1}, X_{t′})
is estimated using only the data points inside the window ending at t′. Details regarding
the estimation procedure are given in the following subsection. The ith time-
series in a dataset will be denoted by subscript as in X_i.

Estimating Information Processing Features from the Data


The mutual information between two financial variables (time-series) at
time t is estimated using the k-nearest-neighbor algorithm using the typical
setting k=3 [27]. This estimation is calculated using a sliding window of size
w leading up to and including time point t, after first detrending each time-
series using log-returns. For all results we will use 200 uniformly spaced
values for t over the dataset, starting at datapoint w and ending at the length
of the dataset. Thus windows partially overlap.

We calculate the “memory” (M) of a time-series i as I(X_i^{t−1} : X_i^t),
and the average “transfer” (T) as the mutual information I(X_j^{t−1} : X_i^t)
averaged over the neighboring series j of i. That is,
whereas in the ECA model we calculated the mutual information quantities
with respect to the initial state of the model, here we use consecutive time
points, effectively treating X^{t−1} as the initial state and characterizing only
the single time step from t−1 to t.
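The following sketch is our own illustration of this sliding-window estimation (with synthetic stand-in series rather than the actual FX data); it detrends by log-returns and uses scikit-learn's k-nearest-neighbor mutual information estimator (mutual_info_regression, which reports values in nats) with k=3 as a stand-in for the estimator of [27]:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

def log_returns(series):
    """Detrend a positive-valued price/rate series via log-returns."""
    series = np.asarray(series, dtype=float)
    return np.diff(np.log(series))

def memory_and_transfer(returns, t, w, k=3):
    """Sliding-window memory and average transfer of each series at index t.

    `returns[i]` is the detrended i-th series; the window covers the w points
    ending at t. Mutual information is estimated with a k-NN estimator
    (mutual_info_regression), here with k = 3 neighbors as in the text.
    """
    n_series = len(returns)
    memory, transfer = np.zeros(n_series), np.zeros(n_series)
    for i in range(n_series):
        past = returns[i][t - w + 1:t]                  # values at time t'-1
        present = returns[i][t - w + 2:t + 1]           # values at time t'
        memory[i] = mutual_info_regression(past.reshape(-1, 1), present,
                                           n_neighbors=k, random_state=0)[0]
        neigh_mis = []
        for j in (i - 1, i + 1):                        # line-graph neighbors
            if 0 <= j < n_series:
                neigh_past = returns[j][t - w + 1:t]
                neigh_mis.append(mutual_info_regression(
                    neigh_past.reshape(-1, 1), present,
                    n_neighbors=k, random_state=0)[0])
        transfer[i] = float(np.mean(neigh_mis))
    return memory, transfer

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # Hypothetical stand-in for five FX closing-rate series of equal length.
    prices = np.exp(np.cumsum(rng.normal(0, 0.01, size=(5, 2000)), axis=1))
    rets = [log_returns(p) for p in prices]
    M, T = memory_and_transfer(rets, t=1500, w=1400)
    print("memory:", np.round(M, 3))
    print("transfer:", np.round(T, 3))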
For calculating the synergy measure (S) we apply a correction which
makes this measure distributed around zero. The reason is that the WMS
measure of (6) assumes independence among the stochastic (initial) state
variables Si, which for the real data are taken to be the previous day’s time-
series values. When this assumption is violated, it can become strongly
negative and, more importantly, cointegrated with the memory and transfer
features whose sum will then dominate the synergy feature. We remedy
this by rescaling the sum of the memory and transfer features which are
subtracted in (6) to equal the average value of the total information (positive
term in (6)). In formula, a constant c is inserted into the WMS formula,
leading to

I(X_i^t : S) − c Σ_j I(X_i^t : S_j)   (10)

for a given set of initial cell states S. c is fitted such that this WMS measure
is on average 0 for all sliding windows over the dataset. This rejects the
cointegration null-hypothesis between total information and the subtracted
term at the 0.05 significance level in this dataset. This
results in the synergy feature being distributed around zero and being
independent of the sum of the other two features so that it may functionally
be used as part of the feature space for feature selection; however, the value
itself should not be trusted as quantifying precisely the notion of synergy.
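One way to implement the described fit of the constant c, given per-window estimates of the total information and of the subtracted sum, is sketched below (our own illustration with synthetic numbers; the authors' exact fitting procedure may differ in detail):

import numpy as np

def corrected_synergy(total_info, individual_info_sums):
    """Rescaled whole-minus-sum synergy, cf. (10).

    `total_info[t]` is the estimated total mutual information for one sliding
    window and `individual_info_sums[t]` the corresponding sum of individual
    mutual information terms. The constant c is chosen so that the corrected
    synergy is exactly zero on average over all windows.
    """
    total_info = np.asarray(total_info, dtype=float)
    individual_info_sums = np.asarray(individual_info_sums, dtype=float)
    c = total_info.mean() / individual_info_sums.mean()
    return total_info - c * individual_info_sums, c

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    # Hypothetical per-window estimates (e.g. 200 sliding windows).
    whole = 1.0 + 0.1 * rng.standard_normal(200)
    parts = 1.5 + 0.1 * rng.standard_normal(200)
    synergy, c = corrected_synergy(whole, parts)
    print(f"fitted c = {c:.3f}, mean corrected synergy = {synergy.mean():.3e}")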

Description of the Foreign Exchange Data


The first data we consider are time-series of five foreign exchange (FX)
daily closing rates (EUR/USD, USD/JPY, JPY/GBP, GBP/CHF, and CHF/
EUR) for the period from January 1, 1999, to April 21, 2017 [28]. Each
currency pair has a causal dependence on its direct neighbors in the order
listed because they share a common currency. For instance, if the EUR/
USD rate changes then USD/JPY will quickly adjust accordingly (among
others) because the rate imbalance can be structurally exploited for profit. In
turn, among others through the rate JPY/EUR (not observed in this dataset)
the rate EUR/USD will also be adjusted due to profit-making imbalances,
eventually leading to all neighboring rates returning to a balanced situation.

Description of the Interest-Rate Swap Data


The second data are interest-rate swap (IRS) daily rates for the EUR and
USD market [11]. The data spans over twelve years: the EUR data from
January 12, 1998, to August 12, 2011, and the USD data from April 29,
1999, to June 6, 2011. The datasets consist of 14 and 15 times to maturity
(durations), respectively, ranging from 1 year to 30 years. Rates for nearby
maturities have a dependency because the higher maturity can be constructed
by the lower maturity plus a delayed (“forward”) short-term swap. This basic
mechanism between maturities leads to generally monotonically upward
“swap curves.”

RESULTS

Predicting the Wolfram Class of ECA Rules Using Information Processing Features

Information Processing in the First Time Step


The information processing occurring in the first time step of each ECA rule
is characterized by the corresponding extended feature set,
consisting of 7 time-delayed mutual information quantities (see (5))
and 4 synergy quantities (see (6)). We show three selected features
(memory, total transfer, and total synergy) for all 256 rules as points in a
vector-space in Figure 1 along with each rule’s Wolfram class as a color code.
It is apparent that the three features already partially separate the behavioral
classes. Namely, it turns out that chaotic and complex rules tend to have
high synergy, low information memory, and low information transfer. Figure
1 also relates intuitively to the classic categorization problem in machine
learning; namely, perfect prediction would be equivalent to the existence
of hyperplanes that perfectly separate all four behavior classes. In the case
of ECA the information features are deterministic calculations for each rule
number. Thus the feature set forms a discrete distribution of 256 points, such that
separability implies that no two rule numbers fall on exactly the same point
in this information space.


Figure 1: Three selected features from the t=1 feature set for each ECA rule,
namely, the memory (M), the total transfer (T), and the total synergy (S).
Each dot corresponds to a rule r and is color-coded by its Wolfram class c(r),
namely, black for the simple homogeneous (24) and periodic behaviors (196),
green for complex behavior (10), and red for chaotic behavior (26). The trans-
parency of a point indicates its distance away from the viewer, with more trans-
parent points being farther away. A small random vector with average norm 0.02
is added to each point in order to make rules with equal information features
still visible. The gray plus signs are the projections of the 3D points on the two
visible side faces (the S-M face is occluded) for better visibility of the positions
of the points.

Predictive Power of Information Processing Features Over Time


The single most predictive information feature is synergy, as shown in
Figure 2. Its predictive power is about 0.37 (where 1.0 would mean perfect
prediction). The most predictive pair of features is formed by adding
information transfer at 0.43, so adding the information transfer feature
increases the predictive power by 0.06. The information transfer feature by
itself has actually over three times this predictive power at 0.19, showing
that two or more features can significantly overlap in their prediction of the
behavioral class. The total predictive power of all information processing
features at t=1 is 0.49, formed by 4 of the 11 possible information features.
(Figure 2: three panels, (a) t=1, (b) t=2, and (c) t=3; vertical axis: normalized predictive power; horizontal axis: the selected feature sets.)

Figure 2: Predictive power of the optimal set of n information features as function


of n (thick blue line with round markers). For each n the best set of information
features is listed: S means “synergy” and I means mutual information, followed
by a bitmask of which neighbor cell’s initial state is included in the feature
(middle bit is the cell itself). For example, I010 indicates the “memory” feature
I(X_i^1 : X_i^0), and S111 denotes the whole-minus-sum synergy of all three initial neighborhood states about the new cell state. The
thin black line with error bars is the “base line” predictability for n features,
obtained by randomizing the pairing of 256 information features with their class
identifier. The error bar indicates the 95% confidence interval of the distribution
of predictive power under the null-hypothesis of zero correlation between infor-
mation features and class identifier. Finally, the small black markers indicate the
predictive powers of all other information feature sets.
For the second time step (Figure 2) we again find that the most predictive
information feature is synergy. An intriguing difference however is that it is
now significantly more predictive at 0.90. This means that already at t=2 there
is a single information characteristic of dynamical behavior (i.e., synergy),
which explains the vast majority of the entropy of the behavioral
class that will eventually be exhibited. A second intriguing difference is that
the maximum predictive power of 0.98 is now achieved using only 3 out of
57 possible information features, where 4 features were needed at t=1.
Finally, for t=3 we find that only 2 information features are needed to
achieve the maximum possible predictive power of 1.0; that is, the values
for these two features uniquely identify the behavior class. Firstly this
confirms the apparent trend that fewer information features capture more of
the relevant dynamical behavior as time t increases. Secondly we find again
that synergy is the single most predictive feature. In addition, we find again
that the best secondary feature is a peculiar combination of memory and the
two longest-range transfers, as in t=2. Including the intermediate transfers
(so adding I1111111 instead of I1001001 as second feature) actually only
slightly convolutes the prediction: adding them in t=2 reduces predictive
power by 0.028, whereas in t=3 it does not reduce the predictive power at
all. In t=1 there are no intermediate transfers possible since there are only
three predecessors of a cell’s state, and apparently then it pays off to leave
out memory (which would reduce power by 0.025 if added).
One could argue that the quick separation of the points in information
space is hardly surprising because a high-dimensional space is used to
separate only a small number (256) of discrete points. To validate that the
predictive power values of the information features are indeed meaningful
we also plot the expected “base line” prediction power in each subfigure in
Figure 2 along with the 95% confidence interval. The base line is the null-
hypothesis formed by randomizing the pairing of information feature values
with class identifiers; that is, it shows the expected predictive power of having
the same number and frequencies of feature values but sampled with zero
correlation with the classification to make the separability meaningless. This
results in a statistical test with a null-hypothesis of the information features
and Wolfram classification being uncorrelated. We find that the predictive
power of the information features is always significantly above the base
distributions. Therefore we consider the meaningfulness (or “surprise”) of
the information features’ separability validated; that is, we reject the null-
hypothesis at the 95% confidence level that the observed quick separation in
information space is meaningless and merely due to dimensionality.

Relation to Langton’s Parameter


Langton’s λ parameter [2] is the most well-known single feature of an ECA
rule which partially predicts the Wolfram class. It is a single scalar computed
for rule r as the fraction of the 8 neighborhood configurations that map to
state 1. It is known that the λ parameter is
more effective for a larger state space and a larger number of interactions;
however we briefly highlight it here because of its widespread familiarity
and because the information processing measures can be written in terms of
“generalized” λ parameters. This means that λ’s relation with the Wolfram
classification is captured within the information features, implying that the
information is minimally as predictive as features based on λ parameter(s).
Indeed, the normalized predictive power of λ alone is 0.175,
which is significantly lower than the information synergy alone which
achieves 0.361 at t=1.
2(a) the vast majority of information features have higher predictive power
than 0.175; in fact only three single features have slightly lower power. It
is not surprising yet reassuring that the information features outperform the
Langton parameter.
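For reference, for an ECA rule the λ parameter reduces to the fraction of set bits in the rule number, as in the following short sketch (our own illustration):

def langton_lambda(rule_number):
    """Langton's lambda for an ECA rule: the fraction of the 8 neighborhood
    configurations that map to the non-quiescent state 1, i.e. the fraction
    of set bits in the rule number."""
    return bin(rule_number & 0xFF).count("1") / 8.0

if __name__ == "__main__":
    for rule in (0, 110, 30, 204, 255):
        print(f"rule {rule:3d}: lambda = {langton_lambda(rule):.3f}")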

Information Processing-Based Clustering


In the previous sections we showed how information features predict
Wolfram’s behavioral classification. In this section we investigate the
hierarchical clustering of rules induced by the information features in
their own right. One important reason for studying this is the fundamental
problem of undecidability of CA classifications based on long-term behavior
characteristics [29, 30], such as for the Wolfram classification. In the best
case this makes the classification difficult to compute; in the worst case
the classification is impossible to compute, leading to questions about its
meaning. In contrast, if local information features correlate strongly with
long-term emergent behaviors, then a CA classification scheme based on
information features is practically more feasible. In this section we visualize
how the clustering overlaps with Wolfram’s classification.
Figure 3(a) shows a clustering made using information features evaluated
between t=0 and t=1, where t=0 is the randomized state. Interestingly, consistent with the features’ low
predictive power for the Wolfram class at t=1, the resulting clustering also does not
overlap at all with the Wolfram classification. We have to make an exception
for rules 60, 90, 105, and 150, which are all chaotic rules and share the same
information processing features.


Figure 3: Dendrograms displaying two clusterings of information processing


features: (a) the information measures are computed between time t=0 and t=1,
with no correlation between sites at time t=0, and (b) when information process-
ing features are evaluated numerically between time t and t+1 once a steady
state is reached. The color code is the same as in Figure 1 except that rules in
class I are displayed in blue to distinguish them from class II.
Figure 3(b), on the other hand, displays the clustering for the case where
features are not evaluated with respect to the randomized state but to the
stationary distribution (i.e., Xt=0 is a randomly selected system state from
the cyclic attractor). One reason for this is to ignore the initial transient
phase; another reason is to make a step toward the financial application in
the next subsection, which obviously does not start from random initial
conditions. By “stationary” we mean that we simulate the CA dynamics
long enough until the time-shifted information features no longer have a
changing trend. Feature values can no longer be calculated exactly and thus
are estimated numerically by sampling initial conditions for a system of size N=15.
In that case we find that the clustering has increased overlap with Wolfram’s
classification. In particular, we can note that uniform rules cluster together
and that chaotic and complex rules are all on the same large branch. However
the agreement is far from perfect. For instance, the branch bearing chaotic
and complex rules also bears periodic rules. Note also that rules 60, 90,
105, and 150 are indistinguishable when considered from the information
processing viewpoint, even though they exhibit chaotic patterns that can be
visually distinguished from each other. On the contrary, rules 106 and 154
are very close to each other and the pattern they exhibit indeed shows some
similarities, but the former is complex while the latter is periodic.
Note that using this clustering scheme all but one of the rules converging to a uniform
pattern are close to each other in the information feature space.
The remaining one, rule 168, has a transient regime which is essentially
dominated by a translation of the initial state. This unexpected behavior is
due to rare initial conditions (e.g., …110110110…) that are present in our
exact calculation with the same weight as all other initial conditions but have
a strong impact on the information processing measure. This translational
regime can be found as well in rules 2 and 130, which are classified in
the same sub-branch as rule 168. The similarity of any information feature
(information transfer in this case) can thus lead to rules whose behavior
differs in other respects to get classified similarly.

Detecting a Regime Shift in Financial Time-Series


The results for the ECA models are promising but are under ideal and
controlled conditions by construction: independent initial states, no external
influences, and an enumerable list of dynamics. It is natural to ask whether
the same formalism could eventually turn out valuable when analyzing
real data of real complex systems, despite the fact that they do not obey
such ideal conditions and cannot be controlled. A positive answer would
add more urgency to further studying the idea of “information processing”
systematically in models beyond ECA and toward real systems and data. A
negative answer on the other hand could hint toward the idea being restricted
to the specific realm of ECA and perhaps some similar simplistic models, but
not being useful for studying real systems, in which case further systematic
study would be less urgent. It is therefore important to find out whether
there can be any hope of such a positive answer, which is the purpose of this
section to demonstrate. More systematic studies remain nevertheless needed
to understand how and why information processing features are predictive
of emergent behaviors.
We focus on financial data because it is of high quality and available
in large quantities. Also at least one large regime shift is known: the 2008
financial crisis, separating a precrisis and a postcrisis regime presumably
by a “tipping point” phase of high systemic instability. We set the date of
the financial crisis on September 15, 2008, which is the date of the Lehman
Brothers bankruptcy.
We focus on two sequences of time-series: daily IRS rates of 14 and
15 maturities in the USD and EUR markets, and five consecutive daily FX
closing exchange rates. We selected these datasets because the variables in
each dataset can be considered to form a line graph similar to ECA rules,
staying as close as possible to the previous analysis. Also, these two markets
play a major role in today’s global economy: the IRS market is the largest
financial derivatives market, whereas the FX market is by far the largest
trading market in terms of volume. Even though it remains yet unclear how
exactly the crisis was driven by the different markets, we assume that we
can at least measure a regime shift or a growing instability in each dataset.
At this point the information feature values are not (yet) understood in
their own right, and the absolute value of the synergy feature is not meaningful
because it is rescaled to avoid cointegration (see Methods). Nevertheless the
information features do offer an alternative and potentially complementary
way to parameterize the state space of the financial market, as opposed to
using directly the observations themselves (interest or exchange rates in this
case). For financial markets this is especially important because structures
with predictive power quickly disappear from the system once observed,
which is a rather unique property of financial markets. As Scheffer et al. [31]
phrase it: “In this field, the discovery of predictability quickly leads to its
elimination, as profit can be made from it, thereby annihilating the pattern.”
Our proposed parameterization in terms of information features may not yet
be exploited on a large scale. This potentially means that the financial crisis
may become detected or even anticipated when using standard model-free
instability (or tipping point) indicators applied to the information features
time-series, as opposed to using the original financial time-series data itself
directly, which we explore in this section. In the end we propose a new
model-free tipping point indicator for multivariate, nonstationary time-
series applied to the main information features.

Foreign Exchange Market


In Figure 4 we show the 3-dimensional “information feature space” with the
same axes as Figure 1. We observe a remarkable separation of the precrisis and
postcrisis periods, which are connected by a single transition trajectory
immediately following the crisis date (black circle). Looking more closely,
we also observe that early in the precrisis regime the information features
traverse steadily but surely through the entire blue attractor (dark and
medium blue dots). Directly preceding the regime shift the information
features appear more clustered in one region in the lower part of the attractor
(light blue dots), without a clear general direction. Soon after this “noisy
stationary” phase there is evidently a clear direction again when it traverses
from the blue to the red attractor. In the red attractor the system appears to
steadily traverse through the attractor again; that is, it appears stationary on
longer time scales but nonstationary on shorter time scales.


Figure 4: 200 time points showing the progression of the three information fea-
tures memory (M), transfer (T), and integration (S) computed with a time delay
of 1 day (similar to t=1 for ECA). The color indicates the time difference with
September 15, 2008 (big black dot), which we consider the starting point of
the 2008 crisis, from dark blue (long before) to dark red (long after) and white
at the crisis date. The data spans from January 1, 1999, to April 21, 2017; the
large green dot is the last time point also present in the IRS data in 2011. In this
information space we clearly observe signs of two attractor regimes separated
by a sudden regime shift. Mutual information is calculated using a sliding
window of w=1400 days; the 200 windows partially overlap and are placed
uniformly over the dataset, where the first and last window include the first and
last day of the dataset, respectively. The gray plus signs are the projections of
the 3D points on the visible side faces for better visibility of the positions of
the points.
Interestingly, this behavior resembles to some extent the dynamics
observed for the so-called tipping points [31] where a system is slowly
pushed to an unstable point and then “over the hill” after which it progresses
quickly “downhill” to the next attractor state. This is relevant because slow
progressions to tipping points offer a potential for developing an early-
warning signal.

Interest-Rate Swap Market


In Figure 5 we show the same feature space for the IRS markets in EUR
and USD (EURIBOR and LIBOR). In short, interest-rate swaps consist
of transactions exchanged between two agents such that their opposing
risk exposures are effectively canceled. In contrast to the FX market, in
IRS we observe the completely different scenario of steady nonstationary
progressions of the information features during most of the duration of the
dataset. One possible explanation is that these markets had not yet settled
into an equilibrium, as they are relatively young markets (1986 and 1999
in their present form) continually influenced by market reforms and policy
changes. A second possible explanation is that the contracts traded in this
market are relatively long-term contracts, covering periods from a few
months to a few decades, influencing subsequent traded contracts, whereas
FX trades are instant and do not involve contracts.


Figure 5: 200 time points showing the progression of the three information
features memory (M), transfer (T), and synergy (S) computed with a time delay
of 1 day (similar to t=1 for ECA). The color indicates the time difference with
September 15, 2008 (big black dot), which we take as the starting point of the
2008 crisis, from dark blue (long before) to dark red (long after) and white
at the crisis date. The data spans more than twelve years: the EUR data from
January 12, 1998, to August 12, 2011, and the USD data from April 29, 1999,
to June 6, 2011. Mutual information is calculated using a sliding window of
w=1400 days; the 200 windows partially overlap and are placed uniformly over
the dataset, where the first and last window include the first and last day of the
dataset, respectively. The gray plus signs are the projections of the 3D points on
the visible side faces for better visibility of the positions of the points.
Yet another possible but hypothetical explanation for this is that the
IRS markets could have been (part of) a slow but steady driving factor in
the global progression to the crisis, perhaps even building up a financial
“bubble,” whereas the FX market may have been more exogenously
forced toward their regime shift from one attractor to another. Indeed, the
progression to the 2008 crisis is often explained by referring at least to
substantial losses in fixed income and equity portfolios followed by the US
subprime home loan turmoil [32], suggesting at least a central role for the
trade of risks concerning interest rates in USD. The exact sequence of events
leading to the 2008 crisis is however still debated among financial experts.
Our numerical analyses may nevertheless help to shed light on interpreting
the relative roles of different markets.
In any case, in the EUR plot we observe that a steady and fast progression
is likewise followed by a short “noisy stationary” period where there seems
to be no general direction, after which a new and almost orthogonal direction
is followed after the crisis. The evolution after the crisis is much more noisy,
in the form of larger deviations around the general direction. In the USD we
do not observe a brief stationary phase before the crisis, but we do observe
larger deviations as well around the general directions, mostly sideways
from the viewpoint of this plot. The market does contain two directional
changes but these do not occur closely around the crisis point. We do not
speculate here about their possible causes.

Potential Indicator for the Financial Crisis


All in all we observe in all three financial markets that the information
features form a multidimensional trajectory which progresses in a (locally
or globally) nonstationary manner. We also observe that around the crisis
point the behavior appears characterized by increased variations around the
same general trend (USD IRS) or by variations around a decreasing general
trend. In this section we propose an instability indicator of this phase which
can be applied to all three cases.
Several model-free (leading) indicators have been previously developed
for time-series data in order to detect an (upcoming) tipping point for
complex systems in general [31, 33, 34], such as for ecosystems, climate
systems, and biological systems. By model-free we mean that the indicator
is not developed especially for a particular dataset or domain and can easily
be extended to other domains. These model-free instability indicators are
computed from time-series and include critical slowing down, variance, and
correlation. However the financial system remains notoriously resilient to
such analyses. One possible explanation for this is that the financial system
has a tight feedback loop on its own state, and any known indicator with
predictive power would soon be exploited and thus lead to behavioral
changes in the market as long as it is present.
Regardless of the underlying reason, it has been shown that well-known
model-free instability indicators hardly or not at all detect or predict the
financial crisis. For instance, critical slowing down, variance, and correlations
do not form a (leading) indicator in the same IRS data [11]. Babecký et al.
[35] find that currency and banking crises are also hard to predict and resort
to combinations of model-specific measures such as worsening government
balances and falling central bank reserves. The only promising exception
known to the present authors as a leading (early-warning) indicator is one
based on specific network-topological features of interbank credit exposures
(binary links), which shows detectable changes several years before the crisis
[36]. Although it is specifically developed for interbank exposures, it could
potentially be generalized to other complex systems as well, in cases where
similar network data can be inferred. This remains nevertheless untested.
All in all, the lack of progress inspired a number of renowned experts in
complexity and financial systems [37] recently to call for an increased effort
to bring new complexity methods to financial markets in order to better
stabilize them.
Here we develop a new tipping point indicator for multidimensional
data and test it on the sequence of information features of the three
datasets. The currently well-known indicators are developed for univariate,
stationary time-series and thus cannot be directly applied to (nonstationary)
multivariate time-series, such as our information features.

We aim to generalize upon the idea of the variance indicator [34] which
appears the most feasible candidate for multidimensional time-series. In
contrast, computing critical slowing down involves computing correlations,
which requires a large, combinatorially increasing amount of data as the
number of dimensions grows. In short, the idea of the variance indicator is that
prior to a tipping point the stability of the current attractor decreases, leading
to larger variation in the system state, until the point where the stability
is sufficiently low such that natural variation can “tip” the system over to
a different attractor. This indicator is typically applied to one-dimensional
system states, such as species abundance in ecosystems or carbon dioxide
concentrations in climate, where the behavior in each attractor is assumed to
be (locally) stationary.
A natural generalization of variance (or standard deviation) to higher
dimensions is the average centroid distance: the average Euclidean distance
of a set of points to their average (centroid). Since the centroid distance also
increases when there is a directed general trend, which we wish to consider
as natural behavior, we divide by the distance traversed by this general trend.
The result in words is then the average centroid distance per unit length of
trend. That is, for a sequence of state vectors x_{t−ℓ+1}, …, x_t, in
our case information features, our indicator is defined as

w_t = ( (1/ℓ) Σ_{k=t−ℓ+1}^{t} ‖x_k − x̄_t‖ ) / ‖x_t − x_{t−ℓ+1}‖,  where x̄_t = (1/ℓ) Σ_{k=t−ℓ+1}^{t} x_k.   (11)
Here, ℓ is the number of data points up to time t used in order to
compute the indicator value at time t. Ideally, ℓ should typically be as
low as possible in order to provide an accurate local description of the
system’s stability near time t, but not too low such that mostly noise effects
are measured and/or the general trend cannot be distinguished effectively.
To further filter out noise effects and study the indicator progression on
different time scales, we use an averaging sliding window of g preceding
indicator values to finally compute the indicator value at time t; that is,

w̄_t = (1/g) Σ_{j=0}^{g−1} w_{t−j}.   (12)
Note that using these two subsequent sliding windows (ℓ and g) is not
equivalent to simply increasing ℓ by g and then not averaging. To illustrate,
imagine that the multivariate time-series forms a circle of ℓ+g points.


Using a small ℓ value relative to g will recover the fact that the circle is
locally (almost) a straight line for each time t (low value in (11)), after which
taking an average of g indicator values will result in a relatively low value
in (12). In the extreme case of a straight line with uniformly spaced points,
w_t tends to 1/4. In contrast, increasing ℓ by g and setting g=1 means that
the point x_t returns back adjacent to x_{t−ℓ+1} (so small denominator value in
(11)) but a large average distance to the centroid (the radius of the circle). In
the extreme case of a fully closed circle the indicator diverges. This example makes
it clear that ℓ is preferably as low as possible to capture the short-term
behavior; since this decreases the signal-to-noise ratio, we subsequently
average over g most recent values.
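A sketch of this indicator, following the reconstruction in (11) and (12) above (our own implementation; the original may differ in detail), is given below; the sanity check at the end verifies that a straight, uniformly spaced trajectory yields a value close to 1/4:

import numpy as np

def centroid_distance_indicator(x, window, g):
    """Normalized centroid-distance instability indicator, cf. (11)-(12).

    `x` is an array of shape (T, d) of feature vectors (here d = 3: memory,
    transfer, synergy). For each t, the average distance of the last `window`
    points to their centroid is divided by the length of the net displacement
    over that window; the raw values are then smoothed by averaging the g
    most recent ones.
    """
    x = np.asarray(x, dtype=float)
    raw = np.full(len(x), np.nan)
    for t in range(window - 1, len(x)):
        pts = x[t - window + 1:t + 1]
        centroid = pts.mean(axis=0)
        spread = np.linalg.norm(pts - centroid, axis=1).mean()
        trend = np.linalg.norm(pts[-1] - pts[0])
        raw[t] = spread / trend if trend > 0 else np.inf
    smoothed = np.full(len(x), np.nan)
    for t in range(window - 1, len(x)):
        recent = raw[max(window - 1, t - g + 1):t + 1]   # first values average fewer than g points
        smoothed[t] = np.nanmean(recent)
    return raw, smoothed

if __name__ == "__main__":
    # Sanity check: a straight line with uniform spacing gives roughly 1/4.
    line = np.outer(np.arange(200), np.ones(3))
    raw, _ = centroid_distance_indicator(line, window=100, g=10)
    print(f"straight line -> raw indicator ≈ {raw[-1]:.3f} (expected about 1/4)")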
Figure 6 shows the indicator values for all three datasets during
roughly 7 (IRS) and 12 years (FX) around the crisis date. Strikingly, the
short-term plots (Figure 6(a), g=10) show that in all three cases there is a
strong and sharp peak around the time of the crisis. For the FX market this
peak precedes the crisis by almost one year; for the IRS market the peaks
are just after the crisis (USD, 2-3 months) or well after the crisis (EUR,
almost one year). Although there are also a few other peaks (EUR and FX),
briefly discussed further below, it is reassuring that the indicator is capable
of clearly detecting the financial crisis period around the Lehman Brothers
collapse. The difference in timing is intriguing but not further studied here.

(Figure 6 panels: IRS (left, USD and EUR curves) and FX (right); (a) normalized centroid distance with g=10, (b) with g=50; horizontal axis: trade days since the Lehman Brothers bankruptcy.)

Figure 6: The proposed instability indicator (12) calculated for all three datas-
ets. (a) is for small sliding window size (g=10) and (b) for a long sliding win-
dow size (g=50), showing the indicator both on short and long time scales. The
indicator is computed from the sequence of 200 information feature vectors in
Figures 4 and 5. The sliding window size (g) is illustrated by the gray bar in the
top left of the right-hand panels. The first g-1 indicator values are averages of
fewer than g preceding values. One year corresponds to about 250 trade days.
Note that the x-axes left and right are different. ℓ=10.
Note that the w indicator values have an intuitive interpretation. A value
of 1/4 means that the multivariate time-series progresses in a perfectly
straight line with uniform spacing. At another extreme, if the points are
perfectly distributed in a symmetrically round cloud around the initial
point, then w tends to unity on average. If there is a directed trend but the
orthogonal deviations are larger than the trend vector, then w > 1.
It is common to study instability indicators at a larger time scale in
order to detect or even predict the largest events, ignoring (smoothing out)
smaller events. In particular the hope is to find a leading indicator which
could be used to anticipate the 2008 onset of recession. We show the same
indicator but now averaged over g=50 values in Figure 6(b). Remarkably, all
three datasets show a discernible, long-term steady growth in the instability
indicator leading up and through the crisis date. For the EUR and FX curves
this growth starts around two years before the crisis; for the USD curve the
growth starts at the start of the curve. Although here the initial peak in the
EUR curve appears to even outweigh the crisis-time peak, we must note
here that this peak is subject to less smoothing as there are fewer than g
values to the left available for averaging (compare with the sliding window
size depicted in gray); with further averaging all other peaks will continue
to decrease in height (cf. Figure 6(a)), whereas this initial peak will remain
roughly at the same value for this reason.
We will now discuss two significant additional strong peaks observable
in the indicator curves: an initial peak in EUR (around August-September
2004) and a late peak in FX (mid-2016). We caution that it is hardly scientific
to reason back from observed peaks toward potential underlying causes,
especially for continually turbulent systems such as the financial markets
where events are easy to find. Nevertheless it is important to evaluate
whether the additional peaks at least potentially could indicate substantial
systemic instabilities, or whether they appear likely to be false positives.
For the EUR initial peak we refer to ECB’s Euro Money Market Study,
May 2004 report. We find that this report has indeed an exceptionally negative
sentiment compared to other years, speaking of “declining and historically
low interest rates,” an inverted yield curve, “high geopolitical tensions in
the Middle East and the associated turbulence in oil prices and financial
markets,” and “growing pessimism with regard to economic growth in the
euro area.” Also: “The ECB introduced some changes to its operational
framework which came into effect starting from the main refinancing
operation conducted on 9 March 2004.” In contrast, the subsequent report
(2006) is already much more optimistic: “After two years of slow growth,
the aggregated turnover of the euro money market expanded strongly in
the second quarter of 2006. Activity increased across all money market
segments except in cross-currency swaps, which remained fairly stable.”
We deem it at least plausible that the initial EUR indicator peak, which has
about half the height of the after-crisis peak, is a true positive and detects
indeed a period of increased systemic instability or stress.
For the more recent FX peak across 2016 we must refer to news articles
such as in Financial Times [38, 39]. Firstly there were substantial and largely
unanticipated political shifts, including Brexit (dropping Sterling by over
20%) and the election of Trump as US President. At the same time, articles
mention fears about China’s economic growth slowing down. Lastly, as
interest rates affect associated currencies: “By August, global [bond yield]
benchmarks were at all-time lows, led by the 10-year gilt yielding a paltry
0.51 per cent, while Switzerland’s entire bond market briefly traded below
zero. […] The universe of negative yielding debt had swollen to $13.4tn.”
For example, earlier in the year (January 29), the Bank of Japan unexpectedly
started to take their interest rates into the negative for the first time, affecting
the Yen. Questions toward another recession are mentioned, although also
discarded. All in all, we deem it at least plausible that the FX market’s
indicator peak in 2016 could be caused indeed by systemic instability and
stress, that is, a true positive.
All in all we deem the proposed “normalized centroid distance” instability
indicator as a high potential candidate for multivariate, nonstationary time-
series. Secondly we argue that parameterizing a market state in terms of
information features instead of the original observations (interest or exchange
rates) is useful and enables detecting growing systemic instability. However
we must caution that our financial data only contains one large-scale onset
of recession (2008), so it is difficult to provide conclusive validation that
such events are detected reliably by the proposed indicator. Future work
may include applying the indicator to different simulated systems which can
be driven toward a large-scale regime shift.
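
For readers who wish to experiment with such an indicator, a minimal sketch in Python is given below. It treats each estimation window as a point in information-feature space and reports how far the current point lies from the centroid of its recent history; the window length, the per-feature normalization, and the Euclidean distance are illustrative assumptions for exposition, not necessarily the exact choices behind the results reported above.

import numpy as np

def centroid_distance_indicator(features, window=52):
    """Illustrative instability indicator for a multivariate feature series.

    features : array of shape (T, d), one row of information features per
               estimation window (e.g., memory, transfer, and synergy terms).
    window   : number of past feature vectors forming the reference set
               (an assumption made for this sketch).
    """
    features = np.asarray(features, dtype=float)
    T, d = features.shape
    indicator = np.full(T, np.nan)
    for t in range(window, T):
        past = features[t - window:t]            # recent history of feature vectors
        centroid = past.mean(axis=0)             # centroid of that history
        spread = past.std(axis=0) + 1e-12        # per-feature scale for normalization
        z = (features[t] - centroid) / spread    # normalized deviation from the centroid
        indicator[t] = np.linalg.norm(z) / np.sqrt(d)
    return indicator

A sustained rise of this quantity well above its historical range would then play the role of the indicator peaks discussed above.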

DISCUSSION
Our working assumption is that dynamical systems inherently process
information. Our leading hypothesis is that the way that information is
locally processed determines the global emergent behavior. In this article
we propose a way to quantitatively characterize the notion of information
processing and assess its predictive power of the Wolfram classification of
the eventual emergent behavior of ECA. We also make a “leap of faith”
to real (financial) time-series data and find that transforming the original
time-series to an information features time-series enables detection of the
2008 financial crisis by a simple leading indicator. Since it is known that the
original data does not permit such detection, this suggests that novel insights
may be gained even in real data of complex systems, despite not obeying
the ideal conditions of our ECA approach. This warrants a further systematic
study of this notion of information processing in different types of models
and eventually datasets.
Our formalization builds upon Shannon’s information theory, which
means that we consider an ensemble of state trajectories rather than a single
trajectory. That is, we do not quantify the information processing that occurs
during a particular, single sequence of system states (attempts in this
direction are pursued by Lizier et al. [40]). Rather, we consider the ensemble of all
possible state sequences along with their probabilities. One way to interpret
this is that we quantify the “expected” information processing averaged over
multiple trajectories. Another way to interpret it is that we characterize a
dynamical model in its totality, rather than a particular symbolic sequence
of states of a model. Our reasoning is that if (almost) every state trajectory
of a model (such as a CA rule) leads to a particular emergent behavior (such
as chaotic or complex patterning), then we would argue that the emergent
behavior is a function of the ensemble dynamics of the model.
This seems at odds with computing information features from real
time-series, which are measurements of a single trajectory of a system.
We resolve this issue by assuming “local stationarity.” This assumption is
common in time-series analysis and used (implicitly or explicitly) in most
“sliding window” approaches and moving statistic estimations, among
others. In other words, we assume that the rate of sampling data points is
significantly faster than the rate at which the underlying statistical property
changes, which in our case are the information features. The consequence
is that a finite number of consecutive data points can be used to estimate the
probability distribution of the system at the corresponding time, which in
turn enables estimating mutual information quantities.
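
As a concrete illustration of this estimation step, the sketch below re-estimates a pairwise mutual information inside a sliding window after discretizing the observations; the window length and the equal-frequency binning are assumptions made for exposition (in practice a continuous estimator such as that of Kraskov et al. [27] may be preferable).

import numpy as np

def mutual_information(x, y, bins=4):
    """Plug-in mutual information estimate (in bits) between two samples."""
    # Equal-frequency binning: an assumption made for this sketch.
    edges_x = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    edges_y = np.quantile(y, np.linspace(0, 1, bins + 1)[1:-1])
    xd, yd = np.digitize(x, edges_x), np.digitize(y, edges_y)
    joint = np.zeros((bins, bins))
    for a, b in zip(xd, yd):
        joint[a, b] += 1.0
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask])).sum())

def sliding_window_mi(x, y, window=100):
    """Re-estimate the mutual information in each window, assuming the samples
    within a window are drawn from one (locally stationary) distribution."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.array([mutual_information(x[t - window:t], y[t - window:t])
                     for t in range(window, len(x) + 1)])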
Our first intriguing result from the ECA analysis is that fewer information
features capture more of the relevant dynamical behavior, as time t progresses
away from a randomized system state. One potential explanation is the
processing of correlated states or, equivalently, of overlapping information.
Namely, to reach its state at t=1, each cell operates exclusively on uncorrelated inputs,
so the resulting state distribution is a direct product of the state transition
table, irrespective of how cells are connected to each other. Neighboring cell
states at t=1 are correlated due to overlapping neighborhoods in this network
of connections. Consequently, at time t=2 and beyond, the inputs to each cell
have become correlated in a manner dictated by the interaction topology.
The way in which an ECA rule subsequently deals with these correlations
is evidently an important characteristic. In other words, two ECA rules may
have exactly the same information features for t=1 but different features for
t=2, which must be due to a different way of handling correlated states.
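
This effect can be made explicit with a small enumeration: starting from i.i.d. uniform initial cells, two adjacent cells at t=1 share two of their three inputs, so their joint distribution generally deviates from the product of the marginals. The toy calculation below (an illustration only, not the estimator used in our analysis) computes this pairwise mutual information directly from the rule table; rule 110 yields a nonzero value, whereas rule 60, whose output is an XOR of its left and center inputs, happens to yield zero in this particular pairwise check.

import numpy as np
from itertools import product

def rule_table(rule_number):
    # Wolfram convention: the output for neighborhood value k is bit k of the rule number.
    bits = [(rule_number >> k) & 1 for k in range(8)]
    return {tuple(int(b) for b in np.binary_repr(k, 3)): bits[k] for k in range(8)}

def neighbor_mi_at_t1(rule_number):
    """Mutual information (bits) between two adjacent cells at t=1,
    starting from i.i.d. uniform random initial cells."""
    table = rule_table(rule_number)
    joint = np.zeros((2, 2))
    # Two adjacent cells depend on four initial cells: (a, b, c) and (b, c, d).
    for a, b, c, d in product([0, 1], repeat=4):
        left, right = table[(a, b, c)], table[(b, c, d)]
        joint[left, right] += 1.0 / 16.0
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask])).sum())

print(neighbor_mi_at_t1(110))  # > 0: overlapping neighborhoods induce correlation
print(neighbor_mi_at_t1(60))   # 0.0: rule 60 outputs an XOR of independent inputs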
This result leads us to hypothesize an information locality concept.
That is, the information features at t=1 of each cell do not yet characterize
the interaction topology (by the correlations it induces). In other words, all
interaction topologies where each cell has 3 neighbors are indistinguishable
at t=1. This suggests that the “missing” predictive power at t=1 is a measure
of the relevance of the interaction topology. In the case of ECA this quantity
is roughly half: 1-0.49=0.51, where 0.49 is the maximal predictive power at
t=1. For the sake of illustration, suppose that the eventual behavior exhibited
by a class of systems depends crucially and only on the number of neighbors
at distance 5. In this case we expect that the predictive power of the
information features does not reach 1.0 until at least t=5, since otherwise
the cell states have not had the opportunity to be causally influenced by
the local network structure at distance 5. If no further network effects at
larger distances play a role, then we expect the predictive power to reach
exactly 1.0 at t=5. For ECA we find 1.0 predictive power already at t=3,
suggesting that there are no nonlocal network features which play a role in
the dynamics, which is indeed true for ECA (all ECA are uniform, infinite-
length line graphs). We hypothesize that the distance, that is, the time step t at
which the predictive power no longer increases, is a measure of the “locality”
of topological characteristics that are relevant for the emergent behavior.
However we leave it for future work to fully investigate this concept.
Our second intriguing result is that the most predictive information
feature is invariably synergy. In each time step it accounts for the vast
majority of the total predictive power (75%, 92%, and 96%, resp.). This
is the feature that we would consider to actually capture the “processing”
or integration of information, rather than the memory and transfer features
which capture the simple “copying” of information. Indeed, the cube of
Figure 1 suggests that the interesting behaviors (chaotic and complex) are
associated with high synergy, low memory, and low transfer. In this extreme
we find rule 60 (the XOR rule) and similar rules, all of which are chaotic.
For complex behavior, nonzero but low memory and transfer appear to be
necessary ingredients.
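
The XOR example can be made fully explicit. For Z = X XOR Y with X and Y independent and uniform, each input on its own carries zero information about the output, yet the two inputs jointly determine it. The short check below reproduces this textbook extreme case of synergy; it illustrates the concept only and is not the synergy measure used in our analysis.

import numpy as np
from itertools import product

def mi_bits(joint):
    """Mutual information (in bits) from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask])).sum())

# Z = X XOR Y with X, Y independent and uniform over {0, 1}.
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]
p = 1.0 / len(outcomes)

pxz = np.zeros((2, 2)); pyz = np.zeros((2, 2)); pxy_z = np.zeros((4, 2))
for x, y, z in outcomes:
    pxz[x, z] += p        # joint distribution of (X, Z)
    pyz[y, z] += p        # joint distribution of (Y, Z)
    pxy_z[2 * x + y, z] += p  # joint distribution of ((X, Y), Z)

print(mi_bits(pxz))    # 0.0 bit: X alone says nothing about Z
print(mi_bits(pyz))    # 0.0 bit: Y alone says nothing about Z
print(mi_bits(pxy_z))  # 1.0 bit: X and Y together fully determine Z (pure synergy)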
The good separation of the dynamic behavioral classes in the ECA
models using only a few information features ultimately leads to the
question whether the same can be done for real systems based on real data.
This is arguably a large step and certainly more rigorous research should be
done using intermediate models of increasing complexity and for different
classifications of dynamical behavior. On the other hand, if promising results
could be obtained from real data using a small set of information features,
then this would add more urgency to such future research, even if the role
of information processing features in systemic behavior is not yet fully
understood. This is the purpose of our application to financial data. Financial
data is of high quality and available in large quantities, and at least one large
regime shift is known, namely, the 2008 financial crisis. We stay as close as
possible to ECA by selecting two datasets in which the dynamical variables
could be interpreted to form a line graph. Although of course many forces
which act on the markets are not contained in these datasets, the information
features may still be able to detect changes in the correlations over time,
despite not knowing the root cause of these changes. In fact, a
primary driver behind our approach is indeed the abstraction of physical or
mechanistic details while still capturing the emergence of different types
of behaviors. We consider our results in the financial application promising
enough to warrant further study into information processing features in
complex system models and other real datasets. Our results suggest tipping
point behavior for the FX and EUR IRS markets and a possible driving role
for the USD IRS market.
All in all we conclude that the presented information processing concept
appears indeed to be a promising direction for studying how dynamical
systems generate emergent behaviors. In this paper we present initial results
which support this. Further research may identify concrete links between
information features and various types of emergent behaviors, as well as
the relative impact of the interaction topology. Our lack of understanding
of emergent behaviors is exemplified by the ECA model: it is arguably the
simplest dynamical model possible, and the choice of local dynamics
(rule) and initial conditions fully determines the emergent behavior that is
eventually generated. Nevertheless, even in this case no theory exists that
predicts the latter from the former. The information processing concept may
eventually lead to a framework for studying how correlations behave in
dynamical systems and how this leads to different emergent behaviors.

ACKNOWLEDGMENTS
Peter M. A. Sloot and Rick Quax acknowledge the financial support of
the Future and Emerging Technologies (FET) Programme within Seventh
Framework Programme (FP7) for Research of the European Commission,
under the FET-Proactive grant agreement TOPDRIM, no. FP7-ICT-318121.
All authors also acknowledge the financial support of the Future and
Emerging Technologies (FET) Programme within Seventh Framework
Programme (FP7) for Research of the European Commission, under the
FET-Proactive grant agreement Sophocles, no. FP7-ICT-317534. Peter M.
A. Sloot acknowledges the support of the Russian Scientific Foundation,
Project no. 14-21-00137.
REFERENCES
1. T. M. Cover and J. A. Thomas, Elements of Information Theory, vol. 6,
John Wiley & Sons, 1991.
2. C. G. Langton, “Computation at the edge of chaos: phase transitions
and emergent computation,” Physica D: Nonlinear Phenomena, vol.
42, no. 1–3, pp. 12–37, 1990.
3. P. Grassberger, “Toward a quantitative theory of self-generated
complexity,” International Journal of Theoretical Physics, vol. 25, no.
9, pp. 907–938, 1986.
4. J. T. Lizier, M. Prokopenko, and A. Y. Zomaya, “Information
modification and particle collisions in distributed computation,”
Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 20, no.
3, Article ID 037109, 2010.
5. J. T. Lizier, M. Prokopenko, and A. Y. Zomaya, “The information
dynamics of phase transitions in random boolean networks,” in
Proceedings of the 11th International Conference on the Simulation
and Synthesis of Living Systems: Artificial Life XI, ALIFE 2008, pp.
374–381, 2008.
6. R. D. Beer and P. L. Williams, “Information processing and dynamics
in minimally cognitive agents,” Cognitive Science, 2014.
7. E. J. Izquierdo, P. L. Williams, and R. D. Beer, “Information flow
through a model of the C. elegans klinotaxis circuit,” https://arxiv.org/
ftp/arxiv/papers/1603/1603.03552.pdf.
8. Y. Bar-Yam, D. Harmon, and Y. Bar-Yam, “Computationally tractable
pairwise complexity profile,” Complexity, vol. 18, no. 5, pp. 20–27,
2013.
9. B. Allen, B. C. Stacey, and Y. Bar-Yam, “An Information-Theoretic
Formalism for Multiscale Structure in Complex Systems,” https://
arxiv.org/abs/1409.4708v1.
10. R. Quax, A. Apolloni, and P. M. A. Sloot, “The diminishing role of
hubs in dynamical processes on complex networks,” Journal of the
Royal Society Interface, vol. 10, no. 88, 2013.
11. R. Quax, D. Kandhai, and P. M. A. Sloot, “Information dissipation as
an early-warning signal for the Lehman Brothers collapse in financial
time series,” Scientific Reports, vol. 3, article no. 1898, 2013.
12. K. Lindgren, “An information-theoretic perspective on coarse-graining,
including the transition from micro to macro,” Entropy, vol. 17, no. 5,
pp. 3332–3351, 2015.
13. R. G. James, C. J. Ellison, and J. P. Crutchfield, “Anatomy of a bit:
Information in a time series observation,” Chaos: An Interdisciplinary
Journal of Nonlinear Science, vol. 21, no. 3, Article ID 037109, 2011.
14. P. L. Williams and R. D. Beer, “Nonnegative decomposition of
multivariate information,” https://arxiv.org/abs/1004.2515v1.
15. E. Olbrich, N. Bertschinger, and J. Rauh, “Information decomposition
and synergy,” Entropy, vol. 17, no. 5, pp. 3501–3517, 2015.
16. R. Quax, O. Har-Shemesh, and P. M. A. Sloot, “Quantifying synergistic
information using intermediate stochastic variables,” Entropy, vol. 19,
no. 2, article no. 85, 2017.
17. G. Chliamovitch, B. Chopard, and L. Velasquez, “Assessing
complexity by means of maximum entropy models,” https://arxiv.org/
abs/1408.0368.
18. V. Griffith, E. K. P. Chong, R. G. James, C. J. Ellison, and J. P.
Crutchfield, “Intersection information based on common randomness,”
Entropy, vol. 16, no. 4, pp. 1985–2000, 2014.
19. V. Griffith and T. Ho, “Quantifying redundant information in predicting
a target random variable,” Entropy, vol. 17, no. 7, pp. 4644–4653, 2015.
20. S. Wolfram, A New Kind of Science, Wolfram Media, Champaign, Ill,
USA, 2002.
21. R. D. Beer and P. L. Williams, “Information processing and dynamics
in minimally cognitive agents,” Cognitive Science, vol. 39, no. 1, pp.
1–38, 2015.
22. N. Timme, W. Alford, B. Flecker, and J. M. Beggs, “Synergy, redundancy,
and multivariate information measures: an experimentalist’s
perspective,” Journal of Computational Neuroscience, vol. 36, no. 2,
pp. 119–140, 2014.
23. E. Schneidman, W. Bialek, and M. J. Berry II, “Synergy, Redundancy,
and Independence in Population Codes,” The Journal of Neuroscience,
vol. 23, no. 37, pp. 11539–11553, 2003.
24. Wolfram|Alpha: Computational Knowledge Engine, 2015. http://
www.wolframalpha.com/.
25. R. Battiti, “Using mutual information for selecting features in supervised
neural net learning,” IEEE Transactions on Neural Networks and
Learning Systems, vol. 5, no. 4, pp. 537–550, 1994.
26. T. W. S. Chow and D. Huang, “Estimating optimal feature subsets
using efficient estimation of high-dimensional mutual information,”
IEEE Transactions on Neural Networks and Learning Systems, vol.
16, no. 1, pp. 213–224, 2005.
27. A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual
information,” Physical Review E, vol. 69, no. 6, article 066138, 2004.
28. http://www.global-view.com/forex-trading-tools/forex-history/.
29. K. Culik II and S. Yu, “Undecidability of CA classification schemes,”
Complex Systems, vol. 2, no. 2, pp. 177–190, 1988.
30. K. Sutner, “Computational classification of cellular automata,”
International Journal of General Systems, vol. 41, no. 6, pp. 595–607,
2012.
31. M. Scheffer, J. Bascompte, W. A. Brock et al., “Early-warning signals
for critical transitions,” Nature, vol. 461, no. 7260, pp. 53–59, 2009.
32. M. Melvin and M. P. Taylor, “The crisis in the foreign exchange
market,” Journal of International Money and Finance, vol. 28, no.
8, pp. 1317–1330, The Global Financial Crisis: Causes, Threats and
Opportunities, 2009.
33. V. Dakos, M. Scheffer, E. H. Van Nes, V. Brovkin, V. Petoukhov, and
H. Held, “Slowing down as an early warning signal for abrupt climate
change,” Proceedings of the National Acadamy of Sciences of the
United States of America, vol. 105, no. 38, pp. 14308–14312, 2008.
34. V. Dakos, E. H. van Nes, R. Donangelo, H. Fort, and M. Scheffer,
“Spatial correlation as leading indicator of catastrophic shifts,”
Theoretical Ecology, vol. 3, no. 3, pp. 163–174, 2010.
35. J. Babecký, T. Havránek, J. Matějů, M. Rusnák, K. Šmídková, and B.
Vašíček, “Banking, debt, and currency crises in developed countries:
Stylized facts and early warning indicators,” Journal of Financial
Stability, vol. 15, pp. 1–17, 2014.
36. T. Squartini, I. Van Lelyveld, and D. Garlaschelli, “Early-warning
signals of topological collapse in interbank networks,” Scientific
Reports, vol. 3, article no. 3357, 2013.
37. S. Battiston, J. D. Farmer, A. Flache et al., “Complexity theory and
financial regulation: Economic policy needs interdisciplinary network
analysis and behavioral modeling,” Science, vol. 351, no. 6275, pp.
818–819, 2016.
38. The big events that shook financial markets in 2016. https://www.
ft.com/content/6d24125c-c066-11e6-9bca-2b93a6856354.
39. Major Economic Events 2016: What Moved the Markets This Year?
https://www.orbex.com/blog/2016/12/major-economic-events-2016-
moved-markets-year-infographics.
40. J. T. Lizier, The Local Information Dynamics of Distributed
Computation in Complex Systems, Springer, 2012.
INDEX

A
Amazon 56, 65
Analytics software 159
Application programming interface (API) 259
Artificial intelligence 4, 6, 15
Artificial neural network (ANN) 5, 6, 17
Audit analysis model 202
Audit Data Analytics (ADAs) 95
Audit firms 91, 92

B
Backpropagation (BP) algorithm 238
Basic local alignment search tool (BLAST) 108
Basket Analysis 265
B-cell 104, 105, 106, 107, 108, 110, 112, 113, 114, 115, 116, 117, 118, 120
Big data 4, 10, 15, 19, 20, 21, 22
Big data analytics 87, 88
Big data framework 189
Big Data Science 189
Big data technology 202
Biomedical engineering 328
Business intelligence (BI) 87, 94
Business intelligence software 159
Business organizations 188

C
Cellular immunity 104
Cloud computing 259
Cloud computing technology 203
Cloud storage 259
Clustering 217, 261
Clustering analysis 277
Communication 314, 318, 324, 325
Comprehensive analysis 216
Coordination process performance 319, 320
Cost reduction 157
Customer Profiling 265
Customer relationship management (CRM) 94
Customer relationship management (CRM) systems 94
Cybersecurity 25, 26
Cytotoxic T-lymphocytes (CTLs) 104

D
Data acquisition 102
Data Analytics 25, 26
Database Management Systems 55, 56, 58, 59, 82
Data clustering 277
Data information acquisition 202
Data, information, knowledge, and wisdom (DIKW) 102
Data mining 159, 216, 217, 220, 221, 222, 224, 227
Data mining approaches 232
Data mining technology 203
Datanodes 275
Data selection 264
Data transformation 264
Data Warehouses 55, 56, 58, 61
Decision support databases 56, 57, 61, 77, 78, 79
Decision Tree 216, 217, 218, 219, 226, 227
Deep learning 159
Deep neural network (DNN) 92
Deletion 234
Denial of service (DoS) 31
Digital informatization 289
Digital sports 292, 309
Distributed database 203
Distributed file system 275
Domain-specific languages (DSLs) 57
Drug development 233
Dynamical systems 353, 354, 355, 357, 362, 380, 383

E
E-commerce 257, 258, 259, 260, 262, 263, 264, 265, 267, 270
Education 215, 216, 217, 218, 221, 222, 223, 224, 227
Electronic health record (EHR) 235
Electronic measuring instruments 328
Elementary cellular automata (ECA) 354, 355, 357
Entity-Relationship (ER) 59, 61
Entity Relationship (ER) Model 61
Epidermal growth factor receptor (EGFR) 108, 110
Ethnography 174, 175
ETL (Extract-Transform-Load) 192
Euclidean distance 297, 298
Extensible storage system 203

F
Facebook 56, 73, 81, 126
Foreign exchange (FX) 354, 364

G
Gaussian transformation 296
Genetic clustering 329, 345, 346
Google 56, 67

H
Hadoop Distributed File System (HDFS) 195
Healthcare organizations 4
Healthcare system 4, 6, 19, 20
Health services 215
Hospital information system 233
Human immune system 103, 104
Human Immunology Project Consortium (HIPC) 105
Human leukocyte antigens 104
Hypothesis 354, 362, 364, 367, 368, 380

I
Image processing 328
Immune Epitope Database (IEDB) 104, 106
Information circulation 291
Information exchange 314, 315, 319, 321
Information platform 290, 291
Information processing (IP) 313, 321
Information Technology 187, 193
Infrastructure as a Service (IaaS) 259
Interdisciplinary communication frequency 318
Interest-rate swap (IRS) 354, 364
Internet 172, 174, 184
Internet of things (IoT) 35
Interpreters 275

K
K-nearest neighbor classifier 218
Knowledge-based systems (KBSs) 106

L
Latent Dirichlet Allocation (LDA) 128

M
Machine learning 4, 5, 6, 8, 9, 10, 11, 13, 14, 15, 17, 20, 21, 22
Machine learning algorithms 88, 89, 93
Major histocompatibility complex (MHC) 104
market basket analysis (MBA) 265
Marketing 169, 173, 174, 175, 177, 183, 184, 185
Marketing information system (MIS) 171
Market Segmentation 267
Massachusetts All Payer Claim Data (MA APCD) 8
Massive information 203
Massively parallel processing database 203
Mass spectrometry (MS) 105
Medical knowledge 231, 233, 238, 239
Medical text data 231, 232, 233, 234, 235, 236, 237, 238, 239, 242, 243, 244, 245, 246, 247
Mental health 3, 4, 9
Mental health research 3
Merchandise planning 266
Missing at random (MAR) 26
Missing completely at random (MCAR) 26
Missingness mechanisms 26, 27, 28, 34
Model-Driven Engineering (MDE) 57
Multi-family housing complex (MFHC) 217, 219
Multiple imputation by chained equations (MICE) 27, 28
Multiple sequence alignment (MSA) 108

N
Naïve Bayes algorithm 216, 222, 227
Naïve Bayes classifier 218
Namenode node 275
National Health Service (NHS) 6
Natural language 231, 233, 234, 235, 236, 246
Natural language processing (NLP) 128, 129
Netnography 174, 184
Neural network 327
Neural network theory 327
Nonlocal mutual information 359

O
Online Social Network (OSN) 219
Operational databases 56, 57, 59, 67, 76, 77, 78, 79

P
Perceived information quality (PIQ) 316
Personalization 265
Personalized psychiatry 4
Petrophysical data 273, 274, 275, 276, 278, 279, 280, 282, 283
Platform as a Service (PaaS) 259
Prediction 261
Predictive analytics 89, 94, 159, 160, 161
Processing Elements (PEs) 196
Processing Nodes (PN) 196
Projection 295, 296, 297

R
Radar 328
Recurrent neural network (RNN) 235
Relational Database Management Systems (RDBMSs) 56, 59
Relational Model 59, 60, 61, 67, 82
Robotics automation processing (RPA) 89

S
Sales forecasting 266
Seismic prospecting 328
Semi-structured data 155
Sentiment analysis 126
Shannon information theory 355
Social Media Data Stream Sentiment Analysis Service (SMDSSAS) 125, 129, 131
Social network data 129
Social networks 175, 177, 178, 179, 181
Software as the Service (SaaS) 259
Solid information system 170
Sonar 328
Statistical data analysis 314
Stochastic variables 356
Support Vector Machine (SVM) 218, 219, 226
Surveillance 25

T
T-cell 104, 105, 106, 107, 108, 110, 111, 112, 113, 116
Twitter 126, 129, 130, 131, 132, 135, 142, 143, 148

U
Unstructured data 155, 161, 164

V
Velocity 191, 198
Vibration engineering 328
Volume 155, 156, 190, 191, 198

W
Whole-minus-sum (WMS) 359

Y
Yahoo 56
