Big Data Technology: Developments in Current Research and Emerging Landscape
Nitin Singh
To cite this article: Nitin Singh (2019): Big data technology: developments in current research and
emerging landscape, Enterprise Information Systems, DOI: 10.1080/17517575.2019.1612098
Introduction
There have been a number of recent developments in data extraction, storage, and
analysis technologies (IDC 2017), furthering the need of businesses to harvest data. Data
exist today in a variety of formats, sizes, and types (i.e. structured, semi-structured, and
unstructured). Thus, the application of data analytics is considerably more extensive
than in the past, and the software that uses quantitative methods for data analytics has
become more and more sophisticated. Data analytics is a quantitative discipline and has
been evolving since ‘Analytics 1.0’ first appeared in the mid-1950s (Davenport 2013). As
the Internet of Things (IoT), connected devices, sensors, and smart machines have
become more widespread, the ability of devices and machines to generate new types
of real-time information in the industry’s value stream has been growing. In fact,
organisations now expect increasingly competitive and convenient cloud-based data
options that have on-demand pricing and fit-for-purpose data processing options
(Gartner 2017). A typical resource in these functional areas would be expected to be
able to accommodate large volumes of data, while the enormity of data necessitates
clear strategies for storage and management and for deriving patterns and meaningful
insights. Thus, enterprises are considering big data technologies as an important part of
their information infrastructure.
Some studies have noted that investment in big data initiatives is substantial. According to
a recent International Data Corporation report, global sales for big data and business
analytics are expected to grow from USD 150.8 billion in 2017 to more than USD
210 billion in 2020 (IDC 2017). This translates to a compound annual growth rate of
11.7%. Regarding sectors, banking has emerged as the industry with the largest
investment in big data and business analytics solutions (approximately USD 17 billion in 2016).
In this context, it is no surprise that current research is interwoven around fundamental
questions about big data and how best to use it for business. The rapid growth in the
adoption of big data technologies has attracted research on its managerial, technologi-
cal, and societal ramifications (Mishra et al. 2017; Kalantari et al. 2017; Qazi and Sher
2016; Ghosh 2016; Phillips-Wren et al. 2015). Numerous reviews have demonstrated the
maturity and intellectual structure of the field. As such, the objective of this paper is to
crystallise the current research themes that have emerged from extant literature ana-
lyses and reviews. We believe that this research contributes to the current literature in
several ways. First, the study assesses the current state of research and extracts the
quantitative evidence on extant subjective or narrative big data reviews. Second, the
study employs a principal component analysis (PCA) and a bibliometric analysis to
capture the intellectual structure that emerged from the review. Third, studies of
this kind are rare, so additional work of this type is needed to help
others understand and build on the evolving base of big data
knowledge.
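The compound annual growth rate quoted above can be checked directly from the two IDC sales figures. The sketch below is only an illustrative check: the function name `cagr` is ours, and the three-year horizon (2017 to 2020) is read off the forecast window.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate, returned as a fraction (0.10 means 10%)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures from the IDC forecast cited in the text (USD billion).
rate = cagr(start_value=150.8, end_value=210.0, years=3)
print(f"{rate:.1%}")  # prints 11.7%
```

The result agrees with the 11.7% stated in the text to one decimal place.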
were pivotal and drove the research. The study by Kalantari et al. (2017) helped describe
the landscape of big data research and suggested avenues for future
research. It employed a multivariate regression approach and included variables such as
the numbers of pages, references, authors, and citations. Mishra et al. (2017) took
a similar approach, employing a bibliometric analysis to understand
the trends and challenges involved with big data and using citation and co-citation
analysis to assess papers published from 2011 to 2015 in 10 selected journals. Their
work was instrumental in highlighting the key literature.
Moreover, the study identified an increase in the number of big data papers published
during the study period. The studies mentioned above demonstrate the ongoing
research interest in identifying the trends and themes in the area of big data. This is
understandable, given the fact that this area is continuously evolving. Taking a range of
methodological perspectives into consideration is also important when cultivating
a comprehensive understanding of any research area. Our study contributes to these
efforts by exploring more recent articles, including those published between 2015 and 2017.
We conducted a narrative review of the recent literature and performed a quantitative
analysis using the previously described methods (i.e. PCA, co-citation analysis). We found
that PCA and co-citation analyses provide complementary findings, which helps
consolidate big data insights.
responsible use of big data technologies to provide some direction in the ethical use of
big data. Date (2016) discussed the boundary condition for a business decision of
moving data into Amazon storage either online or by physical shipment. Nair (2015)
opined that approximate, timely information might be preferable to precise, delayed
information when making business decisions.
Computers in Industry
Yang et al. (2015a) reviewed state-of-the-art health-care applications and suggested that
scalable technologies in cloud computing might provide cost-effective solutions in
healthcare. Babiceanu and Seker (2016) discussed the role of big data and analytics in
managing manufacturing operations. Further, they developed a framework for cyber-
physical systems which would facilitate data collection and algorithmic analysis of data.
Multi-Dimensional Scaling). These analyses were conducted to shed light on the intel-
lectual structure of IoT studies.
In another study, the authors demonstrated the huge scope for penetration of big
data and cyber-physical systems in Industry 4.0, since these systems help improve
resource efficiency and achieve personalisation (Xu and Duan 2018). Large volumes of
data created by cyber-physical systems can be handled by big data techniques and can
therefore improve system security, scalability, and efficiency. Quality as a Service (QaaS)
models for web services have also been studied using big data technologies. For
instance, Ahmad and Sarkar (2016) showed that Quality of Service (QoS) can be used
as an input, while the QaaS model provides an output for web services that matches
user expectations. It has been shown that evaluations of QoS
effectiveness can be conducted through server logs. After reviewing recent publications
in EIS – namely, ‘Big data for cyber-physical systems in Industry 4.0: A survey’ by Xu and
Duan (2018) and ‘QaaS model for web services using big data technologies’ by Ahmad and
Sarkar (2016) – we observed that these two papers focused on two different areas of
significant big data applications: cyber-physical systems and QoS. The present study is
similar to these papers in that it also considers applications of big data technology.
However, our manuscript is different in the sense that it builds on the research con-
ducted in other studies (including these two) and attempts to identify emerging
research issues and areas. Additionally, our manuscript does not examine a specific
application area such as cyber-physical systems or QoS.
Interfaces
Baughman et al. (2016) built a model to predict the volume of cloud computing
resources needed to sustain the IT needs of an organisation in the context of a live
sporting event. The benefits of the model were that it was more efficient than human
counterparts and was proactive in provisioning resources from the cloud, unlike human
decision makers, who tend to be reactive.
data context and discussed the associated opportunities and perils. The big data context
is heterogeneous, unstructured, haphazard, trans-semiotic, inductive, bottom-up, short-
horizon, and nowcasting in nature. This calls for cautious applications of big data when
devising strategy. Markus (2015) underlined the pros and cons of big data technology
with specific emphasis on data privacy. The standard notions of strategising in organisa-
tions seem to change with big data. Woerner and Wixom (2015) showed how big data
improve business models by (1) enabling the acquisition of new data, (2) providing new
insights, and (3) suggesting new actions. Big data also bring innovations to the business
model through data monetisation and digital transformation. Zuboff (2015) proposed an
alternative conceptualisation of big data as a form of surveillance capitalism, the
objective of which is to collect data about individuals and their habits for the purpose
of controlling and modifying behaviour for commerce. Bhimani (2015) commented on
the mechanism through which big data shapes strategy: by increasing the barriers to
entry; by redefining influence and organisational power; and by changing the relation-
ship between organisations and their stakeholders.
Gates et al. (2015) also suggested reordering the sequential system generation to simul-
taneously compute numerous systems. There were also studies that considered big data
analytics. For instance, one study used analytics to identify the caution spots regarding
accidents from the big data generated through vehicle recorders; the authors recom-
mended the use of a visual exploration system that enables the identification of various
types of caution spots (Itoh et al. 2015). There have also been discussions on issues related
to data quality in big data technology. In one study, the researchers proposed a scalable
approach to enhance the quality of big data by cleaning inconsistent data (Benbernou
and Ouziri 2017). Data access is a major bottleneck in big data processing. A study
conducted in this area demonstrated the use of PortHadoop in solving the
cross-platform data-access issue. Experimental results showed that PortHadoop was effective
and compatible with high-performance computing (Yang et al. 2015b).
Research methodology
The methodology we used primarily employs two different approaches: a PCA and
a bibliometric analysis (apart from keyword and subjective analyses). In the section on
interpretation, we reconciled the results of the co-citation and PCA, which together
showed the structure of the field. The objective of the PCA was to establish the under-
lying pattern in the keywords that frequently appear in the narrowed-down papers.
Next, we performed a bibliometric (citation and co-citation) analysis. Citation
analysis is a tool used to investigate the intellectual structure of a given field
(Garfield 1979). It can be used to identify seminal and influential papers and par-
ent–child relationships between source and derivative works, and is based on cita-
tions made or received by a paper (Wang et al. 2016). We performed citation analysis
to identify seminal works and key journals, after which we analysed co-citations,
which occur when two or more papers are cited together by another paper. The
higher the number of co-citations, the higher the likelihood that two papers are
semantically related to each other (Small 1973). The semantic relation that emerges
is usually strong because it reflects the opinions of a wide set of authors (Small
1973). Wang et al. (2016) used co-citation analysis to investigate the structure of the
cloud-computing domain within the IS field. Outside of IS, Pilkington and Meredith
(2009) employed citation and co-citation analysis to uncover the structure of the
operations management field.
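The two counts described above can be sketched in a few lines. This is a minimal illustration on invented reference lists, not the procedure or data used in the study: the paper labels (`P1`, `A`, and so on) and the function names are hypothetical. A citation count is the number of citing papers that reference a work, and a co-citation count is the number of citing papers that reference both works of a pair.

```python
from collections import Counter
from itertools import combinations


def citation_counts(reference_lists):
    """Number of citing papers that reference each cited work."""
    counts = Counter()
    for refs in reference_lists.values():
        counts.update(set(refs))  # count each cited work once per citing paper
    return counts


def cocitation_counts(reference_lists):
    """Number of citing papers that reference both works of each pair."""
    counts = Counter()
    for refs in reference_lists.values():
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: citing paper -> works in its reference list.
citing = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["B", "C"],
}

cited = citation_counts(citing)
cocited = cocitation_counts(citing)
print(cited.most_common(1))   # [('B', 3)]
print(cocited[("A", "B")])    # 2
```

In this toy example, works A and B are co-cited by two papers, so under the reasoning attributed to Small (1973) they are more likely to be related than A and C, which are co-cited only once.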
The steps we used to gather relevant data were as follows. We considered a set of
international peer-reviewed journals and leading conferences from both management
and technology streams within the IS domain. Gauging the real level of interest
also requires considering how attractive big data is as a research domain. Consequently, the
analysis also includes conferences, because work is disseminated more quickly
through them. We selectively scanned multiple prestigious conferences at which
relevant papers were recently presented, and identified journal and conference
articles using a keyword search with a publication window of 2015–2017. Articles
prior to 2015 were not considered, as the content in these articles would have
reflected information from one or two years prior to the publication year. Based on
our review of content, we shortlisted a total of 61 articles from journals and
conferences in the publication window 2015–2017. In this section, we organise the
review of the articles by journal and conferences. Table 1 provides year-wise pub-
lication counts across journals.
We did not classify the papers since we wanted to shortlist and identify papers by
citation before performing any analysis. Using these keywords, we were able to shortlist
65 papers and proceedings published in international peer-reviewed journals and con-
ferences from recent studies (01/2015–06/2018) on big data (ABDC, ERA, Qualis). We
selected only those papers that received a higher number of citations because highly
cited articles are high-value papers in their respective fields of study. It was important to
ensure that only influential articles were selected (Shiau 2016).
In the next step, the research papers’ metadata for the citation and co-citation
analysis were retrieved from Crossref, a leading worldwide database containing more
than 100 million registered content records (Crossref). Crossref also has over 7.9 million
records that contain Crossmark, more than 3.3 million records with funding
information, and more than 2 million records that have at least one ORCID ID. Crossref
also interlinks various reputable journals, books, and scientific databases, thus allowing
for research discovery along with citation indexing. It is trusted by approximately 11,629
scholarly member organisations worldwide (source: www.crossref.org).
Table 2. (Continued).
ID   Authors (Year)                     Source                       Times cited
36   Menon and Sarkar (2016)            MIS Quarterly                14
37   Metcalf (2016)                     Communications of the ACM    13
38   Mani, Shmueli, and Yahav (2015)    MIS Quarterly                12
rankings are listed as A, B, and C, with A being the best. Approximately 20% of the IEEE
and ACM conferences fall into A categories, while 75% fall into different B categories.
Because of these high rankings, we shortlisted the IEEE and ACM conferences in our
research. The papers from IEEE and ACM conferences were identified based on
a keyword match to big data.
Keyword analyses on textual information are often used to identify and shortlist papers
(Mishra et al. 2017; Li and Duan 2018). The application of a keyword analysis can use
different approaches: conventional, directed, or summative. These applications extract
quantifiable information from textual data, and each approach uses a specific coding
scheme, coding origin, and analytical procedure. Analysis of content involves converting
unstructured to structured content so that trends and patterns can be perceived within
the content. For example, studies have created metadata from content through extrac-
tion of entities and subject classifications. Keyword analysis is an established research
method in IS research (Palvia et al. 2003; Palvia, Pinjani, and Sibley 2007).
In this study, we used a large corpus of text (the journal papers) to evaluate big data
and analytics research trends. In conventional content analysis, coding categories are
derived directly from the text data. Thus, we used summative analysis of keywords in the
content followed by the interpretation of underlying context. This approach involved
quantifying the frequency of specific keywords in each paper to identify major concerns
connected to big data. We examined papers from journals and conferences in a 3.5-year
period (01/2015–06/2018), scanning entire papers to obtain keyword statistics. A coding
sheet was used to ensure standardisation and consistency in the process of keyword
counting and to guarantee that all relevant keywords were recorded. Only keywords
explicitly used in published papers and conferences were recorded.
The keywords were based on a review of issues related to research and practice that
seem to surface often in the literature. In the ‘Literature review’ section, we discussed
these issues according to the research agenda, methodologies, and findings. We
selected the following keywords based on this review: analytics, data quality, big data,
Hadoop, data privacy, visualisation, data mining, data preparation, data cleaning, data
storage, Cloudera, and Amazon. As shown in Figure 1, the issues discussed most often
in recently published articles are big data and analytics, followed by data
storage, data privacy, and data mining.
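The summative keyword counting described above can be sketched as follows. This is a minimal illustration, not the coding sheet used in the study: the sample text is invented, the function names are ours, and matching is assumed to be case-insensitive on whole words or phrases.

```python
import re
from collections import Counter

# Keyword list taken from the text above.
KEYWORDS = [
    "analytics", "data quality", "big data", "hadoop", "data privacy",
    "visualisation", "data mining", "data preparation", "data cleaning",
    "data storage", "cloudera", "amazon",
]


def keyword_counts(text, keywords=KEYWORDS):
    """Count whole-word, case-insensitive occurrences of each keyword in one paper."""
    lowered = text.lower()
    return {
        kw: len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered))
        for kw in keywords
    }


def corpus_counts(papers, keywords=KEYWORDS):
    """Sum the per-paper keyword counts over a collection of paper texts."""
    totals = Counter()
    for text in papers:
        totals.update(keyword_counts(text, keywords))
    return totals


# Invented sample text standing in for the full text of one paper.
sample = "Big data analytics runs on Hadoop. Big Data storage and data privacy matter."
counts = keyword_counts(sample)
print(counts["big data"], counts["analytics"], counts["data mining"])  # 2 1 0
```

Each keyword is matched separately, so a phrase such as 'Big Data storage' contributes to both 'big data' and 'data storage'. Spelling variants (for example 'visualization') would need their own entries in the list.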
Figure 1. Keyword frequency graph (axis: Frequency).
[Figure: keyword groups arranged around the central term ‘Big Data’. Where? Amazon (# papers: 21). Why? Analytics, visualization, data mining (# papers: 111). How? Cloudera, Hadoop (# papers: 28). What? Data cleaning, quality, storage (# papers: 54).]
Table 4. Communalities.
Variable            Initial   Extraction
Analytics           1.000     .601
Big data            1.000     .507
Visualisation       1.000     .517
Data mining         1.000     .691
Data preparation    1.000     .812
Data cleaning       1.000     .509
Storage             1.000     .543
Cloudera            1.000     .811
Amazon              1.000     .770
Extraction Method: Principal Component Analysis.
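For readers unfamiliar with communalities, the sketch below shows how the two columns of such a table arise in a PCA of the correlation matrix. It uses synthetic data, not the study's keyword counts, and the function names are ours. The communality of a variable is the sum of its squared loadings over the retained components: with all components retained it equals 1 (the 'Initial' column), and with only a few retained it is the share of that variable's variance they reproduce (the 'Extraction' column).

```python
import numpy as np


def pca_loadings(data, n_components):
    """Loadings of the leading principal components of the correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # reorder to descending
    eigvals = np.clip(eigvals[order], 0.0, None)   # guard against tiny negative values
    eigvecs = eigvecs[:, order]
    return eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])


def communalities(loadings):
    """Sum of squared loadings per variable."""
    return np.sum(loadings ** 2, axis=1)


# Synthetic stand-in: 60 'papers' by 9 'keyword counts'.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=3.0, size=(60, 9)).astype(float)

initial = communalities(pca_loadings(counts, n_components=9))
extraction = communalities(pca_loadings(counts, n_components=3))
print(np.round(initial, 3))     # every entry is 1.0
print(np.round(extraction, 3))  # values between 0 and 1
```

The choice of three retained components mirrors the three components discussed in the paper. The numeric values here come from random data and carry no meaning.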
presence of cross loading. The variable Amazon cross-loaded on factors 1 and 2. Data cleaning
cross-loaded on factors 1 and 2, and visualisation cross-loaded on factors 2 and 3.
To obtain a clearer picture, we conducted rotations in PCA. Rotation was necessary
because a few variables loaded almost equally on the second
component, making it difficult to differentiate and interpret the components.
The rule of thumb is
that a component must clearly load at least two variables. We found that the rotations did
not change the position of variables; only the coordinates of the variable vectors
changed. The rotated component loadings provided a differentiated
loading of components (Table 7). Other rotation methods, oblimin and promax, did not
reveal patterns as interpretable as the one produced by varimax. Thus, we pursued the
analysis with varimax rotation. Varimax maximises the variance of the squared loadings
within each component and tries to
associate each variable with at most one factor, thereby simplifying the analysis.
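A varimax rotation can be sketched with a widely used SVD-based iteration. This is an illustrative implementation under our own assumptions, not the routine of the statistical package used in the study, and the unrotated loading matrix is invented. Because the rotation matrix is orthogonal, each variable's communality is unchanged; only its coordinates on the components move.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation; returns (rotated_loadings, rotation_matrix)."""
    n_vars, n_comp = loadings.shape
    rotation = np.eye(n_comp)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target matrix of the varimax criterion.
        target = rotated ** 3 - rotated @ np.diag(np.sum(rotated ** 2, axis=0)) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                      # closest orthogonal matrix
        new_criterion = np.sum(s)
        if new_criterion < criterion * (1.0 + tol):
            break                              # no meaningful improvement
        criterion = new_criterion
    return loadings @ rotation, rotation


# Invented loadings: six variables that all load on both components.
unrotated = np.array([
    [0.70, 0.50],
    [0.65, 0.55],
    [0.60, 0.45],
    [0.60, -0.50],
    [0.55, -0.60],
    [0.65, -0.45],
])

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 2))
```

After rotation, the first three and the last three variables should load mainly on different components, which is the 'differentiated loading' that makes components easier to name.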
uses various technologies to move data, for example, from a Structured Query Language
(SQL) type environment into Hadoop. The volume characteristic of big data demands
the use of distributed technologies for storage and processing. Hadoop is a scalable
framework that allows distributed storage, processing, and resource management
in a distributed ecosystem. While data preparation and Hadoop treat data
management from an engineering perspective, big data and storage could be analogous
to data storage in a big data ecosystem. On analytics, whereas it is reasonable to expect
that factor 3, ‘intelligence’, loads on to it, we found that the factor ‘big data management’
had a higher loading on analytics than did ‘intelligence’. One plausible explanation
is that the scope of analytics is greater than visualisation or data mining. For example,
exploratory, descriptive, and inferential analytics are supported by big data technologies
such as Hive and Spark in the Hadoop ecosystem. Spark is native to big data ecosystems, and
it supports scalable analytics through its machine learning libraries. We infer that the
native technologies might have led to the assumption that analytics is more inherently
part of the ‘big data management’ component than of the ‘intelligence’ component.
The second component, ‘data services’, entails functional services related to data
cleaning and the service providers who offer such services. Service providers such as
Amazon offer a cloud-based infrastructure in which one may deploy a big data ecosystem.
Cloudera provides a management stack with packaged technologies (HBase, Hive,
Impala, Hadoop, Pig scripts, Solr, YARN, etc.) that can deploy and manage a distributed
storage and computing environment. The third construct, ‘intelligence’, is related to data
mining and visualisation. These activities make use of the data being managed in the big data
ecosystem and the data services offered by vendors to derive actionable insights. The
three constructs are functionally linked to each other in that ‘big data management’ and
‘data services’ are precursors for deriving any ‘intelligence’ in the big data ecosystem.
A citation analysis was performed to understand which themes are appearing often in
recent research. We retrieved metadata on individual papers from Crossref in the form of the
respective papers’ Digital Object Identifiers (DOI). Next, we activated API_KEY and then
obtained metadata on the existing Crossref DOIs. For that task, we used OpenURL, which
provides an XML representation of the metadata. The DOI files were then imported into the
VOSviewer tool for an analysis of the citations and co-citations. A citation analysis examines
the frequency, patterns, and graphs of citations in documents. It uses the pattern of citations
and links one document to another to reveal properties of the documents (Garfield 1972).
A typical aim would be to identify the most important documents in a collection. A co-
citation analysis, like bibliographic coupling, is a semantic similarity measure of documents
that makes use of citation relationships. It is the frequency with which two documents are
cited together by other documents (Small 1973).
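As a rough illustration of DOI-based metadata retrieval, the sketch below uses Crossref's public REST interface, which returns JSON, instead of the OpenURL/XML route described above; this substitution is ours. The field names (`message`, `is-referenced-by-count`, `reference`) are those we understand the REST API to return and should be checked against the Crossref documentation. The helper names and the sample record are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def works_url(doi):
    """Build the REST URL for a single DOI (slashes are kept as path separators)."""
    return CROSSREF_WORKS + quote(doi)


def summarise(record):
    """Extract the fields needed for citation and co-citation analysis."""
    message = record["message"]
    return {
        "doi": message.get("DOI"),
        "title": (message.get("title") or [None])[0],
        "times_cited": message.get("is-referenced-by-count", 0),
        # Only references deposited with a DOI can be linked to other works.
        "cited_dois": [ref["DOI"] for ref in message.get("reference", []) if "DOI" in ref],
    }


def fetch(doi):
    """Download and summarise one record (requires network access; not run here)."""
    with urlopen(works_url(doi)) as response:
        return summarise(json.load(response))


# Invented record in the assumed response shape, so the parsing can be tried offline.
sample_record = {
    "status": "ok",
    "message": {
        "DOI": "10.0000/example.1",
        "title": ["An example title"],
        "is-referenced-by-count": 14,
        "reference": [{"key": "r1", "DOI": "10.0000/example.2"}, {"key": "r2"}],
    },
}
summary = summarise(sample_record)
print(summary["times_cited"], summary["cited_dois"])  # 14 ['10.0000/example.2']
```

The `cited_dois` lists collected for a set of papers are exactly the reference lists needed to count which pairs of documents are cited together.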
We performed a citation and co-citation analysis as a complement to the literature
review to enhance clarity about ongoing research themes. There have been several
studies that have adopted citation and co-citation analyses to characterise and interpret
the structure as well as the dynamics of clusters. It has been found that this method
increases the interpretability of the literature (Garfield 1972). We applied a citation and
co-citation analysis with analytic and sense-making tasks by integrating network visua-
lisation, spectral clustering, automatic cluster labelling, and text summarisation.
Automatic cluster labelling and summarisation were used to augment the interpretation
of these clusters. This method focused on interconnections between authors and cita-
tion and co-citation cluster members. The software used for the analysis was VOSviewer,
a freely available computer program that was developed for constructing and viewing
bibliometric maps.
We pursued the analysis further to understand the pattern of co-citations, hoping to
obtain further insights into specific research themes. An interpretation was performed
based on the density views of the co-citation analysis. In the density view, each point on
the map has a colour that signifies the density of items (the co-citations) measured up to
that point. Thus, each point represents a theme that has been researched by various
papers. The colour range was between red and blue; the larger the number of the items
in the neighbourhood of a point and the higher the weights of those neighbouring
items, the closer that point’s colour is to red. Conversely, the smaller the number of
items around the point and the lower the weights of those neighbouring items, the
closer the point’s colour is to blue.
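The idea behind the density view can be sketched as a weighted kernel sum: every item contributes to the density at a map point in proportion to its weight, and the contribution decays with distance. The sketch below is a simplified illustration under our own assumptions (a Gaussian kernel, a fixed bandwidth, and a linear blue-to-red scale); it does not reproduce VOSviewer's exact kernel or colour scheme, and the item coordinates are invented.

```python
import math


def density(point, items, bandwidth=1.0):
    """Weighted Gaussian kernel density at `point`; items are (x, y, weight) tuples."""
    x, y = point
    total = 0.0
    for item_x, item_y, weight in items:
        dist_sq = (x - item_x) ** 2 + (y - item_y) ** 2
        total += weight * math.exp(-dist_sq / bandwidth ** 2)
    return total


def blue_to_red(value, max_value):
    """Map a density to an (R, G, B) colour: blue for low values, red for high."""
    share = 0.0 if max_value <= 0 else min(max(value / max_value, 0.0), 1.0)
    return (round(255 * share), 0, round(255 * (1.0 - share)))


# Invented map: two heavily weighted items close together and one light item far away.
items = [(0.0, 0.0, 5.0), (0.2, 0.1, 4.0), (3.0, 3.0, 1.0)]

dense = density((0.1, 0.0), items)
sparse = density((3.0, 3.0), items)
print(blue_to_red(dense, dense), blue_to_red(sparse, dense))
```

The point surrounded by many heavily weighted neighbours is coloured towards red and the isolated point towards blue, matching the reading of the density view described above.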
The density view revealed the structure of the citation and co-citation figures. An area
with dense interconnections indicates that authors associated with this area have
received a significant number of citations, whereas authors associated with less dense
areas have received fewer citations. In this case, Manyika et al. and Chen received the
most citations, followed by Yang et al. A clear separation is also evident between the
works by Desouza/Jacob and Baumann et al. on one end and those of Yang, Zhen,
Manyika et al., Chen, and others on the other end. Clusters have formed due to similarity
of research themes. In other words, different authors have pursued similar themes in
their respective papers and this has resulted in a specific cluster. Figures 3 and 4 show
that there are mainly two clusters into which the research themes tended to fall. One
cluster contains Desouza/Jacob and Baumann et al., with these authors focusing on
exploring the application of big data in a specific sector. In their paper, Desouza and
Jacob (2017) explored the limitations of big data applications in the public sector.
Likewise, Baumann et al. (2015) explored how big data applications used in the earth
sciences require different tools and techniques due to the need to process large
planetary observation datasets.
In the second cluster, there are three sub-clusters. The first sub-cluster contains Yang
et al. (2015a), who explored the potential benefit of big data applications in health care.
The same sub-cluster contains Chen and Zhang (2014), who discussed several meth-
odologies for managing data deluge such as granular computing, cloud computing, bio-
inspired computing, and quantum computing. In the second sub-cluster, there are four
authors who have completed survey-based studies on big data, mainly in the fields of
clinical data warehouses, emerging information technologies, and semantic information
retrieval. In the third sub-cluster, there is only one paper. This paper’s theme is different
from all others, as it focuses on a scalable software platform for the smart grid cyber-
physical system using cloud technologies (Simmhan et al. 2013).
The co-citation analysis showed Brynjolfsson et al. and Baesens as the most co-cited
authors. The links were also found to be strong, as indicated in Tables 8, 9, 10 and 11.
Analyses of Tables 9 and 11 led us to conclude that there was substantial co-citation in the
relatively short period of three years. It can also be inferred from Figure 3 that big data
research has appeared in most of the major journals that are part of this study. Furthermore,
analysis by journals (Figure 4) shows strong relationships across different journals and that
research on big data is ubiquitous across disciplines (e.g. accounting, finance, law, manage-
ment, nature, etc.).
We observe in Figure 4 that three prominent clusters emerged on the map. The
largest cluster contains public policy, economics, organisation science, and law, which
can be seen in red on the left portion of the map. The most frequently occurring terms
in this cluster include technology management, business policy, economics, law, opi-
nion, public policy, organisation, and several others. The most interesting aspect of this
cluster is its interdisciplinary nature. The red cluster bridges technology with economics,
organisational issues, law, and business policy, and it also contains many terms related
to basic science research. Prominent journals in this cluster include the Harvard Business
Review, Communications of the ACM, Harvard Law Review, American Economic Review,
Accounting and Business Research, and Organization Science. Interestingly, journals like
Science and Nature also appear in this cluster, though they are on the extreme side. This
indicates that many research issues in this cluster have origins in or are interlinked with
the basic sciences. This is relevant considering big data research has origins in comput-
ing science. This cluster is the most widely dispersed, with terms scattered among
economics, law, computer science, and public policy. Terms that are more often asso-
ciated with the basic sciences (from journals like Science and Nature) are found on the
extreme side of this cluster. In this case, terms such as software development, program-
ming, coding, and project management intermingle with other terms in the cross-
section with the second (green) cluster.
The second cluster (green) is on the right portion of the map. It is the next largest
cluster and includes terms related to the scholarship of technology and information
sciences. The most frequently occurring journals in this cluster are MIS Quarterly, Journal
of Information Technology, The Information Society, and Strategic Management Journal,
among others. The most frequently occurring terms include information technology,
technology strategy, big data information literacy, database management systems, and
analytics. It is interesting to note the overlap between this cluster and the management
science cluster (blue). The term ‘information management’ spans the boundary between
these two clusters.
Of the three clusters, the smallest is related to management science (blue)
and is spread across the upper portion of the map. Journals like MIT Sloan Management
Review, Academy of Management Review, and Accounting, Organizations and Society are
represented in this cluster. The most frequently occurring terms here relate to management
paradigms, administration, and people. The papers belonging to this cluster frequently
mention terms related to management of technology, information technology adoption,
employee engagement, return on investment, and business value. These terms span the
cross-section of information technology and management science.
The most interesting feature of the information technology cluster (green) is where it
intersects with other clusters. There is significant overlap between big data and other
areas, as information technology is an interdisciplinary field. At the intersection of the
information technology cluster with the management science cluster (blue), we find
terms associated with management research, such as information technology develop-
ment, adoption, and management. Additionally, terms such as data management,
compliance, data integration, employee engagement, and return on investment are
found on the edges of the green, blue, and red clusters. These terms are associated
with technology and business-related research. We also observe that, in all three
clusters, there are several terms that indicate considerable use of social media and
surveys as data collection methods.
We observed clearly articulated research themes emanating from the literature. One
theme was related to the engineering side of big data. Within this theme, we observed
sub-themes like exascale computing, Apache Hadoop, and unified engines for big data
processing. Parallel to the engineering theme, another research theme centred on
information management. In this context, surveillance capitalism, information civilisation,
data management, analytics and the adoption of big data were key research sub-topics.
Figure 3 also showed that research themes permeated across business sectors
(e.g. accounting, healthcare, media, urban planning, corporate finance, etc.). Evidently,
researchers are not only addressing challenges within engineering and information
management but also investigating the implications of these challenges across
business sectors. We also observed a growing research interest in machine learning and
predictive analytics, as highlighted in seminal studies (Zuboff 2015; Reed and Dongarra
2015; Yang et al. 2015a). We expect an increasing research focus on predictive analytics,
high-performance computing, and machine learning, as these are key research sub-
themes. The papers in these sub-themes highlight the analytical and computing chal-
lenges faced by businesses.
Discussion
Contributions to theory
The current manuscript contributes to the literature on big data and extends research
papers and reviews in the area of big data (Zaharia et al. 2016; Reed and Dongarra 2015;
Davenport 2013; Constantiou and Kallinikos 2015; Babiceanu and Seker 2016; Yang et al.
2015a, 2015b; Baesens et al. 2014; Metcalf and Crawford 2016; Phillips-Wren et al. 2015).
It does this in the following ways. First, it adds to a systematic literature review of big
data by proposing and applying PCA and citation techniques (in addition to a
co-citation analysis) to compare different studies. Second, through PCA, our analysis
identifies three themes or components in big data research. The first theme captures
a component that could be termed ‘big data management’. Likewise, the second and
third components relate to the constructs ‘data services’ and ‘intelligence’, respectively.
The construct ‘big data management’ describes the management and execution of
engineering activities and the deployment of technologies for the same. The second
component, ‘data services’, relates to data cleaning and service providers who offer such
services. The third component relates to services for analytics. Third, the analysis also
illustrates the relationships between the components and argues that a better concep-
tualisation and use of techniques will result in better applications of big data. Therefore,
future research should consider these components.
Fourth, through citation and co-citation, our study found a difference between the
works of Desouza/Jacob and Baumann et al. on one end and the research of Yang, Zhen,
Manyika et al., Chen, and others on the other end (Desouza and Jacob 2017; Baumann
et al. 2015; Yang et al. 2015a; Manyika, Chui, and Brown et al. 2011; Chen and Zhang
2014). Clusters formed because of similarities or differences between research themes of
authors in their respective papers. Fifth, through citation and co-citation analysis, two
main clusters were identified with separate research themes. Analysing the journals
showed a strong relationship across different journals and also indicated that research
on big data is ubiquitous across disciplines (e.g. accounting, finance, law, management
of nature, management, etc.).
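
The co-citation measure underlying this analysis (Small 1973) can be illustrated with a minimal sketch. It assumes that each citing paper is available as a list of the works it references; the paper identifiers and the reference labels A to D below are hypothetical placeholders, not data from this study. Two works are counted as co-cited once for every paper whose reference list contains both.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count, for each pair of cited works, how many papers cite both."""
    counts = Counter()
    for refs in reference_lists.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical citing papers and their reference lists (illustration only).
reference_lists = {
    "paper_1": ["A", "B", "C"],
    "paper_2": ["A", "B"],
    "paper_3": ["B", "C", "D"],
    "paper_4": ["A", "B", "D"],
}

counts = cocitation_counts(reference_lists)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are candidates for the same cluster; the clustering step itself is not shown here and would depend on the bibliometric tool used.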
Big data have drawn significant attention from researchers, and the research is still in
a growth phase. Therefore, there is a need to continue the type of research exemplified
by the current study. We also observed that machine learning and predictive analytics
are being increasingly discussed, as demonstrated by seminal studies (Zuboff 2015; Reed
and Dongarra 2015; Yang et al. 2015a). For future research, it would be useful to adopt
other methods for this type of analysis and to observe the results.
use various technologies to move data, for example, from a SQL type environment into
Hadoop (or streaming data into Hadoop).
Another business issue relates to relative focus. The question arises – should the
relative focus on any of these components be a function of a company’s market niche
and its competitive focus? On the analytics side, it would be reasonable to expect
factor 3 (‘intelligence’) to carry the highest loading. However, the findings suggest that
the factor ‘big data management’ had a higher loading on analytics than did
‘intelligence’.
These findings shed light on the two other business pain points that companies
face – which type of big data services do companies need? Do these services fall more
on the data management side or on the analytics and visualisation side? One plausible
explanation is that the scope of analytics is greater than visualisation or data mining. For
example, exploratory, descriptive, and inferential analytics are supported by big data
ecosystem technologies. The second component, ‘data services’, entails functional ser-
vices related to data cleaning. Companies with limited data readiness or companies that
buy data from third-party service providers would need big data services. Once they
extend beyond this stage, such companies would be more effective in using ‘intelli-
gence’ and thereby analytics. The three components are functionally linked to each
other in that ‘big data management’ and ‘data services’ are precursors for deriving any
‘intelligence’ in the big data ecosystem.
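
The notion of a component ‘loading’ used in this discussion can be made concrete with a small sketch. It is not the procedure used in this study: it assumes a correlation-matrix PCA, extracts only the first component by power iteration, and runs on a made-up table of scores whose three columns are hypothetical variables. A loading is computed as the eigenvector entry scaled by the square root of the eigenvalue, which, for a correlation-matrix PCA, equals the correlation between the variable and the component.

```python
import math


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of `rows`."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(k)
    ]
    z = [[(r[j] - means[j]) / stds[j] for j in range(k)] for r in rows]
    return [
        [sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(k)]
        for a in range(k)
    ]


def first_component(corr, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of `corr`."""
    k = len(corr)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged vector gives the eigenvalue.
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)
    )
    return eigenvalue, v


# Hypothetical scores: one row per observation, one column per variable.
rows = [
    [2.0, 1.0, 5.0],
    [4.0, 3.0, 4.0],
    [6.0, 5.0, 6.0],
    [8.0, 9.0, 3.0],
    [10.0, 8.0, 5.0],
]

eigenvalue, eigenvector = first_component(correlation_matrix(rows))
component_loadings = [math.sqrt(eigenvalue) * x for x in eigenvector]
print(eigenvalue, component_loadings)
```

Comparing the loadings of one variable across components, as done above for analytics, shows which component that variable is most strongly associated with.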
a paradigm shift, as evidenced from the high-quality research that has recently been
conducted. As civilisation moves into the future, this research will have a significant impact
on our worldly knowledge.
Conclusions
In this study, we presented a review of the recent big data research. We analysed the
extant research in three ways – a literature review, a PCA, and a bibliometric analysis – and
discussed the components that emerged from the PCA. The findings show that extant
research is centred on how big data improves the businesses that acquire it, manage it,
and derive new insights from it through data analytics. The PCA identified three major
components (or themes) – ‘big data management’, ‘data services’, and ‘intelligence’ –
which each capture qualities that can be classified as specific constructs. The three
components are functionally linked to each other in that ‘big data management’ and
‘data services’ are precursors to ‘intelligence’ in the big data ecosystem. We probed the
issue further by investigating the interconnectedness of papers to identify the common
research themes emerging from the bibliographic networks. The citation and co-citation
analyses showed that big data research
has been strongly influenced by themes in engineering and information management. It
was also found that the research themes are spread across various business sectors.
Interestingly, machine learning and predictive analytics have been increasingly dis-
cussed as analytical tools to harvest data. The bibliometric analysis demonstrated that
predictive accuracy, robust data analytics, and high-performance computing were also
key research sub-themes. We invite future research on this topic to further develop an
understanding of big data.
Acknowledgments
The author is grateful to the Editor-in-Chief and anonymous referees whose valuable comments
and suggestions substantially helped improve this article.
Disclosure statement
No potential conflict of interest was reported by the author.
ORCID
Nitin Singh http://orcid.org/0000-0002-9003-3310
References
Ahmad, F., and A. Sarkar. 2016. “QaaS (Quality as a Service) Model for Web Services Using Big Data
Technologies.” Enterprise Information Systems 11 (9): 1352–1373.
Babiceanu, R. F., and R. Seker. 2016. “Big Data and Virtualization for Manufacturing Cyber-Physical
Systems: A Survey of the Current Status and Future Outlook.” Computers in Industry 81: 128–137.
doi:10.1016/j.compind.2016.02.004.
Back, B.-H., and H. Il-Kyu. (2017). “A Platform for Supporting Automatic Data Storing and
Visualization of Public and Private Big Data.” ACM International Conference on Big Data
Research. Osaka, Japan. 12–17. doi: 10.2460/ajvr.78.1.12.
Baesens, B., R. Bapna, J. R. Marsden, J. Vanthienen, and J. L. Zhao. 2014. “Transformational Issues of
Big Data and Analytics in Networked Business.” MIS Quarterly 38 (2): 629–631.
Baesens, B., S. De Winne, and L. Sels. 2017. “Is Your Company Ready for HR Analytics?” MIT Sloan
Management Review 58 (2): 20.
Baughman, A. K., B. Richard, B. Harrison, B. O’Connell, H. Pearthree, F. Brandon, C. McAvoy, S. Sun,
and C. Upton. 2016. “IBM Predicts Cloud Computing Demand for Sports Tournaments.”
Interfaces 46 (1): 33–48. doi:10.1287/inte.2015.0820.
Baumann, P. (2017). “Standardizing Big Earth Datacubes”, IEEE International Conference on Big
Data. Boston, USA. 67–73.
Baumann P., Mazzetti P., Ungar J., Barbera R., Barboni B., Beccati A., Bigagli L., et al. 2015. “Big Data
Analytics for Earth Sciences: The EarthServer Approach.” International Journal of Digital Earth 9
(1): 3–29. doi:10.1080/17538947.2014.1003106.
Benbernou, S., and M. Ouziri (2017). “Enhancing Data Quality by Cleaning Inconsistent Big RDF
Data.” IEEE International Conference on Big Data, 74–79. doi: 10.1186/s12912-017-0268-5.
Bhimani, A. 2015. “Exploring Big Data’s Strategic Consequences.” Journal of Information Technology
30 (1): 66–69. doi:10.1057/jit.2014.29.
Bichler, M., A. Heinzl, and W. M. van der Aalst. 2017. “Business Analytics and Data Science: Once
Again?” Business & Information Systems Engineering 59 (2): 77–79. doi:10.1007/s12599-016-0461-1.
Brynjolfsson, E., T. Geva, and S. Reichman. 2015. “Crowd-Squared: Amplifying the Predictive Power
of Search Trend Data.” MIS Quarterly 40 (4): 941–961. doi:10.25300/MISQ/2016/40.4.07.
CACM staff. 2017. “Big Data.” Communications of the ACM 60 (6): 24–25. doi:10.1145/3079064.
Chai, S., and W. Shih. 2017. “Why Big Data Isn’t Enough.” MIT Sloan Management Review 58 (2): 57.
Chen, P., and C. Zhang. 2014. “Data-Intensive Applications, Challenges, Techniques and
Technologies: A Survey on Big Data.” Information Sciences 275: 314–347. doi:10.1016/j.
ins.2014.01.015.
Chun Kit, N. G., W. Chun Ho, Y. Kai Leung, I. Wai Hung, and T. Cheung. 2018. “A Semantic Similarity
Analysis of Internet of Things.” Enterprise Information Systems 12 (7): 820–855. doi:10.1080/
17517575.2018.1464666.
Constantiou, I. D., and J. Kallinikos. 2015. “New Games, New Rules: Big Data and the Changing
Context of Strategy.” Journal of Information Technology 30 (1): 44–57. doi:10.1057/jit.2014.17.
Cross, R. 2015. Principal Component Analysis Handbook. Clanrye International.
Crossref at https://www.crossref.org; Accessed on Dec 2018
Da Xu, L., and L. Duan. 2018. “Big Data for Cyber Physical Systems in Industry 4.0: A Survey.”
Enterprise Information Systems 13 (2): 148–169.
Date, S. 2016. “Should You Upload or Ship Big Data to the Cloud?” Communications of the ACM 59
(7): 44–51. doi:10.1145/2963119.
Davenport, T. H. 2013. “Analytics 3.0.” Harvard Business Review. December.
de Almeida, D. C. P., and J. Bernardino (2015). “Big Data Open Source Platforms.” IEEE International
Congress on Big Data, New York, USA. 268–275.
Demirkan, H., C. Bess, J. Spohrer, A. Rayes, D. Allen, and Y. Moghaddam. 2015. “Innovations with
Smart Service Systems: Analytics, Big Data, Cognitive Assistance, and the Internet of
Everything.” Communications of the AIS 37: 35.
Desouza, K., and B. Jacob. 2017. “Big Data in the Public Sector: Lessons for Practitioners and
Scholars.” Administration & Society 49 (7): 1043–1064. doi:10.1177/0095399714555751.
Elshater, Y., P. Martin, D. Rope, M. McRoberts, and C. Statchuk (2015). “A Study of Data Locality in
YARN.” IEEE International Congress on Big Data, New York, USA. 174–181.
Emmanuel, I., and C. Stanier (2016). “Defining Big Data.” International Conference on Big Data and
Advanced Wireless Technologies, Blagoevgrad, Bulgaria. Article No.: 5.
ERA, portal.core.edu.au/conf-ranks/; Accessed on Dec 2018
Fitzgerald, M. 2015. Enhancing Intuition with Analytics at General Mills. Massachusetts Institute of
Technology: MIT Sloan Management Review.
Garfield, E. 1972. “Citation Analysis as a Tool in Journal Evaluation.” Science 178 (4060): 471–479.
Garfield, E. 1979. “Is Citation Analysis a Legitimate Evaluation Tool?” Scientometrics 1 (4): 359–375.
doi:10.1007/BF02019306.
Gartner. (2017). “Hype Cycle for Data Management.” www.Gartner.com
Gates, M., H. Anzt, J. Kurzak, and J. Dongarra 2015. “Accelerating Collaborative Filtering Using
Concepts from High Performance Computing.” IEEE International Conference on Big Data, Santa
Clara, CA.
Ghose, A., and V. Todri. 2015. “Towards a Digital Attribution Model: Measuring the Impact of
Display Advertising on Online Consumer Behavior.” MIS Quarterly 40 (4): 889–910. doi:10.25300/
MISQ/2016/40.4.05.
Ghosh, J. 2016. “Big Data Analytics: A Field of Opportunities for Information Systems and
Technology Researchers.” Journal of Global Information Technology Management 19 (4):
217–222. doi:10.1080/1097198X.2016.1249667.
Gupta, B., M. Goul, and B. Dinter. 2015. “Business Intelligence and Big Data in Higher Education:
Status of a Multi-Year Model Curriculum Development Effort for Business School
Undergraduates, MS Graduates, and MBAs.” Communications of the AIS 36: 23.
Holden, G. 2016. “Big Data and R&D Management.” Research-Technology Management 59 (5):
22–26. doi:10.1080/08956308.2016.1208044.
Holmes, A. 2014. Hadoop in Practice. 2nd ed. New Delhi: Dreamtech Press.
Hong, J., L. Li, C. Han, B. Jin, Q. Yang, and Z. Yang (2016). “Optimizing Hadoop Framework for Solid
State Drives.” IEEE International Congress on Big Data, New York, USA. 9–17. doi: 10.1167/
tvst.5.6.9.
IDC, (2017). “Double-Digit Growth Forecast for the Worldwide Big Data and Business Analytics
Market through 2020 Led by Banking and Manufacturing Investments.” https://www.idc.com/
getdoc.jsp?containerId=prUS41826116
Itoh, M., D. Yokoyama, M. Toyoda, and M. Kitsuregawa (2015). “Visual Interface for Exploring
Caution Spots from Vehicle Recorder Big Data.” IEEE International Conference on Big Data.
Jolliffe, I. T. 2002. Principal Component Analysis. New York, NY: Springer.
Kaiser, H. F. 1974. “An Index of Factorial Simplicity.” Psychometrika 39: 31–36. doi:10.1007/
BF02291575.
Palvia, P., P. Pinjani, and E. H. Sibley. 2007. “A Profile of Information Systems Research Published in
Information & Management.” Information & Management 44 (1): 1–11. doi:10.1016/j.
im.2006.10.002.
Pham, C. (2016). “Internet-of-Thing and Reasons Why It Is Becoming a Reality.” International
Conference on Big Data and Advanced Wireless Technologies. Blagoevgrad, Bulgaria. Article No.: 1.
Phillips-Wren, G. E., L. S. Iyer, U. R. Kulkarni, and T. Ariyachandra. 2015. “Business Analytics in the
Context of Big Data: A Roadmap for Research.” Communications of the AIS 37 (23): 448–472.
Pilkington, A., and J. Meredith. 2009. “The Evolution of the Intellectual Structure of Operations
Management - 1980-2006; a Citation/Co-Citation Analysis.” Journal of Operations Management
27: 185–202. doi:10.1016/j.jom.2008.08.001.
Qazi, R. U. R., and A. Sher. 2016. “Big Data Applications in Businesses: An Overview.” The
International Technology Management Review 6 (2): 50–63. doi:10.2991/itmr.2016.6.2.3.
Ransbotham, S., D. Kiron, and P. K. Prentice. 2015. “The Talent Dividend.” MIT Sloan Management
Review 56 (4): 1.
Reed, D. A., and J. Dongarra. 2015. “Exascale Computing and Big Data.” Communications of the
ACM 58 (7): 56–68. doi:10.1145/2797100.
Saboo, A. R., V. Kumar, and I. Park. 2016. “Using Big Data to Model Time-Varying Effects for Marketing
Resource (Re) Allocation.” MIS Quarterly 40 (4): 911–939. doi:10.25300/MISQ/2016/40.4.06.
Sahay, S. 2016. “Big Data and Public Health: Challenges and Opportunities for Low and Middle
Income Countries.” Communications of the AIS 39: 20.
Seref, B., and E. Bostanci (2016). “Opportunities, Threats and Future Directions in Big Data for
Medical Wearables.” International Conference on Big Data and Advanced Wireless Technologies,
Blagoevgrad, Bulgaria. Article No.: 15.
Shiau, W.-L. 2016. “The Intellectual Core of Enterprise Information Systems: A Co-Citation Analysis.”
Enterprise Information Systems 10 (8): 815–844. doi:10.1080/17517575.2015.1019570.
Shim, J. P., J. Koh, S. Fister, and H. Y. Seo. 2016. “Phonetic Analytics Technology and Big Data:
Real-World Cases.” Communications of the ACM 59 (2): 84–90. doi:10.1145/2886013.
Simmhan, Y., S. Aman, A. Alok Kumbhare, and L. Rongyang. 2013. “Cloud-Based Software Platform
for Big Data Analytics in Smart Grids.” Computing in Science & Engineering 15 (4): 38–47.
doi:10.1109/MCSE.2013.39.
Small, H. 1973. “Co-Citation in the Scientific Literature: A New Measure of the Relationship
between Two Documents.” Journal of the American Society for Information Science 24 (4):
265–269. doi:10.1002/(ISSN)1097-4571.
Turel, O., and B. Kapoor. 2016. “A Business Analytics Maturity Perspective on the Gap between
Business Schools and Presumed Industry Needs.” Communications of the AIS 39: 6.
Villanustre, F. (2015). “Industrial Big Data Analytics: Lessons from the Trenches.” IEEE/ACM 1st
International Workshop on Big Data Software Engineering, Florence, Italy, 1–3.
Wang, N., H. Liang, Y. Jia, S. Ge, Y. Xue, and Z. Wang. 2016. “Cloud Computing Research in the IS
Discipline: A Citation/Co-Citation Analysis.” Decision Support Systems 86 (C): 35–47. doi:10.1016/j.
dss.2016.03.006.
Winig, L. 2017. “A Data-Driven Approach to Customer Relationships: A Case Study of Nedbank’s
Data Practices in South Africa.” MIT Sloan Management Review 58 (2).
Woerner, S. L., and B. H. Wixom. 2015. “Big Data: Extending the Business Strategy Toolbox.” Journal
of Information Technology 30 (1): 60–62. doi:10.1057/jit.2014.31.
Yan, Z. (2017 Oct). “A Method of Related Parameters Combinatorial Optimization of Large Data
Platform Based on MapReduce.” ACM International Conference on Big Data Research. Osaka,
Japan. 18–25.
Yang, J.-J., J. Li, J. Mulder, Y. Wang, S. Chen, H. Wu, Q. Wang, and H. Pan. 2015a. “Emerging
Information Technologies for Enhanced Healthcare.” Computers in Industry 69: 3–11.
doi:10.1016/j.compind.2015.01.012.
Yang, X., N. Liu, B. Feng, X.-H. Sun, and S. Zhou (2015b). “PortHadoop: Support Direct HPC Data
Processing in Hadoop.” IEEE Conference on Big Data, Santa Clara, CA.
Yoo, Y. 2015. “It Is Not about Size: A Further Thought on Big Data.” Journal of Information
Technology 30 (1): 63–65. doi:10.1057/jit.2014.30.
Zaharia, M., R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, et al. 2016. “Apache Spark:
A Unified Engine for Big Data Processing.” Communications of the ACM 59 (11): 56–65.
doi:10.1145/2934664.
Zuboff, S. 2015. “Big Other: Surveillance Capitalism and the Prospects of an Information
Civilization.” Journal of Information Technology 30 (1): 75–89. doi:10.1057/jit.2015.5.