Abstract
This paper presents a scientometric analysis of research on the emerging area of ‘Big Data’ in
recent years. Research on ‘Big Data’ started only a few years ago and has gained tremendous
momentum within a short span of time. It is now considered one of the most
important emerging areas of research in computational sciences and related disciplines. We have
analyzed the research output on ‘Big Data’ during 2010-2014 indexed in both the Web of
Knowledge and Scopus. The analysis comprehensively maps the total output,
growth of output, authorship and country-level collaboration patterns, major contributors
(countries, institutions and individuals), top publication sources, thematic trends and emerging
topics in the field. The paper presents an elaborate, one-of-its-kind scientometric mapping of
research on ‘Big Data’.
Keywords
Big Data, Big Data Analytics, Informetrics, Scientometrics.
1. Introduction
The new World Wide Web and the rapid growth of E-Systems are producing huge amounts of
structured and unstructured data. The ‘volume’ and ‘velocity’ of this data generation are so large
that traditional database and information systems technologies fail to manage and process the
data appropriately. Whether it is user-generated data on the World Wide Web (for example,
500 million tweets are generated every day on Twitter) or data produced in commercial or
customer interaction transactions, the volume and nature of the data call for new methods,
techniques and approaches to process it. As a result, new research has started during the last few
years and is now progressing at a very fast pace, investigating approaches and
technologies to manage ‘Big Data’. Research on ‘Big Data’ is now attracting attention from
academia, industry and even governments around the world. It is in this context that we have
performed a detailed scientometric analysis of the research output (publications) on ‘Big Data’
for a comprehensive and analytical mapping. The research output on ‘Big Data’ during 2010-2014,
indexed in both Web of Knowledge^1 and Scopus^2, is obtained and analyzed. The five-year
period (2010-2014) is selected because research on ‘Big Data’ started very
recently and gained momentum only during the last few years. We have used the standard
scientometric methodology as well as text-analytics based approaches for the mapping
exercise. The analysis yields a comprehensive account of research on ‘Big Data’
by illustrating all major aspects, including year-wise research output and growth rate, authorship
and collaboration patterns, major contributors (countries, institutions, individuals), main
1 https://apps.webofknowledge.com
2 http://www.scopus.com
The rest of the paper is organized as follows. Section 2 presents a brief overview of the
importance of ‘Big Data’ research and some related work on scientometric analysis of different
narrow research themes that helped us in formulating the research plan. Section 3 describes the
data collection and methodology used. Section 4 describes the analytical outcomes on
quantification and growth of research output, and Section 5 presents the analytical outcomes on
authorship and collaboration patterns. Section 6 illustrates the major contributors (countries,
institutions, individuals and publication sources). Section 7 describes the main disciplines related
to ‘Big Data’ research. The paper concludes, in Section 8, with a short summary and
The term ‘Big Data’ is now well known and denotes a very important area of research. It has
become so important during the last few years that Nature and Science have published special issues
dedicated to discussing the opportunities and challenges brought by ‘Big Data’ (Nature 455, 2008;
Science 331, 2011). Compared to traditional data, ‘Big Data’ is characterized by the
5Vs, namely huge Volume, high Velocity, high Variety, low Veracity and high Value (Jin et al.,
2015). Nowadays large and complex data sets are being collected for diverse reasons through all
kinds of technologies, including mobile devices, remote sensing, software logs, wireless sensor
networks, social media etc. The characteristics of this ‘Big Data’ are such that we need new
theories, novel methods and the right analytics tools to help scientists and business leaders make
sense of the volume of data. More precisely, what we need is research towards effective ways of
turning ‘Big Data’ into ‘Big insights’. McKinsey, the well-known management consulting
firm, states that ‘Big Data’ has penetrated into every area of today’s industry and business
functions (Manyika et al., 2011). ‘Big Data’ techniques and data science also now
heavily influence how research is conducted across various domains including economics,
business, finance, biological sciences, health care, social sciences and the humanities (Wu &
Chin, 2014). Nations across the world have realized the potential of ‘Big Data’ research and
instituted special national programs and initiatives on it. It is in this context
that we tried to measure and map the scientific research on ‘Big Data’ using standard
There exists plenty of research on scientometric mapping of work in a particular
discipline or a narrow research theme. Though we could not find any previous work that aims to
perform a detailed and systematic scientometric mapping on the theme of ‘Big Data’,
previous works on different disciplines and narrow themes helped us in formulating
our research plan. We found primarily three directions of scientometric mapping in previous
works: (a) scientometric mapping of a particular subject discipline (say Computer Science), with
or without focus on a particular country/region (for example Gupta et al., 2011; Kumar and
Garg, 2005; Singhal et al., 2014; Uddin and Singh, 2014; Ma et al., 2008); (b) scientometric
mapping of research in a narrow research theme (say Nanotechnology), with or without focus on
a particular country/region (for example Karpagam et al., 2012; Karpagam et al., 2011; Finardi,
2011; Onel et al., 2011; Liesch et al., 2011; Cocosila et al., 2011; Jarić et al., 2012); and (c) a
comparative study of the research competitiveness of institutions/countries in one or more subject
disciplines (for example Singh et al., 2015). The only related research works we could find on
‘Big Data’ are those that discuss open questions and future prospects of ‘Big Data’. For
example, Boyd and Crawford (2012) tried to point out some crucial questions and answers related
to ‘Big Data’ research. Howe et al. (2008) described the future aspects and growth
of the big data research theme, along with a discussion of a vision for it.
Park & Leydesdorff (2013) examined the social and semantic networks that emerge in ‘Big
Data’ research. Ekbia et al. (2014) performed a critical review of ‘Big Data’ by trying to conceptualize it
and illustrating the dilemmas, and Jagadish (2015) described myths and realities around ‘Big
Data’. To the best of our knowledge, this is the first work which performs a systematic and
We have collected research output data on the ‘Big Data’ theme from both Web of Knowledge
(WoK) and Scopus for the last 5 years, i.e., 2010 to 2014. In WoK, we found a total of
1,415 records as a result of the search query [TS = (BIGDATA OR "BIG DATA")
comprises records of the type article, book review, review, meeting abstract, proceedings
paper, note, editorial material, letter etc. Each record in the WoK data contains 60 fields of
metadata about the record, such as paper title (TI), author address (C1), citation references (Z9)
etc. In Scopus, we found a total of 6,810 records as a result of the search query [TITLE-ABS-KEY
article, conference review, review, article in press, editorial, short survey, note, book chapter,
letter, book, erratum etc. In Scopus, each record consists of 41 fields describing different
attributes such as abstract, paper title, author with affiliations etc. We have used the information
We have followed the standard scientometric methodology to compute various parameters like
Relative Growth Rate (RGR), Doubling Time (DT), Collaboration Index (CI), Collaborative
Coefficient (CC), International Collaborative Papers (ICP), G-index, HG-index, P-index etc. We
have also identified authorship patterns, top journals publishing research on ‘Big Data’, and the most
productive institutions and authors in ‘Big Data’ research. Further, we extracted co-authorship cliques
for the three most productive authors and also characterized top authors on a TP-TC plot
(total papers against total citations).
Secondly, we used a text-analytics based approach to identify major disciplines in which ‘Big
Data’ research has been done. A frequency-based analysis helped us in identifying the main author
keywords in research output on ‘Big Data’, out of which we selected some important
high-frequency terms (such as Hadoop, map reduce etc.) as control terms. The year-wise research
output pattern on all the control terms is plotted. A topic density plot for the selected control
terms is also drawn to visualize the major research topics in the area. We have also used author
keyword information to identify important new terms appearing as author keywords in research
We began our scientometric analysis with a year-wise summary of research papers produced on
‘Big Data’ as obtained from WoK and Scopus indices. Thereafter, we computed two useful and
informative parameters, namely the Relative Growth Rate (RGR) and Doubling Time (DT),
during the period 2010-2014. The RGR represents growth in research output and is computed as
follows:
$$RGR = \frac{\ln CN_2 - \ln CN_1}{T_2 - T_1}$$
where $CN_2$ and $CN_1$ are the cumulative numbers of publications in the years $T_2$ and $T_1$. Since we
have computed RGR year-wise, the time difference in our case is 1 year. The expression thus
reduces to:
$$RGR = \ln(CN_2 / CN_1)$$
The parameter Doubling Time (DT) is directly related to RGR and indicates the time required for
the cumulative output to double:
$$DT = \frac{(T_2 - T_1) \ln 2}{\ln CN_2 - \ln CN_1}$$
$$DT = \frac{\ln 2}{RGR}$$
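The RGR and DT definitions above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the code used in this study: the function name `growth_metrics` and the sample counts are hypothetical, and the natural logarithm is assumed.

```python
import math


def growth_metrics(yearly_counts):
    """Return (year, cumulative, RGR, DT) rows for consecutive years.

    yearly_counts: list of (year, papers) pairs sorted by year.
    RGR and DT are None for the first year, which has no predecessor.
    """
    rows = []
    cumulative = 0
    previous = None
    for year, papers in yearly_counts:
        cumulative += papers
        if previous is None or previous == 0:
            rgr, dt = None, None
        else:
            # Year-wise computation, so T2 - T1 = 1.
            rgr = math.log(cumulative / previous)
            # DT is undefined when there is no growth (RGR = 0).
            dt = math.log(2) / rgr if rgr > 0 else None
        rows.append((year, cumulative, rgr, dt))
        previous = cumulative
    return rows


# Hypothetical yearly counts, NOT the values reported in Table 1.
sample = [(2010, 10), (2011, 10), (2012, 20), (2013, 40)]
for row in growth_metrics(sample):
    print(row)
```

In this made-up sample the cumulative output doubles every year, so every computed RGR equals ln 2 and every DT equals 1 year.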
Table 1 presents the sequential distribution of research output, cumulative output, RGR, DT,
mean RGR and mean DT for the period 2010-2014 for data obtained from both WoK and
Scopus. We can see from the table that total research output in both WoK and Scopus
has increased significantly. The RGR and DT values, though impressive for an emerging
discipline, fluctuate in later years. Overall, there is a clear trend of high growth in research
output on ‘Big Data’ in both WoK and Scopus. We have also computed the country-wise
research output distribution of the data obtained from WoK and Scopus. Table 2 presents
the year-wise research output, indexed in WoK and Scopus, for some of the top output-producing
countries. We observe that out of 1,415 and 6,810 publication records in WoK and Scopus,
respectively, 48.98% and 17.05% are contributed by the United States alone. China, the United
Kingdom and Germany stand at the 2nd, 3rd and 4th positions, respectively, in terms of the total
research output produced. We have also plotted the country-level collaboration network in figure
1 to get an idea about the country-level ICP characteristics of ‘Big Data’ research. It can be
clearly observed from the figure that the ‘United States - China’ tie is the strongest ICP instance,
followed by ‘United States - United Kingdom’. Further, ‘United States’ has the highest ICP
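As an illustration of how country-level ties of this kind can be counted, the following Python sketch tallies internationally collaborative papers per country pair. It assumes the list of countries for each paper has already been extracted from the author affiliations; the function name `country_ties` and the sample data are hypothetical and do not reproduce the network in figure 1.

```python
from collections import Counter
from itertools import combinations


def country_ties(papers):
    """Count internationally collaborative papers (ICP).

    papers: iterable of country-name collections, one per paper.
    Returns (ties, icp_per_country): papers per country pair, and
    internationally collaborative papers per country.
    """
    ties = Counter()
    icp_per_country = Counter()
    for countries in papers:
        distinct = sorted(set(countries))
        if len(distinct) < 2:
            continue  # single-country paper, not an ICP
        for country in distinct:
            icp_per_country[country] += 1
        # Each unordered pair of countries on the paper is one tie.
        for pair in combinations(distinct, 2):
            ties[pair] += 1
    return ties, icp_per_country


# Hypothetical affiliation data for four papers.
sample = [
    ["United States", "China"],
    ["United States", "China", "Germany"],
    ["United States"],
    ["United Kingdom", "United States"],
]
ties, icp = country_ties(sample)
print(ties.most_common(1))  # strongest tie in the sample
print(icp.most_common(1))   # country with the most ICP in the sample
```

Pairs are stored in alphabetical order so that ‘United States - China’ and ‘China - United States’ count as the same tie.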
Our second parameter of analysis is the authorship and collaboration patterns observed in research
output on ‘Big Data’. In addition to plotting the year-wise authorship trend (1, 2, 3 and >3 authors),
we have also computed the standard parameters Collaboration Index (CI), Degree of Collaboration
(DC) and Collaborative Coefficient (CC). The CI measures the mean number of authors per
paper. A measure of the degree of collaboration
should have a value between 0 and 1, where 0 corresponds to all output being single authored
and 1 represents all papers being maximally authored (Ajiferuke et al., 1988). We define the
CI as:
$$CI = \frac{\sum_{j=1}^{k} j f_j}{N}$$
where $f_j$ is the number of papers with $j$ authors, $k$ is the largest number of authors on any
paper and $N$ is the total number of papers.
This index gives the mean number of authors per paper. It has no upper limit and hence
cannot be interpreted as a degree. Further, it gives a non-zero weight to single-authored, i.e.
non-collaborative, papers. Therefore, other parameters are also computed. The Degree of
Collaboration (DC) is defined as:
$$DC = 1 - \frac{f_1}{N}$$
where $f_1$ is the number of single-authored papers. This index can be interpreted as a degree, as its
value lies between 0 and 1; it gives zero weight to single-authored papers and takes the value 1 for
maximum collaboration. It ranks higher a discipline with a higher number of multi-authored papers
but does not differentiate between levels of multiple authorship. The Collaborative Coefficient
(CC) is defined as:
$$CC = 1 - \frac{\sum_{j=1}^{k} (1/j) f_j}{N}$$
Here, every paper carries a fixed amount of credit, and each author gets 1/j credit for a paper with
j authors. The value of CC lies between 0 and 1. This parameter has both an upper bound and
the capacity to distinguish between various levels of multi-authored papers. We have computed all these
parameters for the data. Table 3 shows the year-wise distribution of the number of papers having
1, 2, 3 and >3 authors and the CI, DC and CC values, for both the WoK and the Scopus data.
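The three collaboration measures can be sketched in Python as follows. This is an illustrative example only: the function name `collaboration_measures` and the sample author counts are hypothetical, and the input is assumed to be the exact number of authors of each paper (a binned ">3 authors" category is not sufficient for CI and CC).

```python
def collaboration_measures(author_counts):
    """Compute (CI, DC, CC) from a list with one author count per paper."""
    n = len(author_counts)
    if n == 0:
        raise ValueError("need at least one paper")
    # CI: mean number of authors per paper.
    ci = sum(author_counts) / n
    # DC: share of papers that are not single-authored.
    single_authored = sum(1 for j in author_counts if j == 1)
    dc = 1 - single_authored / n
    # CC: one minus the mean per-author credit 1/j.
    cc = 1 - sum(1 / j for j in author_counts) / n
    return ci, dc, cc


# Hypothetical sample: two single-authored papers, one with 2 authors
# and one with 4 authors.
ci, dc, cc = collaboration_measures([1, 1, 2, 4])
print(ci, dc, cc)  # 2.0 0.5 0.3125
```

In the sample, DC treats the two-author and four-author papers alike, whereas CC gives the four-author paper more weight, which is the distinction drawn in the text.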
6. Major Contributors
We have identified the major journals publishing research on ‘Big Data’, the top research output
producing institutions and the most productive authors. First, we analyzed the WoK and
Scopus data to identify the journals that publish the most research
output on ‘Big Data’. We have also computed H-index (Hirsch, 2005), Total Citations (TC) and
Average Citations Per Paper (ACPP) values for each of these journals. Table 4 shows the top
journals (arranged according to Total Papers (TP) in WoK) that published research on ‘Big Data’
during the last 5 years. We observe that the ‘Computer’ magazine published by the IEEE Computer
Society tops the list with a total of 26 papers, an aggregate ACPP of 1.62 and an aggregate H-index
value of 4. This is followed by the journal ‘PLoS ONE’ and so on. The journals ‘Future Generation
Computer Systems’, ‘Health Affairs’, ‘Nature’ and ‘Science’ are other prominent journals that
After identifying top publication sources, we moved on to identify the major institutions having
a significant amount of research published on ‘Big Data’. We analyzed the data and identified the
top contributing institutions to ‘Big Data’ research during the 5-year period. We have
computed TC, ACPP, H-index, G-index, HG-index and P-index values for the data
corresponding to each of these institutions. The G-index (Egghe, 2006) is calculated based on the
distribution of citations: given a set of articles ranked in decreasing order of the number of citations
received, the G-index is the (unique) largest number such that the top g articles received
together at least $g^2$ citations. The HG-index (Alonso et al., 2010) is the geometric mean of the
H-index and the G-index:
$$HG = \sqrt{H \cdot G}$$
The P-index (Prathap, 2010) is computed as:
$$P\text{-index} = \left(C \cdot \frac{C}{P}\right)^{1/3}$$
where P is the total number of papers and C is the total number of citations. The P-index is designed
to balance quality (C/P) against quantity (C). Table 5 shows the top 15 contributing institutions to
‘Big Data’ research as measured from WoK data. The table displays the TP, TC, ACPP, H-index,
G-index, HG-index and P-index values. ‘Harvard University’ stands at first place on
all the parameters. The ‘University of London’, ‘MIT’ and ‘Stanford University’ are a few other
major contributors to ‘Big Data’ research during the last 5-year period. Different institutions,
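The citation-based indices named in this section can be sketched in Python as follows. This is an illustrative example built on the standard published definitions (Hirsch, 2005; Egghe, 2006; Alonso et al., 2010; Prathap, 2010), not the code used in this study. The function names and sample citation counts are hypothetical, and the G-index is assumed here to be capped at the number of papers.

```python
import math


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)


def g_index(citations):
    """Largest g such that the top g papers together have >= g*g citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g


def hg_index(citations):
    """Geometric mean of the H-index and the G-index."""
    return math.sqrt(h_index(citations) * g_index(citations))


def p_index(citations):
    """(C * C / P) ** (1/3) for C total citations and P papers."""
    papers = len(citations)
    if papers == 0:
        return 0.0
    total = sum(citations)
    return (total * total / papers) ** (1 / 3)


# Hypothetical citation counts for five papers of one institution.
sample = [10, 8, 5, 4, 3]
print(h_index(sample), g_index(sample))  # 4 5
print(hg_index(sample), p_index(sample))
```

For the sample, the HG-index is the square root of 4 × 5 = 20 and the P-index is the cube root of 30 × 30 / 5 = 180.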
We have also identified the most productive authors on ‘Big Data’ research from the WoK data.
Table 6 shows the 10 most productive authors identified for the 5-year research
data, along with their TP and TC values. We observe that ‘JJ Chen’ and
‘Y Liu’ are the two most productive authors in the 5-year research output on ‘Big Data’. We
have also identified the co-authorship cliques for the top authors. Figures 2, 3 and 4 show the
co-authorship cliques for the three most productive authors. Further, we have plotted the
top 10 most productive as well as the top 10 most cited authors on a TP-TC plot (figure 5) to
identify the most productive authors and their impact. We observe that none of the authors appears in
both the most productive and the most cited lists (lists of top 10 authors based on WoK data).
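The comparison of the most productive and most cited author lists can be illustrated with the following Python sketch. It assumes full counting (every co-author receives full credit for a paper and for its citations), which is our assumption and is not stated in the text. The function names and the sample papers are hypothetical.

```python
from collections import defaultdict


def author_totals(papers):
    """papers: iterable of (authors, citations). Returns {author: (TP, TC)}."""
    tp = defaultdict(int)
    tc = defaultdict(int)
    for authors, citations in papers:
        for author in set(authors):
            tp[author] += 1          # full counting of papers
            tc[author] += citations  # full counting of citations
    return {author: (tp[author], tc[author]) for author in tp}


def top_n(totals, n, by):
    """Top n authors by 'TP' or 'TC'; ties are broken alphabetically."""
    index = 0 if by == "TP" else 1
    ranked = sorted(totals, key=lambda a: (-totals[a][index], a))
    return ranked[:n]


# Hypothetical papers: (author list, citations received).
sample = [
    (["A", "B"], 1),
    (["A"], 0),
    (["A", "B"], 2),
    (["C"], 50),
    (["D"], 30),
]
totals = author_totals(sample)
most_productive = top_n(totals, 2, by="TP")
most_cited = top_n(totals, 2, by="TC")
print(most_productive, most_cited)
print(set(most_productive) & set(most_cited))  # authors on both lists
```

In this made-up sample the two lists do not overlap, mirroring the kind of outcome reported above; each (TP, TC) pair is also the coordinate of one author on a TP-TC plot.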
The research on ‘Big Data’ is not confined to Computer Science only. Many disciplines have
contributed to different aspects of ‘Big Data’ research. We have tried to identify the
discipline-wise research output for ‘Big Data’ from the WoK data. The number and details of the disciplines
used are described in the Appendix. We mapped multiple subject classes of WoK to broader
broader disciplines along with their percentage contribution to the total research output indexed
in WoK for the 5-year period. We observe that Computer Science contributes a total of 708 out
of 1,415 publications, which constitutes approximately 50% of the total output. Thus, contrary to
what one may believe, about 50% of the ‘Big Data’ research output is from disciplines other than
Sciences, Medical Sciences, Management and Healthcare are some of the major contributing
disciplines to ‘Big Data’ research. A research publication may belong to more than one
discipline (due to interdisciplinary outputs), and hence the total percentage value here is greater
than 100.
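The following Python sketch illustrates why the discipline percentages can add up to more than 100. It is not the mapping used in this study: the dictionary `CATEGORY_TO_DISCIPLINE`, the function name `discipline_shares` and the sample records are hypothetical placeholders for the mapping described in the Appendix.

```python
from collections import Counter

# Hypothetical mapping from subject classes to broader disciplines.
CATEGORY_TO_DISCIPLINE = {
    "Computer Science, Information Systems": "Computer Science",
    "Computer Science, Theory & Methods": "Computer Science",
    "Health Care Sciences & Services": "Healthcare",
    "Management": "Management",
}


def discipline_shares(records):
    """Percentage of records per broader discipline.

    records: list of subject-class lists, one list per publication.
    A publication counts once for every distinct discipline it maps to.
    """
    counts = Counter()
    for categories in records:
        disciplines = {
            CATEGORY_TO_DISCIPLINE[c]
            for c in categories
            if c in CATEGORY_TO_DISCIPLINE
        }
        counts.update(disciplines)
    total = len(records)
    return {d: 100 * n / total for d, n in counts.items()}


# Hypothetical sample of four publications.
sample = [
    ["Computer Science, Information Systems", "Computer Science, Theory & Methods"],
    ["Computer Science, Information Systems", "Management"],
    ["Health Care Sciences & Services"],
    ["Management"],
]
shares = discipline_shares(sample)
print(shares)
print(sum(shares.values()))  # exceeds 100 because of the interdisciplinary paper
```

The second sample publication is counted under both Computer Science and Management, so the shares sum to 125 rather than 100.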
The second major text-analytics based outcome that we tried to derive concerns the major
research themes/topics in ‘Big Data’ research. For this purpose, we first extracted all
distinct author keywords in the WoK and Scopus research output data. The occurrence
frequencies of all the distinct author keywords are computed and the author keywords are
high-frequency important terms (hereafter called control terms) and identified the number of
research papers on each such keyword. Table 8 shows the year-wise distribution of research output
on selected control terms. We see that ‘business intelligence’, ‘cloud computing’, ‘clustering’,
‘map reduce’, ‘hadoop’ and ‘nosql’ are some of the prominent control terms. A significant amount
of research output is on the selected control terms, which happen to be the major themes of research
in ‘Big Data’. We have also plotted the control terms on a density plot in figure
data. The density plot also shows the prominent research themes/topics in ‘Big Data’ research.
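The keyword-frequency step and the year-wise tabulation of control terms can be sketched in Python as follows. This is an illustrative example, not the pipeline used in this study: the function names and sample records are hypothetical, and keywords are normalized only by lower-casing and whitespace trimming, so spelling variants such as ‘map reduce’ and ‘mapreduce’ are not merged.

```python
from collections import Counter, defaultdict


def normalize(keyword):
    """Lower-case a keyword and collapse surrounding/internal whitespace."""
    return " ".join(keyword.lower().split())


def keyword_frequencies(records):
    """Number of papers in which each distinct author keyword occurs.

    records: iterable of (year, author_keywords) pairs, one per paper.
    """
    freq = Counter()
    for _, keywords in records:
        freq.update({normalize(k) for k in keywords})
    return freq


def yearly_output(records, control_terms):
    """Year-wise number of papers per control term: {term: Counter(year)}."""
    terms = {normalize(t) for t in control_terms}
    table = defaultdict(Counter)
    for year, keywords in records:
        for term in terms & {normalize(k) for k in keywords}:
            table[term][year] += 1
    return table


# Hypothetical author-keyword records.
sample = [
    (2012, ["Hadoop", "Big Data"]),
    (2013, ["hadoop", "MapReduce"]),
    (2013, ["NoSQL", "Hadoop "]),
]
print(keyword_frequencies(sample).most_common(3))
print(dict(yearly_output(sample, ["Hadoop", "NoSQL"])))
```

Control terms would then be chosen manually from the high-frequency end of the first result, as described in the text.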
8. Conclusion
We have performed an analytical mapping of research on ‘Big Data’ during the last
5-year period. The research output data from both WoK and Scopus is used for the mapping and
detailed characterization of ‘Big Data’ research. We have presented analytical outcomes for
collaboration patterns and authorship type & collaboration patterns. All these analytical
outcomes include computation of standard scientometric parameter values, such as RGR, DT, CI,
DC, CC, H-index, G-index, HG-index, P-index etc. We have also identified major contributors to
‘Big Data’ research in the form of top journals publishing ‘Big Data’ research, top institutions
contributing to the research and the most productive and most cited authors in the area. In
approach to identify the discipline-wise research output on ‘Big Data’. We identify the important
control terms, plot them in a density plot and map the research output on the control terms.
on the emerging area of ‘Big Data’, which is very informative, useful and first of its kind on the
theme.
3 http://www.vosviewer.com/Home
9. Acknowledgements
This work is supported by research grants from Department of Science and Technology,
Government of India (Grant: INT/MEXICO/P-13/2012) and University Grants Commission of
India (Grant: F. No. 41-624/ 2012(SR)).
Appendix:
References
Ajiferuke, I., Burrell, Q. and Tague, J. (1988). Collaborative coefficient: A single measure of the
degree of collaboration in research. Scientometrics, 14(5), 421-433.
Alonso, S., Cabrerizo, F.J., Herrera-Viedma, E. and Herrera, F. (2010). hg-index: A new index to
characterize the scientific output of researchers based on the h- and g-indices. Scientometrics,
82(2), 391-400.
Boyd, D. and Crawford, K. (2012). Critical Questions for Big Data. Information, Communication
& Society, 15(5), 662-679. DOI: 10.1080/1369118X.2012.678878.
Cocosila, M., Serenko, A. and Turel, O. (2011). Exploring the management information systems
discipline: a scientometric study of ICIS, PACIS and ASAC. Scientometrics, 87(1), 1-16.
Ekbia, H., Mattioli, M., Kouper, I., Arave, G., Ghazinejad, A., Bowman, T., Suri, V.R., Tsou, A.,
Weingart, S. and Sugimoto, C.R. (2014). Big Data, Bigger Dilemmas: A Critical Review. Journal
of the Association for Information Science and Technology. DOI: 10.1002/asi.23294.
Finardi, U. (2011). Time relations between scientific production and patenting of knowledge: the
case of nanotechnologies. Scientometrics, 89(1), 37-50.
Gupta, B.M., Kshitij, A. and Verma, C. (2011). Mapping of Indian computer science research
output, 1999–2008. Scientometrics, 86(2), 261–283.
Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W. and Rhee, S.Y. (2008).
Big data: The future of biocuration. Nature, 455(7209), 47-50.
Jagadish, H.V. (2015). Big Data and Science: Myths and Reality. Big Data Research, (2), 49-52.
Jarić, I., Cvijanović, G., Knežević-Jarić, J. and Lenhardt, M. (2012). Trends in fisheries science
from 2000 to 2009: a bibliometric study. Reviews in Fisheries Science, 20(2), 70-79.
Jin, X., Wah, B.W., Cheng, X. and Wang, Y. (2015). Significance and Challenges of Big Data
Research. Big Data Research, (2), 59-64.
Karpagam, R., Gopalakrishnan, S., Babu, B.R. and Natarajan, M. (2012). Scientometric Analysis
of Stem cell Research: A comparative study of India and other countries. Collnet Journal
of Scientometrics and Information Management, 6(2), 229-252.
Karpagam, R., Gopalakrishnan, S., Natarajan, M. and Babu, B.R. (2011). Mapping of
nanoscience and nanotechnology research in India: a scientometric analysis, 1990–2009.
Scientometrics, 89(2), 501-522.
Kumar, S. and Garg, K.C. (2005). Scientometrics of computer science research in India and
China. Scientometrics, 64(2), 121-132.
Liesch, P.W., Håkanson, L., McGaughey, S.L., Middleton, S. and Cretchley, J. (2011). The
evolution of the international business field: a scientometric investigation of articles published in
its premier journal. Scientometrics, 88(1), 17-42.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Hung, A. (2011). Big
Data: The next frontier for innovation, competition, and productivity. Technical Report,
McKinsey Global Institute.
Onel, S., Zeid, A. and Kamarthi, S. (2011). The structure and analysis of nanotechnology
co-author and citation networks. Scientometrics, 89(1), 119-138.
Park, H.W. and Leydesdorff, L. (2013). Decomposing social and semantic networks in emerging
“big data” research. Journal of Informetrics, 7(3), 756-765.
Prathap, G. (2010). The 100 most prolific economists using the p-index. Scientometrics, 84(1),
167-172.
Singh, V.K., Uddin, A. and Pinto, D. (2015). Computer Science Research: The Top 100
Institutions in India and in the World. Scientometrics, 104(2), 529-553.
Singhal, K., Banshal, S.K., Uddin, A. and Singh, V.K. (2014). The information technology
knowledge infrastructure and research in South Asia. Journal of Scientometric Research, 3(4),
134-42.
Uddin, A. and Singh, V.K. (2014). Mapping the Computer Science Research in SAARC
Countries. IETE Technical Review, 31(4), 287-296.
Wu, Z. and Chin, O.B. (2014). From Big Data to Data Science: A Multi-disciplinary Perspective.
Big Data Research, (1), 1-1.