
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/345573305

EVOLUTION OF BIG DATA AND TOOLS FOR BIG DATA ANALYTICS

Article in Journal of Interdisciplinary Cycle Research · October 2020


1 author:

Ayesha Banu Mohd


Vaagdevi College of Engineering

All content following this page was uploaded by Ayesha Banu Mohd on 09 November 2020.



Journal of Interdisciplinary Cycle Research ISSN NO: 0022-1945

EVOLUTION OF BIG DATA AND TOOLS FOR BIG DATA ANALYTICS

Dr. Ayesha Banu, Dr. Md. Yakub


1. Assistant Professor, Dept. of CSE, Vaagdevi College of Engineering.
2. Assistant Professor of Commerce, Govt. Degree College, Mulugu.
ayesha_b@vaagdevi.edu.in, mdyakub02@gmail.com

ABSTRACT

The Internet has revolutionized the computer and communications world like never before. It plays
an important part in daily life today, connecting people through social networking sites and web
services such as Facebook, Twitter, Instagram, Google and Yahoo. Our social, personal and
professional lives now revolve around the World Wide Web. With so much information at our
fingertips, we add to the global data store every time we turn to a search engine for answers,
and this is giving rise to Big Data at an incredible pace. As Big Data grows, effective analysis
becomes very difficult with traditional data management tools and frameworks. Its properties,
such as volume, velocity, variety, variability, value and complexity, pose many challenges.
Companies are now moving towards Big Data tools and technologies for data analytics and decision
making. This paper traces the evolution of Big Data and sheds light on the challenges and issues
in adopting Big Data technology. It then focuses on the tools available to handle the greater
volume and variety of data and on the capabilities of these tools in Big Data Analytics.

KEY WORDS: Internet, Big Data, Data Analytics, Decision Making, Hadoop, MongoDB.

Volume XII, Issue X, October/2020 Page No:309



1. Introduction:

Big Data is not a completely new term; its history starts many years before the present buzz
around it. Decades ago, people were already using data analysis and analytics techniques to
support their decision-making. However, in the last two decades, the tremendous growth of
Internet and social networking use has increased the volume of data beyond human comprehension.
Doug Laney [1] characterizes Big Data in terms of three V’s and defines it as "a circumstance
where the volume, velocity and variety of data of an organization’s storage go beyond the
computation capacity for precise and well-timed decision making". The 3 V’s of Big Data, coined
in 2001, had grown to 42 V’s by the end of 2017 [2]. Figure 1 [3] illustrates the 3 V’s of
Big Data.

Figure 1: The 3 V’s of Big Data

Earlier, data was measured in kilobytes, megabytes and gigabytes. Today it is counted in
terabytes, petabytes and zettabytes. According to TechJury (2020) [4], humans produce 2.5
quintillion bytes of data every day.
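
To put these units in perspective, the daily figure quoted above can be converted with simple arithmetic. The following is a minimal sketch that assumes decimal (SI) prefixes, which the text does not specify; the constant and variable names are illustrative only.

```python
# Decimal (SI) byte units. This is an assumption: the text does not say
# whether decimal or binary prefixes are meant.
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18
ZETTABYTE = 10**21

# "2.5 quintillion bytes" per day, as quoted from [4] (1 quintillion = 10**18).
DAILY_BYTES = 25 * 10**17

# Express the daily volume in exabytes.
daily_exabytes = DAILY_BYTES / EXABYTE

# How many days it would take to accumulate one zettabyte at this rate.
days_per_zettabyte = ZETTABYTE / DAILY_BYTES

print(f"Daily volume: {daily_exabytes} EB")
print(f"Days to accumulate 1 ZB at this rate: {days_per_zettabyte:.0f}")
```

Under these assumptions the quoted daily volume corresponds to 2.5 exabytes, so a zettabyte would accumulate in roughly 400 days at a constant rate.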

Doug Laney [1] and In Lee [5] describe the 3 V’s of Big Data, as proposed in the computing industry, as follows.

Volume refers to the amount of data an organization or an individual collects and/or generates.
The minimum threshold of big data is currently about 1 terabyte, roughly the amount of data that
would be stored on 1,500 CDs or 220 DVDs, enough to hold around 16 million Facebook photographs.
E-commerce, social media, and sensors generate high volumes of unstructured data such as
audio, images, and video.
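
The CD, DVD and photograph equivalents in the paragraph above can be checked with a short calculation. This is a rough sketch only: the media capacities (700 MB per CD, 4.7 GB per single-layer DVD) and the use of a decimal terabyte are assumptions that the text does not state.

```python
TERABYTE = 10**12          # decimal terabyte, in bytes (assumed)
CD_BYTES = 700 * 10**6     # assumed capacity of a standard CD
DVD_BYTES = 47 * 10**8     # assumed capacity of a single-layer DVD (4.7 GB)
PHOTOS = 16 * 10**6        # "around 16 million" photographs, from the text

# Number of discs needed to hold one terabyte.
cds_needed = TERABYTE / CD_BYTES
dvds_needed = TERABYTE / DVD_BYTES

# Average photograph size implied by the 16 million figure.
bytes_per_photo = TERABYTE / PHOTOS

print(f"CDs: {cds_needed:.0f}, DVDs: {dvds_needed:.0f}")
print(f"Implied size per photograph: {bytes_per_photo / 1000:.1f} KB")
```

With these assumed capacities the results land slightly below the rounded figures quoted in the text, and the photograph count implies an average size of about 62.5 KB per image.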

Velocity refers to the speed at which data are generated and processed. The velocity of data
increases over time. Initially, companies analyzed data using batch processing systems because
of the slow and expensive nature of data processing. As the speed of data generation and
processing increased, real-time processing became the norm for computing applications.
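
The difference between batch and real-time processing can be illustrated with a small sketch. The function names and the sample readings below are hypothetical; the example only contrasts computing a result once over stored data with updating it as each value arrives.

```python
from typing import Iterable, Iterator, List


def batch_mean(readings: List[float]) -> float:
    """Batch style: wait until all readings are stored, then compute once."""
    return sum(readings) / len(readings)


def streaming_mean(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the result as each reading arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0, 40.0]   # hypothetical sensor readings
batch_result = batch_mean(readings)
running_results = list(streaming_mean(readings))

print(batch_result)
print(running_results)
```

Both approaches agree on the final value, but the streaming version has an up-to-date answer available after every reading instead of only at the end.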


Variety refers to the number of data types. Technological advances allow organizations to
generate various types of structured, semi-structured, and unstructured data. Text, photo, audio,
video, clickstream data, and sensor data are examples of unstructured data, which lack the
standardized structure required for efficient computing. Semi-structured data does not conform to
the specifications of a relational database, but can be specified to meet certain structural needs
of applications. An example of semi-structured data is Extensible Business Reporting Language
(XBRL), developed to exchange financial data between organizations and government agencies.
Structured data is predefined and can be found in many types of traditional databases.
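
The three kinds of data can be contrasted in a few lines of code. The sketch below uses invented sample data; JSON stands in for a semi-structured format here (XBRL itself is XML-based), and the field names are hypothetical.

```python
import csv
import io
import json

# Structured: fixed columns known in advance, as in a relational table.
structured = io.StringIO("customer_id,amount\n101,250.00\n102,99.50\n")
rows = list(csv.DictReader(structured))
total_amount = sum(float(row["amount"]) for row in rows)

# Semi-structured: self-describing records whose fields may vary.
semi_structured = '[{"id": 1, "tags": ["sale"]}, {"id": 2, "note": "no tags"}]'
records = json.loads(semi_structured)
tag_counts = [len(record.get("tags", [])) for record in records]

# Unstructured: no schema at all; the analysis has to impose structure.
unstructured = "Great phone, but the battery drains fast."
words = unstructured.lower().replace(",", "").replace(".", "").split()

print(total_amount, tag_counts, words)
```

The structured rows can be aggregated directly, the semi-structured records need a default for missing fields, and the free text must be tokenized before any analysis is possible.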

The application of big data analytics is one promising breakthrough. Big data analytics, which
evolved from business intelligence and decision support systems, enables many organizations to
analyze an immense volume, variety and velocity of data across a wide range of networks to
support evidence-based decision making and action taking [6]. Big Data therefore requires a new
set of tools, applications and frameworks.

2. Evolution of Big Data:

The term ‘Big Data’ has been in use since the early 1990s. John R. Mashey is credited with
popularizing the term [7]. Big Data is not something completely new or used only in the last two
decades. People have long used data analysis and analytics techniques to support their
decision-making. The tremendous increase in both structured and unstructured data sets made
traditional data analysis very difficult, and this gave rise to ‘Big Data’ in the last decade.
The evolution of Big Data can be classified into 3 phases; each phase has its own
characteristics and capabilities and has contributed to the contemporary meaning of Big Data.

Phase I: Big Data originates from the domain of database management. It depends mostly on the
storage, extraction, and optimization of data stored in Relational Database Management
Systems (RDBMS). Database management and data warehousing are the two core components
of Big Data in the first phase. This phase laid the foundation for modern data analysis and for
techniques such as database queries, online analytical processing and standard reporting tools.
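
The Phase I techniques can be illustrated with a small relational example. The sketch below uses Python's built-in sqlite3 module with an invented `sales` table; the table, columns and values are hypothetical and serve only to contrast a plain database query with an OLAP-style roll-up.

```python
import sqlite3

# In-memory relational database with a small, invented sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "Laptop", 1200.0),
        ("North", "Phone", 800.0),
        ("South", "Laptop", 1500.0),
        ("South", "Phone", 500.0),
    ],
)

# A plain database query: filter individual rows.
laptop_rows = conn.execute(
    "SELECT region, amount FROM sales WHERE product = 'Laptop' ORDER BY region"
).fetchall()

# An OLAP-style roll-up: aggregate a measure along one dimension.
totals_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(laptop_rows)
print(totals_by_region)
```

A standard report in this phase is essentially a formatted view of aggregates like `totals_by_region`.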

Phase II: From the early 2000s, use of the Internet and the Web began to offer unique data
collection and data analysis opportunities. Companies such as Yahoo, Amazon and eBay
expanded their online stores and started analyzing customer behavior for personalization.
HTTP-based web content massively increased the volume of semi-structured and unstructured data.
Organizations had to find new approaches and storage solutions to deal with these new data
types and analyze them effectively. In later years, the growth of social media data intensified
the need for tools, technologies and analytics techniques able to extract meaningful
information from this unstructured data.

Phase III: Over the past decade, the large-scale use of smartphones with a variety of
Internet-based applications has made it possible to analyze behavioral data (such as clicks and
search queries) and location-based data (GPS data). At the same time, the rise of sensor-based,
Internet-enabled devices, termed the ‘Internet of Things’ (IoT), has millions of TVs,
thermostats, wearables and even refrigerators generating enormous volumes of data every day.
This incredible growth of ‘Big Data’ started a race to extract meaningful and valuable
information from these new data sources, and gave rise to another new term, ‘Big Data Analytics’.

Table 1 gives the summary of the three phases in Big Data [7].

Phase I                                 Phase II                                  Phase III
DBMS-based, structured content          Web-based, unstructured content           Mobile and sensor-based content
1. RDBMS & data warehousing             1. Information retrieval and extraction   1. Location-aware analysis
2. Extract, Transform, Load (ETL)       2. Opinion mining                         2. Person-centered analysis
3. Online Analytical Processing         3. Question answering                     3. Context-relevant analysis
4. Dashboards & scorecards              4. Web analytics and web intelligence     4. Mobile visualization
5. Data mining & statistical analysis   5. Social media analytics                 5. Human-Computer interaction
                                        6. Social network analysis
                                        7. Spatial-temporal analysis

Table 1: Summary of Evolution of Big Data

3. Big Data Analytics

Advanced analytic techniques that operate on big data sets are termed Big Data Analytics
(BDA). The term combines big data and data analytics, which together have created one of the
most profound trends in Business Intelligence (BI) today [8].

With the evolution of technology and the increased volumes of data flowing in and out of
organizations daily, faster and more efficient ways of analyzing such data have become
necessary. There is therefore a need for new tools and methods specialized for big data
analytics, as well as for the architectures required to store and manage such data [9]. The Big
Data, Analytics, and Decisions (B-DAD) framework was proposed [10] to incorporate big data
analytics tools and methods into the decision making process. This framework maps the
different big data storage, management, and processing tools, analytics tools and methods, and
visualization and evaluation tools to the different phases of the decision making process. Hence,
the changes associated with big data analytics are reflected in three main areas: big data storage
and architecture, data and analytics processing, and, finally, the big data analysis which can be
applied for knowledge discovery and decision making.

The current and emerging focus of big data analytics is to apply traditional techniques such as
rule-based systems, pattern mining, decision trees and other data mining techniques to develop
business rules efficiently even on large data sets. This can be achieved either by developing
algorithms that use distributed data storage and in-memory computation, or by using cluster
computing for parallel computation [11].
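
The idea of partitioning data, computing on the partitions in parallel and merging the partial results can be sketched as follows. This is an illustrative example only: worker threads stand in for cluster nodes to show the split-and-merge structure, the transaction data is invented, and item frequency counting is used as a simple stand-in for the first step of pattern mining.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def count_items(partition: List[str]) -> Counter:
    """Count item frequencies within one partition of the data."""
    return Counter(partition)


def parallel_item_counts(partitions: List[List[str]]) -> Counter:
    """Count each partition in its own worker, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    merged = Counter()
    for partial in partial_counts:
        merged.update(partial)
    return merged


# Hypothetical purchase data, already split into three partitions.
partitions = [
    ["bread", "milk", "bread"],
    ["milk", "eggs"],
    ["bread", "eggs", "eggs"],
]
item_counts = parallel_item_counts(partitions)
print(item_counts)
```

Because each partition is counted independently, the same structure carries over to a setting where the partitions live on different machines and only the small partial counts are sent back for merging.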

4. Applications of Big Data Analytics

Big data analytics was adopted early by sectors such as Telecommunication, Retail
and Finance, and today no sector is left untouched. The applications
of big data analytics in various sectors are discussed as follows [12]:

Figure 2: Applications of Big Data Analytics

Healthcare

Data analysts obtain and analyze information from multiple sources to gain insights. These
sources include electronic patient records; clinical decision support systems, including medical
imaging, physicians’ written notes and prescriptions, pharmacy and laboratory data; clinical
data; and machine-generated sensor data. The Rizzoli Orthopedic Institute in Bologna, Italy
analyzed the symptoms of individual patients to understand clinical variations within a family.
This helped to reduce the number of imaging tests and hospitalizations by 60% and 30%,
respectively [13].

Banking

The investment worthiness of customers can be analyzed using demographic details,
behavioral data, and financial and employment information. Cross-selling can be used here to
target specific customer segments based on past buying behavior, demographic details and
sentiment analysis, along with CRM data.

Education

With the advent of computerized course modules, it is possible to assess academic
performance in real time. This helps to monitor students’ performance after each module
and give immediate feedback on their learning patterns. It also helps teachers to assess their
pedagogy and modify it based on students’ performance and needs. Dropout patterns,
students requiring special attention and students who can handle challenging assignments can be
predicted.

5. Opportunities and Challenges with Big Data

Privacy and Security

When a person’s personal information is combined with large external data sets, new facts about
that person can be inferred. Such facts may be private, and the person might not want the data
owner, or anyone else, to know them. Another important consequence is social stratification: a
literate person can take advantage of Big Data predictive analysis, while the underprivileged
are easily identified and treated worse.
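
The inference risk described above can be made concrete with a small record-linkage sketch. All records, names and attribute choices below are invented for illustration; the example only shows how joining a data set without names to an external one on shared attributes can attach an identity to a sensitive fact.

```python
# Hypothetical records, invented for illustration only.
health_records = [
    {"zip": "50601", "birth_year": 1985, "diagnosis": "diabetes"},
    {"zip": "50602", "birth_year": 1990, "diagnosis": "asthma"},
]
public_directory = [
    {"name": "Person A", "zip": "50601", "birth_year": 1985},
    {"name": "Person B", "zip": "50602", "birth_year": 1972},
]


def link_records(anonymous, identified):
    """Join two data sets on the shared attributes (zip, birth_year)."""
    index = {(p["zip"], p["birth_year"]): p["name"] for p in identified}
    linked = []
    for record in anonymous:
        key = (record["zip"], record["birth_year"])
        if key in index:
            linked.append((index[key], record["diagnosis"]))
    return linked


inferred = link_records(health_records, public_directory)
print(inferred)
```

Neither data set reveals a named person's diagnosis on its own; the new fact appears only when the two are combined.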

Data Access Complexity

If data is to be used to make accurate and timely decisions, it must be available in an
accurate, complete and timely manner. This makes data management and governance somewhat
complex, and adds the need to make data open and available to government agencies in a
standardized manner, with standardized APIs, metadata and formats, leading to better decision
making, business intelligence and productivity improvements.

Storage Issues

The storage available is not enough for the large amount of data being produced by almost
everything: social media sites are themselves a major contributor, along with sensor devices.
Because of the rigorous demands big data places on networks, storage and servers, outsourcing
the data to the cloud may seem an option. However, uploading this large amount of data to the
cloud does not solve the problem, since Big Data insights require collecting all the data and
then linking it in a way that extracts important information. Terabytes of data take a long time
to upload to the cloud, and the data changes so rapidly that it is hard to upload in real time.
At the same time, the cloud’s distributed nature is also problematic for big data analysis. The
cloud issues with Big Data can thus be categorized into capacity and performance issues.
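
The upload-time concern can be estimated with simple arithmetic. The sketch below is an idealized calculation: the link speed of 100 Mbit/s and the 10 TB data size are assumed values chosen for illustration, and protocol overhead and congestion are ignored.

```python
def upload_hours(data_bytes: float, link_megabits_per_second: float) -> float:
    """Ideal transfer time in hours, ignoring protocol overhead and congestion."""
    bits = data_bytes * 8
    seconds = bits / (link_megabits_per_second * 10**6)
    return seconds / 3600


TERABYTE = 10**12  # decimal terabyte, in bytes (assumed)

# Assumed scenario: 10 TB of data over a sustained 100 Mbit/s uplink.
hours_for_10_tb = upload_hours(10 * TERABYTE, 100)
print(f"{hours_for_10_tb:.1f} hours ({hours_for_10_tb / 24:.1f} days)")
```

Under these assumptions the transfer takes more than nine days, during which a rapidly changing data set would already be out of date.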

6. Tools for Big Data Analytics

A large number of tools are available to process big data. In this section, we discuss some
current techniques for analyzing big data, with emphasis on important emerging tools, namely
Hadoop, MapReduce, Apache Spark, and Storm.

Apache Hadoop and MapReduce

The most established software platform for big data analysis is Apache Hadoop with MapReduce.
It consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS),
Apache Hive and other components. MapReduce is a programming model for processing large
datasets, based on the divide-and-conquer method, which is implemented in two steps: a Map step
and a Reduce step. Hadoop works on two kinds of nodes: master nodes and worker nodes. In the Map
step, the master node divides the input into smaller sub-problems and distributes them to worker
nodes. In the Reduce step, the master node then combines the outputs of all the sub-problems.
Together, Hadoop and MapReduce form a powerful software framework for solving big data
problems, providing fault-tolerant storage and high-throughput data processing [15].
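
The Map and Reduce steps can be illustrated with the classic word count problem. The following is a single-process sketch of the programming model, not the Hadoop API: the function names are hypothetical, and the intermediate grouping step (`shuffle`) is an addition commonly described alongside MapReduce rather than something stated in the text above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_step(chunk: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key before reducing."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks: List[str]) -> Dict[str, int]:
    """Run the whole pipeline; each chunk would go to a separate worker node."""
    intermediate: List[Tuple[str, int]] = []
    for chunk in chunks:
        intermediate.extend(map_step(chunk))
    return reduce_step(shuffle(intermediate))


counts = word_count(["big data tools", "big data analytics", "data"])
print(counts)
```

Because `map_step` looks at one chunk at a time and `reduce_step` at one key at a time, both can be spread across many machines without changing the logic.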

Apache Spark

Apache Spark is an open-source big data processing framework built for speed and
sophisticated analytics. It is easy to use and was originally developed in 2009 at UC Berkeley’s
AMPLab. It was open sourced in 2010 and later became an Apache project. Spark lets you quickly
write applications in Java, Scala, or Python. In addition to MapReduce-style operations, it
supports SQL queries, streaming data, machine learning, and graph data processing. Spark can run
on top of existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and
additional functionality.

Storm

Storm is a distributed, fault-tolerant real-time computation system for processing large volumes
of streaming data. It is designed specifically for real-time processing, in contrast with Hadoop,
which is designed for batch processing. It is also easy to set up and operate, scalable and
fault-tolerant, and offers competitive performance. A Storm cluster is superficially similar to a
Hadoop cluster. On a Storm cluster, users run different topologies for different Storm tasks,
whereas the Hadoop platform runs MapReduce jobs for the corresponding applications [15].
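
The notion of a topology, a graph of processing stages through which stream items flow one at a time, can be sketched with plain Python generators. This is an analogy only and does not use the Storm API; the stage names and the sample sentences are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def sentence_source() -> Iterator[str]:
    """Source stage: emits items one at a time, like an unbounded stream."""
    yield from ["storm processes streams", "hadoop processes batches"]


def split_words(sentences: Iterable[str]) -> Iterator[str]:
    """Processing stage: transforms each incoming item as it arrives."""
    for sentence in sentences:
        yield from sentence.split()


def rolling_count(words: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Processing stage: emits an updated count after every word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield dict(counts)


# Wire the stages into a linear "topology": source -> split -> count.
topology = rolling_count(split_words(sentence_source()))
snapshots = list(topology)
print(snapshots[-1])
```

Unlike a batch job, which produces one result when all input has been read, each stage here emits output continuously, so an up-to-date count is available after every word.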

7. Conclusion:

Even a small amount of data can prove to be an asset, so one can appreciate how valuable big
data is to an organization. Big data has brought about a revolution in almost every field,
whether health, marketing, entertainment or any other field that uses data or information in one
way or another. Understanding the concept of big data and using it efficiently to increase
productivity has become a concern for every organization, small or large. This paper covered
various aspects of big data: the factors that made data grow exponentially, the evolution of
data, big data analytics and its applications. It also examined the opportunities and challenges
of big data, along with the popular tools used for big data analytics.


References:
1. Doug Laney (2001). ‘3D Data Management: Controlling Data Volume, Velocity and
Variety’, Gartner, file No. 949, 6 February 2001,
http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-ControllingData-Volume-Velocity-and-Variety.pdf
2. https://www.elderresearch.com/blog/42-v-of-big-data
3. https://www.whishworks.com/blog/big-data/understanding-the-3-vs-of-big-data-volume-velocity-and-variety
4. https://techjury.net/blog/how-much-data-is-created-every-day/#gref
5. In Lee (2017). ‘Big data: Dimensions, evolution, impacts, and challenges’, Business
Horizons Volume 60, Issue 3, May–June 2017, Pages 293-303, Elsevier, Science Direct.
6. Yichuan Wang, LeeAnn Kung, Terry Anthony Byrd (2018). ‘Big data analytics:
Understanding its capabilities and potential benefits for healthcare organizations’
Technological Forecasting and Social Change Volume 126, January 2018, Pages 3-13,
Elsevier, Science Direct.
7. https://www.bigdataframework.org/short-history-of-big-data/
8. Philip Russom (2011). ‘Big Data Analytics’, TDWI Best Practices Report, Fourth
Quarter 2011.
9. Elgendy N., Elragal A. (2014) Big Data Analytics: A Literature Review Paper. In: Perner
P. (eds) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2014.
Lecture Notes in Computer Science, vol 8557. Springer, Cham.
https://doi.org/10.1007/978-3-319-08976-8_16
10. Nada Elgendy, Ahmed Elragal (2016). ‘Big Data Analytics in Support of the Decision
Making Process’; October 2016, pp. 1071-1084.
11. Abhay Kumar Bhadani, Dhanya Jothimani (2016). “Big Data: Challenges, Opportunities,
and Realities”, Effective Big Data Management and Opportunities for Implementation
IGI Global.
12. https://www.digitalvidya.com/blog/big-data-applications/
13. Wullianallur Raghupathi, Viju Raghupathi (2014). ‘Big data analytics in healthcare:
promise and potential’, Health Information Science and Systems 2014, 2:3
http://www.hissjournal.com/content/2/1/3
14. Avita Katal, Mohammad Wazid, R H Goudar (2013). “Big Data: Issues, Challenges,
Tools and Good Practices” , 2013 Sixth International Conference on Contemporary
Computing (IC3), IEEE Xplore: 26 September 2013.
15. D. P. Acharjya, Kauser Ahmed P (2016). ‘A Survey on Big Data Analytics: Challenges,
Open Research Issues and Tools’, (IJACSA) International Journal of Advanced
Computer Science and Applications, Vol. 7, No. 2, 2016, pp 511-518.
