

Journal of Critical Reviews
ISSN: 2394-5125, Vol. 7, Issue 14, 2020

CLOUD COMPUTING AND BIG DATA: A COMPREHENSIVE ANALYSIS


E. Sweetline Priya1 and G. Suseendran2
1 Department of Computer Application, Madras Christian College, Chennai, India. Sweetlinepriya.edwin@gmail.com
2 Department of Information Technology, Vels Institute of Science, Technology and Advanced Studies (VISTAS), Chennai,
India. suseendar_1234@yahoo.co.in

Received: 09.04.2020 Revised: 11.05.2020 Accepted: 06.06.2020

Abstract
The Internet plays an important role in the modern world. With the increasing use of the internet and web-based applications, enterprises are turning to big data solutions to cope with changing demand. In this big data era, data is the most valuable asset, as businesses rely on it for prediction and decision making. With the substantial increase in the scale of big data, storing and managing it is a great challenge. Technologies such as cloud computing, however, offer cost-effective, well-suited and consistent on-demand services that industries can adopt for big data storage and analytics. This paper investigates the features and technologies available in big data and cloud computing and how cloud computing is used together with big data. The challenges faced in a cloud environment and in big data, and why the cloud is needed for big data, are also discussed. Finally, some future directions for our research are highlighted.

Keywords: Cloud computing, Big data, Hadoop, NoSQL, Data analytics

© 2020 by Advance Scientific Research. This is an open-access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
DOI: http://dx.doi.org/10.31838/jcr.07.14.32

INTRODUCTION
In recent years there has been a tremendous increase in internet usage globally. The Internet has already changed the way we live, and digital information rules the world. With this significant increase in internet usage, the amount of data generated has already exceeded 2.5 quintillion bytes per day. Almost all sectors offer online services. Most financial transactions are done digitally through online banking, Google Pay, BHIM, etc. In a fraction of a second, millions of message, audio and video communications happen on WhatsApp. TikTok is an app that produces numerous videos, which are seen and liked by people all over the world. Companies and industries opt for the Internet of Things (IoT) to monitor physical equipment, and IoT is transforming several enterprises, including manufacturing, transport, and oil and gas companies. The continuous rise in the volume of data produced by these organizations can be called big data due to (i) the rate at which data is created, (ii) the structured and unstructured nature of the data, (iii) the amount of data generated, and so on. Big data rules various sectors such as healthcare, manufacturing industries, smart cities and social media platforms such as Facebook, WhatsApp, Instagram, Twitter and many more. The most challenging part here is the accommodation of data in servers, as the rate of growth of data exceeds the capacity of companies and managing it with existing infrastructure is difficult. Hence cloud computing comes into the picture to address this issue with its pay-as-you-go model, virtualisation concept, elasticity, parallel data processing and, above all, data security.

Cloud computing is a very successful implementation of service-oriented architecture (SOA) designed mainly for computations that are massive and complex, and it is well suited for big data. The purpose of this study is to describe the features and technologies available in the cloud and big data, to find the relationship between them, and mainly to discuss the challenges and opportunities involved in the cloud and big data together.

RELATED WORK
Generally, businesses turn to new technologies to run long-term business and increase profit. Big data analysis helps them identify customer needs and provide better service. Cloud computing is a popular technology adopted for big data analysis because of its flexibility and capacity. Hence it is a key research area for academicians and industrialists.

Yang et al. [1] discussed various sources of big data such as earth sciences, the internet of things, social networks, astronomy, business and industry. They gave elaborate detail on the big data technology challenges and the cloud computing methodologies and tools used to address these challenges. Some of the cloud computing features that can be adopted for big data challenges are on-demand provisioning of resources, automatic scheduling of resources and scalability with cloud VMs.

Pyne et al. [2] discussed different big data analytical techniques, namely descriptive, predictive and prescriptive analytics. Regression and visualization come under descriptive analytics: in regression, data variables are correlated using charts, and in visualization, data are deeply visualized to understand and find the mean value. Predictive analytics is mainly used in the retail sector for analysing customer behaviour from past purchase history and the pages customers navigated, and hence can increase profit. Optimization, simulation and numeric modelling are the subcategories of prescriptive analytics. They also discussed some massive real-time data applications and the current issues in big data analytics.

Namasudra et al. [3] discussed various features of cloud computing, the advantages and disadvantages of the cloud, and an overview of service and deployment models. They also highlighted issues in the cloud concerning confidentiality, availability, access control, storage and data-related issues, policy issues and several security-related issues such as denial of service attacks, cookie poisoning, migration attacks, encryption attacks, DNS and sniffer attacks, and malware attacks.

Misal and Perumal [4] discussed de-duplication methods for efficiently storing data in the cloud. De-duplication eliminates repeated data when multiple users hold the same file content. In their paper, they proposed a methodology for data-level, file-level and block-level de-duplication of data during upload and file retrieval in the cloud. They experimentally performed the de-duplication process with the Rabin-Karp algorithm and, interpreting their results, concluded that duplication is high for image, text and audio files, while video files have comparatively less duplicate content.
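The block-level idea behind such de-duplication can be sketched as follows: split each file into blocks, fingerprint every block, and store a block only once. The authors use Rabin-Karp hashing; the sketch below is an independent illustration that uses SHA-256 fingerprints and an in-memory dictionary in place of real cloud storage, with invented sample files.

```python
import hashlib

BLOCK_SIZE = 4096          # fixed-size blocks; content-defined chunking is also common
store = {}                 # fingerprint -> block (stands in for cloud block storage)

def dedup_upload(data: bytes) -> list:
    """Split data into blocks and store each unique block only once.
    Returns the list of fingerprints needed to rebuild the file."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)      # blocks already stored are skipped
        recipe.append(fp)
    return recipe

def restore(recipe: list) -> bytes:
    """Rebuild the original file from its block fingerprints."""
    return b"".join(store[fp] for fp in recipe)

if __name__ == "__main__":
    file_a = b"A" * 10000
    file_b = b"A" * 10000            # identical content uploaded by a second user
    r1, r2 = dedup_upload(file_a), dedup_upload(file_b)
    assert restore(r1) == file_a and restore(r2) == file_b
    print(f"blocks stored: {len(store)} instead of {len(r1) + len(r2)}")
```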
Kaur et al. [5] highlighted some energy-efficiency methods for servers, such as server consolidation, Dynamic Voltage Frequency Scaling (DVFS) scheduling, thermal-aware scheduling and workload-aware scheduling. Server consolidation is done by concentrating the workload on some of the physical servers while switching off the rest of the unused servers. DVFS is mainly for lowering power consumption. In thermal-aware scheduling, allocation of servers is done based on the servers' past temperature records. In workload-aware scheduling, the incoming workload is allocated to resources based on the present workload of each resource.

Siddiqa et al. [6] have given a detailed study of the various big data management levels. They also discussed big data mining techniques such as classification and prediction and proposed a method for big data mining and management. Machine learning, statistical analysis, data mining and visual analysis are some of the available big data analysis methods. Big data security concerning privacy, integrity, confidentiality and availability is explained well, and several encryption algorithms used to ensure confidentiality, such as RC4, Triple DES, RC2, Twofish, Blowfish and Rijndael, are tabulated.
CLOUD COMPUTING IN A NUTSHELL
Cloud computing is defined as the on-demand provisioning of software, hardware, infrastructure and other resources based on the pay-per-use model [5]. It is an internet-based model, and hence one should possess a high-speed internet connection to use the cloud. Some of the cloud providers in the market are Amazon Web Services (AWS), Microsoft Azure, Google Cloud, Alibaba Cloud and IBM Cloud [21].

Characteristics of the cloud are as follows:
• Virtualization: running many virtual machines on a single physical system.
• Scalability: the services provided by the cloud vendor can be scaled out and scaled in based on the user's demand.
• Availability: the cloud can provide high availability of data/services with near-zero risk of data loss, as data is replicated across other servers.
• Security: security is the main focus of any cloud provider, as neither the provider nor the user can afford a compromise leading to data leakage.
• Performance: the cloud provides high-performance computation with its distributed, high-speed cluster systems.
• Maintenance: cloud servers can be easily maintained.

There are three main service models in the cloud, namely Software-as-a-Service (SaaS) [7], Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS). Table 1 gives a comparison of the service models, and Figure 1 illustrates the cloud architecture. Cloud services can also be grouped under the umbrella term anything-as-a-service (XaaS).

Figure 1: Cloud computing architecture

Some other services [8] provided by the cloud are listed below:
• Data as a Service (DaaS): a service for providing data to cloud clients through the network.
• Analytics as a Service (AaaS): clients can use this service based on their business needs.
• Model as a Service (MaaS): predefined or customized models are offered as a service for analytics.
• Big Data as a Service (BDaaS): data analysis and future prediction are done by the cloud provider, mainly for business use.

Table 1: Comparison of cloud services

Cloud provider offers:
  SaaS: various software applications to clients who wish to use them without installing them on their local system.
  PaaS: platforms such as operating systems and middleware.
  IaaS: storage, network, hardware and computing capacity.
Responsibility of the user:
  SaaS: developing the application or documentation with the provisioned software and maintaining the data in the application.
  PaaS: installing software and databases on top of the provisioned platform and working on them.
  IaaS: controlling the operating system and maintaining and working with the installed software and databases.
Maintenance by the cloud user:
  SaaS: Low; PaaS: Medium; IaaS: High.
Cost:
  SaaS: Low; PaaS: Medium; IaaS: High.
Flexibility:
  SaaS: Low; PaaS: Medium; IaaS: High.
Examples:
  SaaS: Office 365, Dropbox, Cisco Webex, Salesforce.
  PaaS: Google App Engine, AWS Elastic Beanstalk.
  IaaS: Google Compute Engine, Amazon EC2.

Deployment models of the cloud are private, public, community and hybrid [3].
• Private cloud: A private cloud is solely owned by a single organization. It can be managed and maintained within the organization or by a trusted third party. An example is a university or college that maintains student, faculty and course details and manages examinations and students' marks.
• Public cloud: This type of cloud is fully owned by the cloud service provider (CSP), who takes care of storage, network and servers. The client pays for the resources they use. For example, some businesses keep their data in a public cloud because its heavy volume cannot be accommodated in their own organization due to a lack of systems, servers, etc. The responsibility of the CSP is regular backup and the availability, consistency, accuracy and security of the data.
• Community cloud: Some organizations or institutions may have the same set of goals and requirements, which can be shared through a community cloud.
• Hybrid cloud: This type of cloud is a combination of the private and public cloud. More sensitive data may be stored in the private part and other data in the public part. There should be a central managing authority to control the flow of data and communication between the private and public clouds through secured access control methods.

BIG DATA
Any data that is beyond the processing capability of RDBMS databases is referred to as big data [9]. Due to its ever-increasing volume, it cannot be accommodated on a single machine. Big data is generated from many digital sources, including the internet, e-mails, mobile phones, YouTube, WhatsApp, GPS, RFID tags, blogs, social networks and IoT sensors. The features of big data are volume, variety, velocity, veracity and variability, referred to as the 5V's [2]. Volume refers to the amount of data collected from various data sources (measured in terms of petabytes and zettabytes). Variety refers to data types such as text, audio, video, web and activity logs, which come in structured, semi-structured or unstructured form [22]. Velocity refers to the speed or rate at which data gets generated. Veracity is the accuracy or truthfulness of the data. Variability refers to inconsistency in data flow concerning time, demand, activities, events, etc.

Big data technologies
Several big data technologies have been developed for analysing large datasets. In this section, we list some essential big data technologies.

Hadoop
Hadoop is an open-source framework [10] designed for a scalable, reliable and distributed computing environment. It is a Java-based framework that supports the processing of large data sets. Hadoop utilizes clusters of nodes (servers) for processing several terabytes of information. The Hadoop Distributed File System (HDFS) is a distributed file system designed mainly for applications with large data sets. HDFS follows a master/slave framework [11]. An HDFS cluster typically has one Namenode and several Datanodes. The Namenode serves as the master node that regulates clients' access to files; it manages files and directories through operations such as opening, closing and renaming. A Datanode is located in each node of the cluster and takes care of the storage on the node it runs on; it also performs file block creation, replication and deletion based on instructions from the Namenode. Hadoop is built around the open-source programming model called MapReduce.

MapReduce is a software framework [12] for writing applications that process vast quantities of data in parallel on clusters. As the name implies, it has two main jobs, namely the map and reduce jobs. Map tasks run in parallel, and the output of the map stage is given as input to the reduce job; the framework is also responsible for scheduling these tasks (a minimal word-count sketch of this flow is given below).
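To make the map, shuffle and reduce stages concrete, the following word-count sketch imitates the MapReduce flow in plain Python. It is only an illustration of the programming model, not Hadoop's own Java API; the function names and sample documents are invented.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key (here, sum the counts)."""
    return {word: sum(counts) for word, counts in grouped.items()}

if __name__ == "__main__":
    docs = ["big data needs the cloud", "the cloud scales with big data"]
    word_counts = reduce_phase(shuffle_phase(map_phase(docs)))
    print(word_counts)   # e.g. {'big': 2, 'data': 2, 'the': 2, ...}
```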
Another data management tool provided by Hadoop is HBase [13], a distributed, column-oriented database that can serve as a high-performance data warehouse on less expensive hardware, enabling HBase to respond to the current demands of cloud storage and internet applications.
Spark
Spark is an advanced framework for data processing and analytics of huge datasets on clusters and is an alternative to MapReduce. With Spark, parallel processing can be done very fast using in-memory primitives. Spark provides APIs in Java, Scala and Python, and it is an integral part of the SMACK stack for providing PaaS services for predictive analytics and real-time personalization processing [14] for big data.
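As a brief illustration of Spark's in-memory style (assuming a local PySpark installation; the sample sentences are invented), the same word count can be expressed with Spark's Python API:

```python
from pyspark.sql import SparkSession

# Start a local Spark session (assumes the pyspark package is installed).
spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()

lines = spark.sparkContext.parallelize(
    ["big data needs the cloud", "the cloud scales with big data"]
)

# Classic map/reduce pipeline expressed with Spark's in-memory RDD API.
counts = (lines.flatMap(lambda line: line.lower().split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())   # e.g. [('big', 2), ('data', 2), ...]
spark.stop()
```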
NoSQL (Not Only SQL)
Relational database management systems provide limited capabilities for the storage and linkage of big data, and the RAID mechanisms of RDBMSs are not enough for the unstructured and varied nature of big data. NoSQL databases meet the big data criteria with high tolerance, performance, accuracy and scalability compared with conventional database resources. Another reason for the popularity of NoSQL databases is the flexible nature of their data models. There are four main NoSQL data models [15], explained below; Figure 2 shows the typical data models of NoSQL.
1. Key-value stores: In a key-value store, each value is associated with a key, and values are retrieved using their key. Query retrieval speed is comparatively high compared with relational database systems. Examples are Redis and Flare.
2. Document stores: In this type of store, each key is associated with a document instead of a single value. The document does not follow any schema and can contain any data types. It is mainly used for big data storage and better query performance. Examples are MongoDB and CouchDB.
3. Column-oriented stores: Here data are stored as tables, each table has several records, and each record is identified by a key. Although this follows the traditional storage method, data compression and parallel processing are better. Examples are HBase, Cassandra and Bigtable.
4. Graph databases: Derived from graph theory, a graph database uses a graph as its data model. The graph follows the mathematical concept of vertices and links: vertices represent a set of objects, and the links join these objects in the database. This is a completely different model compared with those described above. It is suitable for social network applications, dependency analysis and pattern recognition [16]. For instance, in a social network such as Facebook it can be used to connect the friends of a particular person and the friends of those friends, and friend suggestions can be made on this basis (a small sketch of such a friends-of-friends query follows).


Figure 2: NoSQL data models

IMPORTANCE OF CLOUD IN BIG DATA MANAGEMENT


Cloud computing and big data are interconnected [17] in several aspects. Big data management is a very complex process, as the data originates from heterogeneous sources. Big data management comprises various levels [6], such as data transmission, data storage, data pre-processing, indexing, classification and prediction, and finally decision making. Figure 3 illustrates the various levels of big data management. In this section, we present the challenges faced in managing big data and how cloud computing helps to address them.

1. Data transmission
Data transmission consists of stages [1] such as (i) collection of data from data sources such as sensors and RFID tags, and (ii) integration of data collected from various data centres located in different geographic locations. Data compression methods are employed to reduce data size before network transmission and cloud storage. Some of the data compression techniques suggested by [18] are lossless compression, null compression, run-length compression, adaptive Huffman coding, Lempel-Ziv algorithms, DCT (discrete cosine transform) and spatiotemporal compression.
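As a minimal sketch of lossless compression before transmission (zlib's DEFLATE combines Lempel-Ziv matching with Huffman coding, two of the techniques listed above; the sample payload is invented):

```python
import zlib

# Sample sensor-style payload; in practice this would be a batch of readings.
payload = b'{"sensor": "rfid-gate-7", "reading": 21.5}\n' * 1000

compressed = zlib.compress(payload, level=9)   # lossless DEFLATE (LZ77 + Huffman)
restored = zlib.decompress(compressed)

assert restored == payload                     # nothing was lost
print(f"original: {len(payload)} bytes, compressed: {len(compressed)} bytes")
```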
2. Data storage
Data storage is a challenge with big data due to its volume, velocity and variety. As traditional databases support schema-based tables, accommodating big data in traditional storage is problematic. Designing a flexible and efficient database system is challenging [19] for the following reasons:
• Scalability
• Reliability
• Persistent storage
• Efficiency
• Cost of maintenance

The cloud can address all of the above issues. Generally, cloud servers are expanded horizontally. Horizontal scalability means adding several nodes/servers to a cluster to increase storage quickly based on the requirements of the business. For instance, AWS provides three types of storage [20] to address big data issues: object storage (e.g. Amazon S3), file storage (Amazon Elastic File System, EFS) and block storage (e.g. Amazon Elastic Block Store, EBS).
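As a brief sketch of how a compressed batch might be placed in object storage (assuming the boto3 SDK is installed, AWS credentials are configured, and a bucket with the hypothetical name my-bigdata-bucket already exists; the file and key names are invented):

```python
import boto3

# Client for the S3 object-storage service; credentials are taken from the
# environment or the standard AWS configuration files.
s3 = boto3.client("s3")

# Upload a local, already compressed batch file as an object.
s3.upload_file(
    Filename="readings-2020-06-01.json.gz",   # local file (hypothetical)
    Bucket="my-bigdata-bucket",                # must exist beforehand
    Key="raw/readings-2020-06-01.json.gz",     # object key inside the bucket
)

# List what is stored under the raw/ prefix.
response = s3.list_objects_v2(Bucket="my-bigdata-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```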

Figure 3: Big data management

3. Data pre-processing
Data collected from data sources has to be verified before processing. There is a chance of incomplete and inconsistent data, which may lead to wrong analysis. Hence it is important to clean the data before processing. In data cleansing, inaccurate, noisy and missing data are eliminated from the dataset.
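A minimal cleansing sketch using pandas (assuming the library is installed; the sample readings are invented) removes missing values and filters an obviously noisy record:

```python
import pandas as pd

# Invented raw readings: one missing value and one physically impossible outlier.
raw = pd.DataFrame({
    "sensor": ["s1", "s2", "s3", "s4"],
    "temperature": [21.4, None, 22.1, 999.0],
})

clean = (raw.dropna(subset=["temperature"])      # remove missing readings
            .query("0 <= temperature <= 60")     # drop noisy/inaccurate values
            .reset_index(drop=True))

print(clean)
```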


4. Data processing
Big data processing requires dedicated high-speed computing resources with increased CPU speed, storage and network capacity. The cloud offers virtually unlimited resources that can be provisioned on demand.
5. Data analytics
Data analytics is the process of extracting hidden, useful information and patterns by exploring big data. With the identified patterns, organizations can make decisions. Big data analytics can be classified [2] as:
(i) Descriptive analytics: Here, historical data are mined to find potential patterns. This analytics typically answers the question "What has happened?". Examples are finding customer buying preferences, the most frequently sold products, etc.
(ii) Predictive analytics: With this analytics, organizations can understand what has happened in the past and ask "What will happen in the future?". This is what every organization would like to know, as it is useful for predicting the future. Examples are sales forecasting, weather forecasting and, now, COVID-19 prediction (a small forecasting sketch is given at the end of this subsection).
(iii) Prescriptive analytics: The question "How can we make it happen?" is answered with prescriptive analytics. It is mainly used in the healthcare sector.
Some of the big data analytics tools provided by various cloud vendors are Oracle Analytics Cloud, Cloudera, Zoho Analytics and Microsoft Power BI.
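As a toy predictive-analytics sketch (assuming scikit-learn is installed; the monthly sales figures are invented), a linear regression is fitted to past sales and used to project the next month:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented monthly sales history: month index -> units sold.
months = np.arange(1, 13).reshape(-1, 1)          # features: months 1..12
sales = np.array([110, 115, 123, 130, 128, 140,
                  145, 150, 158, 160, 170, 175])  # target: units sold

model = LinearRegression().fit(months, sales)

# Predict sales for month 13 (the "what will happen" question).
next_month = model.predict(np.array([[13]]))
print(f"forecast for month 13: {next_month[0]:.0f} units")
```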
6. Data security
As big data is stored on remote servers, it poses security challenges, because the data owner has limited control over the physical (virtual) storage. Also, the data owner does not know where the physical server is located. Hence the cloud provider should ensure the security of clients' data, as neither the cloud service provider nor the consumer can compromise on any security breach. Some of the mechanisms to ensure security are proper access controls, encryption methods and monitoring tools. Some of the AWS tools that provide data security are KMS, CloudHSM and Vault, and some of the monitoring tools are CloudWatch, CloudTrail, Splunk, etc.
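One common client-side safeguard is encrypting records before they ever reach the remote server. The sketch below uses the Fernet symmetric scheme from the third-party cryptography package purely as an illustration; it is not tied to any particular provider's key-management service, and the sample record is invented.

```python
from cryptography.fernet import Fernet

# The key would normally live in a key-management service (e.g. AWS KMS),
# never alongside the data; here it is generated locally for the example.
key = Fernet.generate_key()
cipher = Fernet(key)

record = b'{"customer_id": 42, "card_last4": "1234"}'   # invented sample record

token = cipher.encrypt(record)        # what actually gets uploaded and stored remotely
restored = cipher.decrypt(token)      # only possible with the key

assert restored == record
print(token[:40], b"...")
```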
CONCLUSION AND FUTURE OPPORTUNITIES
This paper gave an overview of the use of a cloud environment for big data management. We discussed various technologies and tools available in the market for both the cloud and big data, as well as the advantages of, and challenges faced by, big data on the cloud. Most sectors, such as healthcare, social networks, the geospatial sector, agriculture, banking and entertainment, have already adopted big data technologies. Using cloud computing to implement big data technologies lowers the burden of in-house computing power and maintenance for companies. Industries that focus more on business and customers can opt for cloud computing technologies to store their data, as they need not worry about the maintenance of data and resources. Even with advancements in technology, there are a few areas where big data and the cloud still face challenges. Cloud data security, lack of standardization among cloud service providers, and scheduling and energy efficiency of cloud resources are some areas of concern. Data integration across the various nodes of a cluster, indexing and searching of data, and de-duplication of data to solve storage issues remain open problems for big data.

In future, research can be carried out on the above-mentioned challenges to bring better-proven solutions. It would be very much appreciated if IT professionals and researchers explored new and efficient ideas in big data management and cloud computing for a better future.

REFERENCES
1. C. Yang, Q. Huang, Z. Li, K. Liu, and F. Hu, "Big Data and cloud computing: innovation opportunities and challenges," Int. J. Digit. Earth, vol. 10, no. 1, pp. 13–53, 2017.
2. S. Pyne, B. L. S. Prakasa Rao, and S. B. Rao, "Big data analytics: Methods and applications," Big Data Anal. Methods Appl., pp. 1–276, 2016.
3. S. Namasudra, P. Roy, and B. Balusamy, "Cloud computing: Fundamentals and research issues," Proc. 2017 2nd Int. Conf. Recent Trends Challenges Comput. Model. (ICRTCCM 2017), pp. 7–12, 2017.
4. R. Misal and B. Perumal, "Data deduplication for efficient cloud storage and retrieval," Int. Arab J. Inf. Technol., vol. 16, no. 5, pp. 922–927, 2019.
5. A. Kaur, V. P. Singh, and S. Singh Gill, "The future of cloud computing: Opportunities, challenges and research trends," Proc. Int. Conf. I-SMAC (IoT Soc. Mobile, Anal. Cloud), I-SMAC 2018, pp. 213–219, 2019.
6. A. Siddiqa et al., "A survey of big data management: Taxonomy and state-of-the-art," J. Netw. Comput. Appl., vol. 71, pp. 151–166, 2016.
7. W. Iftikhar, "A study on cloud computing issues and challenges in higher education institutes of Middle Eastern countries," vol. 6, no. October, pp. 894–902, 2018.
8. N. Zanoon, A. Al-Haj, and S. M. Khwaldeh, "Cloud computing and big data: Is there a relation between the two?," vol. 12, no. 17, pp. 6970–6982, 2017.
9. R. Nambiar, R. Bhardwaj, A. Sethi, and R. Vargheese, "A look at challenges and opportunities of Big Data analytics in healthcare," Proc. 2013 IEEE Int. Conf. Big Data, pp. 17–22, 2013.
10. "Apache Hadoop," Apache. [Online]. Available: https://hadoop.apache.org/. [Accessed: 25-May-2020].
11. "HDFS Architecture Guide," Apache. [Online]. Available: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html. [Accessed: 25-May-2020].
12. "MapReduce Tutorial," Apache. [Online]. Available: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html. [Accessed: 25-May-2020].
13. Y. Bao, L. Ren, L. Zhang, X. Zhang, and Y. Luo, "Massive sensor data management framework in cloud manufacturing based on Hadoop," IEEE Int. Conf. Ind. Informatics, pp. 397–401, 2012.
14. S. Ullah, M. D. Awan, and M. Sikander Hayat Khiyal, "Big Data in Cloud Computing: A Resource Management Perspective," Sci. Program., vol. 2018, 2018.
15. J. Han, E. Haihong, G. Le, and J. Du, "Survey on NoSQL database," Proc. 2011 6th Int. Conf. Pervasive Comput. Appl. (ICPCA 2011), pp. 363–366, 2011.
16. K. Grolinger, W. A. Higashino, A. Tiwari, and M. A. M. Capretz, "Data management in cloud environments: NoSQL and NewSQL data stores," J. Cloud Comput., vol. 2, no. 1, 2013.
17. S. Mehla, A. Chaudhary, and R. Kumar, Recent Advances in Computational Intelligence, vol. 823. Springer International Publishing, 2019.
18. S. Kaur and A. Kaur, "A review on data compression techniques in cloud computing," Int. J. Comput. Eng. Res. Trends, vol. 351, no. 5, pp. 2349–7084, 2015.
19. H. N. Dai, R. C. W. Wong, H. Wang, Z. Zheng, and A. V. Vasilakos, "Big data analytics for large-scale wireless networks: Challenges and opportunities," ACM Comput. Surv., vol. 52, no. 5, 2019.
20. E. C. P. Chu and J. T. H. Wong, "Subsiding of dependent oedema following chiropractic adjustment for discogenic sciatica," European Journal of Molecular and Clinical Medicine, vol. 5, pp. 12–15, 2018. DOI: 10.5334/ejmcm.250.
21. B. Mahalakshmi and G. Suseendran, "Effectuation of secure authorized deduplication in hybrid cloud," Indian Journal of Science and Technology, vol. 9, no. 25, pp. 1–7, 2016.
22. J. S. T. M. Poovarasi, S. Srinivasan, and G. Suseendran, "Optimization-based effective feature set selection in big data," in Intelligent Computing and Innovation on Data Science, pp. 49–58. Springer, Singapore, 2020.
