
Insider Attacks in a Non-secure Hadoop Environment

Pedro Camacho1, Bruno Cabral2,3, and Jorge Bernardino1,2,3

1 ISEC – Superior Institute of Engineering of Coimbra,
Polytechnic of Coimbra, 3030-190 Coimbra, Portugal
{a21220563,jorge}@isec.pt
2 CISUC – Centre of Informatics and Systems of the University of Coimbra,
3030-290 Coimbra, Portugal
3 FCTUC – University of Coimbra, 3030-290 Coimbra, Portugal
bcabral@dei.uc.pt

Abstract. Security is not one of the key features of big data platforms, and data
security in these systems was not designed in from the start. Though it is of utmost
importance, in most systems not even the most basic security mechanisms are
enabled and configured. Big data systems store and process millions of confidential
records from people all over the world: credit cards, addresses, health data, financial
data, etc. Apache Hadoop, one of the most popular big data platforms, stores
significant amounts of data that is also subject to attacks. The main problem is that,
by default, these platforms do not have active security and there is no valid and
reliable authentication model, which makes them vulnerable to internal attacks. In
this paper, we assess the importance of security mechanisms and how they are
currently configured on big data platforms. We also evaluate the impact of
encryption mechanisms on these platforms.

Keywords: Security · Insider attacks · Big data · Authentication · Hadoop

1 Introduction

The exponential growth of data in every aspect of our lives and in enterprises across
the world demands that we draw value from data. In 2013, five exabytes of data were
created each day, in various sizes and formats, from sensors, individual archives, social
networks, the IoT (Internet of Things), and companies [1]. One of the most challenging
issues is how to effectively manage such a large amount of data and identify new ways
to analyze it to extract value. Big Data technologies are a step forward in handling this
problem. An early version of the big data concept was described in 2001 in the Gartner
report by Laney [2], where big data was defined as large and complex data sets that the
computing facilities of the time were not able to handle. It is characterized by the 3Vs
(Volume, Velocity, and Variety). Additionally, some organizations have added new Vs
to further define big data; characteristics such as "Veracity" and "Value" [3] broadened
the characterization of big data. With the popularity of these systems, the repositories
are increasingly likely to store sensitive data and, as usual, we need to secure it
properly. There is no doubt that new frameworks to analyze data can provide a robust
foundation for a new generation of analytics and

© Springer International Publishing AG 2017


Á. Rocha et al. (eds.), Recent Advances in Information Systems and Technologies,
Advances in Intelligent Systems and Computing 570, DOI 10.1007/978-3-319-56538-5_54
perception, but it is important to consider security before launching or expanding a big
data platform. The complexity and variety of these systems demand a comprehensive
approach to the security of the entire big data system [4]. Hadoop systems are insecure
by default, and customers deploy them quickly without proper controls; this can
provoke serious errors that lead to an organizational disaster. Such systems are
particularly exposed to insider attacks. The aim of this paper is to analyze whether big
data system administrators are concerned with the security and privacy of their users'
data. For this, we present the results of a survey aimed at big data administrators, with
questions that allow us to draw conclusions about the state of security in these systems.
Additionally, we provide foundations for security on big data platforms, in particular
Apache Hadoop, and show what an insider attacker can do with access to a network
hosting a non-secure Hadoop cluster. The structure of this paper is organized as
follows. Section 2 discusses related work on security in big data platforms. Section 3
describes the Apache Hadoop platform and its security model. Section 4 presents
attacks that can be performed by an insider in a non-secure Hadoop environment.
Section 5 discloses the results of a survey on which platforms big data administrators
use and whether security is configured appropriately. Section 6 presents the results of
benchmark tests performed to evaluate the performance impact of enabling encryption.
These results help us understand the cost of such security measures. Finally, Sect. 7
concludes the paper and proposes future work.

2 Related Work

There are numerous publications on security in big data systems, and this number
tends to grow quickly. The core justification for this interest is the fact that security
was not a priority when these systems were created. Some papers focus on broader
topics, such as security and privacy issues of big data [5] and insider attacks applied to
cloud systems [6]. More directly related to big data systems security, [7] proposes a
new hardware security framework for mitigating insider attacks in big data systems.
The main goal of the framework is to detect insider attacks in big data systems using
dedicated hardware on the nodes of the cluster. One of the main concerns today
involves the security and protection of sensitive information. Security breaches from
internal attackers are on the rise [8]. These attackers can have access to critical
information and corrupt, eliminate, or take it from the cluster. Organizations that have
not properly controlled access to their data sets face lawsuits, negative publicity, and
regulatory fines if data is exposed outside. Without ensuring that proper security
controls are in place, big data can easily become a big problem. According to the
Ponemon Institute, the average cost of a data breach to the targeted company in 2016
was $4 million, which represents a 29% increase in the total cost of a data breach since
2013, and the average cost per lost or stolen record is $158 [9]. The risk of a leak in
these big data environments imposes additional responsibilities. There is a set of
national and international regulations that must be fulfilled when it comes to how
enterprises secure and analyze customer data [4]. In addition to these hard costs
associated with data breaches and security compliance, enterprises also have to
contend with damage to brand reputation when a major security incident occurs. The
more data is stored and processed, the more important it is to protect it. This means it
is not sufficient to provide effective security controls on data leaving our networks;
we must also control access to data within our networks. The usage of big data
platforms is increasing, and so are the threats. Security and privacy issues are
magnified by the volume, variety, and velocity of big data. The diversity of data
sources, formats, and data flows, combined with the streaming nature of data
acquisition and high volume, creates unique security risks [10]. Apache Hadoop offers
a range of configuration options for securing the platform; however, by default, all
authentication, authorization and encryption are disabled, as will be described in the
next section.
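As an illustration, the insecure defaults correspond to the following core-site.xml settings (a sketch based on the Apache Hadoop documentation; an out-of-the-box installation behaves as if these values were set):

```xml
<!-- core-site.xml: out-of-the-box (non-secure) Hadoop defaults -->
<configuration>
  <!-- "simple" means the client-supplied username is trusted as-is -->
  <property>
    <name>hadoop.security.authentication</name>
    <value>simple</value>
  </property>
  <!-- service-level authorization checks are disabled -->
  <property>
    <name>hadoop.security.authorization</name>
    <value>false</value>
  </property>
</configuration>
```

With "simple" authentication, any client that can reach the namenode can claim any identity, which is precisely what the attacks in Sect. 4 exploit.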

3 Apache Hadoop

Apache Hadoop is a big data platform that allows users to store huge amounts of data
in safe and reliable storage and to run complex queries efficiently [11]. Computation
and storage run on the same set of servers in Hadoop, and both functions are fault
tolerant, which means that a high-performance cluster can be built from commodity
hardware. Hadoop is an open-source Apache project, a stable and mature product that
has become amazingly popular. One of the key features of Hadoop is its capacity to
scale: as data grows or greater computational power is needed, the cluster can be
expanded by adding more servers. Hadoop is one of many big data tools, and even
currently trending favorites such as Apache Spark and Apache Storm have not
displaced it.
Hadoop has evolved into a highly scalable, flexible big data platform supporting
many formats of data processing workloads and analytics-focused, data-centric
applications. Security was not a priority when Hadoop was originally developed.
Originally, Hadoop was a single-application platform accessed by a select few trusted
users. At its inception, Hadoop lacked even the most basic security capabilities, such
as user authentication and basic permission controls. Since then, as Hadoop matured
to handle new types of workloads and applications, many involving sensitive data, a
number of security capabilities have been developed by both the open source and
vendor communities [12]. When Hadoop is installed, it works in non-secure mode,
which means it considers the network trusted and the Hadoop client uses the local
username. Additionally, Hadoop and HDFS have no strong security model: in
particular, the communication between clients and datanodes is not encrypted, and all
files are stored in clear text and controlled by the namenode. To address these
problems, some mechanisms have been added to Hadoop. For instance, to provide
strong authentication, Hadoop can be secured with Kerberos, providing mutual
authentication and protection against attacks such as eavesdropping and replay attacks
(see Fig. 1). When security is enabled and Kerberos is configured, a Kerberos principal
represents a unique identity in the Kerberos-secured system. Kerberos assigns tickets
to principals to enable them to access Kerberos-secured Hadoop services [13].
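Enabling Kerberos is done in core-site.xml; the following is a minimal sketch (property names from the Hadoop Secure Mode documentation; realm, per-service principals and keytab locations are omitted here and must also be configured for each daemon):

```xml
<!-- core-site.xml: switch from "simple" to Kerberos authentication -->
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <!-- also enforce service-level authorization checks -->
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
```

In addition, each service (namenode, datanode, resource manager) needs its own Kerberos principal and keytab declared in the corresponding hdfs-site.xml or yarn-site.xml properties.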

Fig. 1. Services with (dotted line) and without (solid line) Kerberos enabled.

4 Hadoop Authentication Attacks

This section describes some internal attacks that can be applied to a Hadoop cluster
working with default settings. Arguably the single most damaging threat category is
the insider threat. As the name implies, the attacker comes from inside the organization
and is a regular user who has access to the network where the Hadoop cluster is
located. An insider threat represents a significant concern because the attacker already
has internal access to the system (e.g. a developer inside the organization running a
virtual machine). The attacker can log in with valid credentials, getting authorization
from the system to perform operations and passing any number of security checks
along the way. This can result in a blatant attack on a system, or something much
subtler, like the attacker leaking sensitive data to unauthorized users [6]. With the
weak security that Hadoop provides by default, it is very easy to get unauthorized
access with simple attacks. In this section, we demonstrate how easy it is to hack a
Hadoop cluster and illustrate that it is practically impossible to protect any data in
Hadoop clusters without configuring security, and in particular without Kerberos
enabled.

4.1 Environment Setup


To show the vulnerability of these clusters without security enabled, we built the setup
described in Table 1. Each cluster ran as a single node and was used by different users;
the installation process is omitted.

Table 1. Configuration of each cluster, with the users and IP address of each Hadoop cluster.

Cluster name   Users         IP
Cluster1       Oliver, Mia   10.10.10.1
Cluster2       Jack, Eva     10.10.10.2

Each user has their own home folder in HDFS (Hadoop Distributed File System)
protected by access lists (0700), and a secret file which must not be accessible to
anyone but the owner, as shown in Table 2.

Table 2. Description of files for each user.

File owner  File name          Content             Location                Permissions (ACL)
Oliver      Oliver_secret.txt  Oliver secret file  Cluster 1:/user/oliver  rw------- (0600)
Mia         Mia_secret.txt     Mia secret file     Cluster 1:/user/mia     rw------- (0600)
Jack        Jack_secret.txt    Jack secret file    Cluster 2:/user/jack    rw------- (0600)
Eva         Eva_secret.txt     Eva secret file     Cluster 2:/user/eva     rw------- (0600)

Hadoop provides WebHDFS, a simple, standard way to execute Hadoop filesystem
operations from an external client that does not necessarily run on the Hadoop cluster
itself. The only requirement for WebHDFS is that the client has a direct connection to
the namenode and datanodes via the predefined ports. With this, it is possible to
perform authentication and authorization attacks through a REST API when security
is not enabled. It is also possible to access a Hadoop cluster by means of the Hadoop
client.

4.2 REST API Based Attack


When security is off, the authenticated user is the username specified in the user.name
query parameter when the REST API is used to connect to the namenode. If the
user.name parameter is not set, the server may either set the authenticated user to a
default web user, if there is one, or return an error response. Against a Hadoop cluster
without security activated, it is possible to perform the following curl requests to access
the private information of users in the cluster.
General form of a curl request to the namenode:

curl -i "http://<namenode address>:<port>/webhdfs/v1/<path>?op=<operation>&user.name=<victim username>"

Access the cluster1 namenode through curl, listing the /user folder:

curl -i "http://10.10.10.1:50070/webhdfs/v1/user?op=LISTSTATUS"

Access the cluster1 namenode and list the /user/mia directory with the username set to mia:

curl -i "http://10.10.10.1:50070/webhdfs/v1/user/mia?op=LISTSTATUS&user.name=mia"

Access the cluster1 namenode to read the secret file of user Oliver. The -L flag makes
curl follow the namenode's redirect to the datanode that serves the file content:

curl -L -i "http://10.10.10.1:50070/webhdfs/v1/user/oliver/oliver_secret.txt?op=OPEN&user.name=oliver"

4.3 Client Based Attack


For this attack, we use a virtual machine with IP 10.10.10.3 and the Hadoop client
installed, where we set up an environment variable with the victim's username.
General form of a command to access HDFS in Hadoop:

hdfs dfs -<command> <path>

List the /user directory:

hdfs dfs -ls /user

In this case, an attacker not belonging to the cluster needs to add the "-fs" parameter
to this command, pointing to the namenode under attack; the default namenode port is
8020. We use the same IP to connect to the namenode, but on this port.
List the /user and /user/eva directories in cluster2:

hdfs dfs -fs hdfs://10.10.10.2:8020 -ls /user
hdfs dfs -fs hdfs://10.10.10.2:8020 -ls /user/eva

When performing the commands described above, we are not able to access restricted
files in the cluster. Through this form of attack, it is not possible to use the user.name
parameter, since it applies only to Web access. However, there are two other ways to
get around this restriction and obtain the desired result. All commands and operations
are performed from a PC with a virtual machine that can be anywhere on the network
and does not belong to any of the targeted clusters.

Environment Variable. An environment variable allows an insider attacker with a
Hadoop client to access other users' files. The HADOOP_USER_NAME environment
variable is used to determine the user running the process, and the server responds as
if the request came from that user.
Export the HADOOP_USER_NAME environment variable with the username mia
and read the private file mia_secret.txt:

export HADOOP_USER_NAME=mia
hdfs dfs -fs hdfs://10.10.10.1:8020 -cat /user/mia/mia_secret.txt

Local OS User. In this scenario, an insider attacker can simply create a local operating
system account and use it to connect to the Hadoop cluster. This operation requires
root privileges on the local system to create users. By default, the Hadoop client takes
the current username from the operating system and passes it to the server. With this,
we can create the victim username (oliver, jack or eva) and run Hadoop client
commands against the victim cluster.
Commands using the Hadoop client and a local username to access the namenode and
read the restricted file of Jack in cluster2:

sudo useradd <victim username>
su <victim username>
hdfs dfs -fs hdfs://10.10.10.2:8020 -cat /user/jack/jack_secret.txt

5 Susceptibility to Insider Attacks

To understand how the administrators of big data systems handle security, we
conducted a survey. In this survey, we asked big data administrators which platforms
they use, whether more than one user has access to the platform, whether security has
been enabled and configured, and, if not, whether they are planning to configure it.
Fifteen people responded to this survey. The survey was composed of four mandatory
questions. In the first question, we asked users "Which platform do you use?", with the
possible answers "Hadoop", "Spark", "Storm" and "Other"; the results are represented
in Fig. 2(a). In the second question, we asked administrators "Is your platform used by
more than one user?", with the possible answers "Yes" or "No"; the results are
represented in Fig. 2(b). In the third question, we asked administrators "Have you
enabled and configured the security on your platform?", with the possible answers
"Yes" or "No"; the results are represented in Fig. 2(c). In the fourth question, we asked
administrators "Are you planning to configure security?", with the possible answers
"Yes", "No" and "Already configured"; the results are represented in Fig. 2(d).

Fig. 2. Results of the survey: (a) big data platforms used by professionals; (b) big data
platform used by more than one user; (c) administrators who configured the security;
(d) whether security on big data systems is activated.

Apache Hadoop is the most used big data platform among the surveyed administrators,
but Spark has grown enormously in recent years [14]. Most respondents claim their
Hadoop cluster is used by more than one user, which makes the internal attacks
demonstrated in the previous section possible. A cluster used by more than one user is
subject to more vulnerabilities, because each user can potentially attack the security of
the others. Of those who responded to the survey, almost half say they do not have
security activated and configured, which entails serious consequences for the privacy
and data security of each user. Finally, after explaining to users the seriousness of not
having security activated on big data platforms, we asked them if they plan to activate
and configure security. As a result, we obtained a majority of positive responses,
stating that they will proceed to configure security.

6 Performance Impact of Security Mechanisms

Another aspect of Hadoop security that is still evolving is the protection of data
through encryption and other confidentiality mechanisms. In a trusted network, it was
assumed that data was inherently protected from unauthorized users because only
authorized users were on the network. Since then, Hadoop has added encryption for
data transmitted between nodes, as well as for data stored on disk. This section
presents the results of tests performed with the various encryption algorithms
available for data transferred between Hadoop services and clients. A master and two
slaves with the configuration presented in Table 3 were used to perform the
benchmarks.

Table 3. Configuration of each machine in the cluster used for benchmark tests.

Component  Description
CPU        Intel(R) Core(TM) i3-2100 CPU @ 3.10 GHz
RAM        4 GB
HDD        500 GB
Network    Realtek Semiconductor Co., Ltd. RTL8111/8168/8411
OS         Linux CentOS 6 x64

Each test with a different encryption type is represented by a different color, identified
in the legend on the right side. It was assumed that the maximum value for the
coefficient of variation of each set of tests could not exceed 7%.
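For reference, the data transfer encryption mechanisms under test are enabled in hdfs-site.xml roughly as follows (a sketch based on the Hadoop 2.x documentation: 3DES and RC4 are selected via dfs.encrypt.data.transfer.algorithm, while the AES variants are selected via the cipher suite and key length properties):

```xml
<!-- hdfs-site.xml: encrypt block data transferred between clients and datanodes -->
<configuration>
  <property>
    <name>dfs.encrypt.data.transfer</name>
    <value>true</value>
  </property>
  <!-- legacy algorithms: "3des" or "rc4" -->
  <property>
    <name>dfs.encrypt.data.transfer.algorithm</name>
    <value>rc4</value>
  </property>
  <!-- when set, AES/CTR is negotiated instead of the legacy algorithm -->
  <property>
    <name>dfs.encrypt.data.transfer.cipher.suites</name>
    <value>AES/CTR/NoPadding</value>
  </property>
  <!-- AES key length: 128, 192, or 256 -->
  <property>
    <name>dfs.encrypt.data.transfer.cipher.key.bitlength</name>
    <value>128</value>
  </property>
</configuration>
```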
Taking into consideration the results presented in Fig. 3, it is possible to understand
the performance penalty imposed on a Hadoop cluster by the various encryption
mechanisms available to protect the transfer of data blocks between the datanodes. As
a basis for comparison, the first column of each type of benchmark with each data
size, in light blue, shows the average completion time of the tests without encryption.
The orange, gray, yellow, dark blue and green bars then show 3DES, RC4, AES-128,
AES-192 and AES-256 encryption, respectively. Over each result bar with active
encryption, the penalty value is displayed against the same type of benchmark and
data size without encryption.
In all the tests performed, 3DES was the encryption that most influenced the
performance of the cluster, obtaining the highest completion times, with a penalty that
varied between 1.22 and 3.90 times and an average penalty of 2.19 times. 3DES
encryption implies that, on average, a job takes more than twice as long as without
encryption. The encryption algorithm with the lowest performance impact is RC4,
with a penalty that ranged between 1.05 and 1.28 times and an average penalty of 1.14
times relative to not encrypting the data. Considering the average results for the AES
encryption types (128, 192 and 256 bits), it is possible to state that the performance
penalty grows as the size of the encryption key increases. AES-128 obtained a penalty
that varied between 1.08 and 1.36 times. AES-192 obtained a greater penalty than
AES-128, ranging between 1.08 and 1.40 times. Finally, AES-256 obtained a greater
penalty than the RC4, AES-128 and AES-192 encryption types, varying between
1.09 and 1.46 times compared to the baseline without encryption. Although the
differences are minimal for small data sizes, a greater impact on the test completion
times is observed as the data size increases, and this holds across the various types of
micro-benchmark.
Finally, although RC4 is the encryption algorithm with the smallest impact on the
completion times of each test, it is a very old and simple algorithm, vulnerable to
several kinds of stream cipher attacks. Regarding AES, there are targeted attacks on
AES-256 [15] and AES-192 that do not exist for AES-128; nevertheless, AES-256
remains an excellent alternative to 3DES and RC4, although slower than AES-128 and
AES-192 [16, 17]. The final conclusion is that the AES-128 algorithm is adequate to
protect the data transferred between the datanodes, fulfilling its function of
guaranteeing data security without an excessive penalty on the performance of the
cluster. In the benchmarks with greater volumes of data, a more uniform distribution
of completion times was observed, yielding a more regular and perceptible penalty for
each of the encryptions used.

Fig. 3. Average completion and penalty times for each type of micro-benchmark.

7 Conclusions and Future Work

Big data systems provide a resource for organizations to improve business efficiency.
But big data systems rely on third-party technology to secure data; for instance, there
is no authentication in Hadoop without Kerberos. To efficiently protect Hadoop,
administrators have to enable mechanisms for authentication, authorization and
encryption. Data must be stored in a secure fashion to prevent unauthorized viewing,
tampering, or deletion. Data must also be protected in transit, because in distributed
systems data is constantly traveling over the network. Our survey showed that
administrators of big data systems mostly assume that the users of the cluster are
trustworthy and that it is not necessary to activate the existing security mechanisms,
thus avoiding the performance penalty imposed by such measures. But such a penalty
might be acceptable in many scenarios. For instance, 3DES encryption obtained a
marked penalty in all the tests carried out, in relation to all other types of encryption
and to the average completion time without encryption. But AES-256 encryption only
evidenced a penalty between 1.09 and 1.46 times, which makes it acceptable in many
scenarios. For future work, we would like to test other big data platforms, such as
Apache Spark and Apache Storm, analyzing their security and performance.

References

1. Gunelius, S.: The data explosion in 2014 minute by minute – Infographic. ACI (2014). http://
aci.info/2014/07/12/the-data-explosion-in-2014-minute-by-minute-infographic. Accessed 3
July 2016
2. Bernardino, J., Neves, P.C.: Decision-making with big data using open source business
intelligence systems. In: Human Development and Interaction in the Age of Ubiquitous
Technology, pp. 120–147. IGI Global (2016)
3. Yin, S., Kaynak, O.: Big data for modern industry: challenges and trends. Proc. IEEE 102(3),
143–146 (2015)
4. Gaddam, A.: Securing your big data environment. In: Black Hat, USA (2015)
5. Moura, J., Serrão, C.: Security and privacy issues of big data. In: Hassan, M., Marquez, F.
(eds.) Handbook of Research on Trends and Future Directions in Big Data and Web
Intelligence, pp. 20–52. IGI Global, Hershey (2015)
6. Duncan, A., Creese, S., Goldsmith, M.: An overview of insider attacks in cloud computing.
In: Concurrency Computation: Practice and Experience, March 2014
7. Aditham, S., Ranganathan, N.: A novel framework for mitigating insider attacks in big data
systems. In: 2015 IEEE International Conference on Big Data (Big Data), pp. 1876–1885,
October 2015
8. Claycomb, W.R., Nicoll, A.: Insider threats to cloud computing: directions for new research
challenges. In: IEEE (COMPSAC), pp 387–394 (2012)
9. Ponemon Institute: Cost of data breach study: global analysis. IBM, June 2016. http://
www-03.ibm.com/security/data-breach/. Accessed 5 Aug 2016
10. MIT Technology Review Custom and Oracle: Securing the big data life cycle (2015). http://
files.technologyreview.com/whitepapers/Oracle-Securing-the-Big-Data-Life-Cycle.pdf.
Accessed 1 Sept 2016
11. Welcome to Apache™ Hadoop®! In: Apache (2014). http://hadoop.apache.org/. Accessed
10 Sept 2016
12. New approaches required for comprehensive Hadoop security. In: Dataguise, 27 February
2015. http://www.dataguise.com/new-approaches-required-for-comprehensive-hadoop-
security-3/. Accessed 10 Oct 2016
13. Apache Hadoop 2.7.2 (2016). https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/
hadoop-common/SecureMode.html. Accessed 15 Oct 2016
14. King, T.: Getting started with Apache spark: the definitive guide. In: Best Data Integration
Vendors, News & Reviews for Big Data, Applications. ETL and Hadoop (2015). http://
solutionsreview.com/data-integration/getting-started-with-apache-spark-the-definitive-
guide-2/. Accessed 25 Oct 2016
15. Li, R., Jin, C.: Meet-in-the-middle attacks on 10-round AES-256. In: Designs, Codes and
Cryptography, pp. 1–13 (2015)
16. Supriya, G.: A study of encryption algorithms (RSA, DES 3DES, and AES) for information
security. Int. J. Comput. Appl. 67(19), 33–38 (2013)
17. Hamdan, O.A., Zaidan, B.B.: New comparative study between DES 3DES and AES within
nine factors. J. Comput. 2(3), 152–157 (2010)
