
2018 4th IEEE International Conference on Big Data Security on Cloud

Towards Security Monitoring for Cloud Analytic Applications
Marwa Elsayed and Mohammad Zulkernine
School of Computing, Queen’s University
Kingston, ON, K7L 2N8 Canada
{marwa, mzulker}@cs.queensu.ca

Abstract—Cloud computing is empowering new innovations for big data. At the heart of this trend, cloud analytic applications have become the most-hyped revolution. They bring remarkable benefits to big data processing, making it easy, fast, scalable, and cost-effective; however, they also pose many security risks. Security breaches due to malicious, vulnerable, or misconfigured analytic applications are considered among the top security risks to big data, and the risk is further amplified by the coupling of data analytics with the cloud. Effective security measures, delivered by cloud analytic providers, to detect such malicious and anomalous activities are still missing. This paper presents real-time Security Monitoring as a Service (SMaaS), a novel framework that aims to detect security anomalies in cloud analytic applications running on Hadoop clusters. It targets vulnerable, malicious, and misconfigured applications that violate data integrity and confidentiality. Towards this goal, we leverage a big data pipeline that combines advanced software technologies (Apache NiFi, Hive, and Zeppelin) to automate the collection, management, analysis, and visualization of log data from multiple sources, making it cohesive and comprehensive for security inspection. SMaaS monitors a candidate application by collecting log data in real time. It then uses log data analysis to model the application's execution in terms of information flow. The information flow model is crucial for profiling the processing activities conducted throughout the application's execution and, in turn, enables the detection of various types of security anomalies. We evaluate the detection effectiveness and performance efficiency of our framework through experiments over benchmark applications. The evaluation results demonstrate that our system is a viable and efficient solution: it neither modifies the monitored cluster nor imposes overhead on the monitored cluster's performance.

Index Terms—Big data security, log analysis, information flow control, anomaly detection.

I. INTRODUCTION

The rapid adoption of and investment in cloud analytic applications is magnified by the 4Vs of big data: velocity, volume, variety, and veracity. Such increasing momentum has fueled the hype around analytic technologies that perform data-parallel computing on commodity hardware, empowered by on-demand cloud services. As cloud analytic applications grow in popularity, security concerns grow in importance even more.

Analytic applications are prone to data breaches caused by insecure computations, misconfiguration, and unauthorized access resulting from vulnerable, malicious, or misconfigured nodes/tasks [1]. The risk is further amplified by the loss of control that comes with running analytic applications in the cloud. Furthermore, the unique features of analytic clusters, which provide distributed, large-scale, heterogeneous computing environments, render traditional security technologies and regulations ineffective. Even so, there is a lack of effective security measures provided by cloud analytic providers to detect such malicious and anomalous activities.

Several research efforts proposed to fortify the analytic world against security threats span different directions, ranging from differential privacy [2], integrity verification [3-6], policy enforcement [7-10], data provenance [11-14], and honeypot-based [15] to encryption-based [16] mechanisms, among others. The effectiveness of differential privacy as a widespread solution has yet to be established. Integrity verification approaches require intercepting the application execution to verify result integrity, which comes at the cost of a performance penalty. Access control policies cannot prevent misuse activities that breach data security after access has been granted. Provenance mechanisms incur overhead from collecting, storing, and analyzing provenance data, which can make them impractical. Honeypot-based and encryption-based approaches entail modifications to the analytic applications to add the security attestations.

In this paper, we propose the Security Monitoring as a Service (SMaaS) framework. SMaaS is a novel information flow based log analysis solution for detecting security anomalies in cloud analytic applications. It exemplifies one of the services that can be offered by the Information Flow Control as a Service (IFCaaS) model, our previously proposed notion [17]. Inspired by Security as a Service (SecaaS), IFCaaS expands the horizon of SecaaS by featuring cloud-delivered IFC-based security analysis and monitoring services [17].

Hadoop, the most prominent analytic technology, was not originally designed with security, compliance, and risk management support in mind. It has recently evolved to support authentication and encryption mechanisms for protecting data at rest and in transit. Despite these efforts, Hadoop is still exposed to weak authentication and infrastructure attacks. Such attacks increase the security risk to the confidentiality and integrity of data processed by analytic applications.

Figure 1. The SMaaS framework operational overview

The distinct features of computations and data in distributed, large-scale analytic systems raise several challenges for developing an effective log analysis solution for anomaly detection. These challenges are summarized as follows: a) handling log data that is characterized by the 4Vs and collected across the cluster nodes; b) capturing the complex data and control flows spread among the cluster nodes that execute analytic applications; c) considering the different roles of the core daemons responsible for running such analytic applications; and d) mining log data for tangible evidence of security anomalies.

In this work, we propose a novel approach to address the aforesaid challenges. We leverage a streaming data pipeline for security inspection. The data pipeline combines advanced software technologies (Apache NiFi, Hive, Zeppelin) to automate the collection, management, processing, analysis, and visualization of log data from multiple sources, making it valuable, comprehensive, and cohesive for security inspection. Cluster log data is but one part of the whole picture; thus, SMaaS also relies on system logs to complete the picture for security inspection. SMaaS extracts an information flow profile from log data to model the execution of a candidate application. Based on the information flow profile, SMaaS employs several techniques for the detection of security anomalies. These anomalies indicate data integrity and confidentiality violations.

Our overall contributions are as follows: 1) We propose a novel framework called Security Monitoring as a Service (SMaaS) for analytic applications; 2) We introduce an advanced approach that leverages a streaming data pipeline to automate log data ingestion, processing, analysis, and visualization for real-time security inspection; 3) We propose several techniques for detecting different types of security anomalies based on information flow analysis; and 4) We demonstrate the detection effectiveness and performance efficiency of our framework through a set of experiments over benchmark applications.

The remainder of this paper is organized as follows: the operational overview of our proposed framework, the threat model it assumes, and Hadoop in a nutshell are introduced in Section 2. Section 3 presents the details of the framework. The framework implementation and experimental evaluation are presented in Section 4. Section 5 outlines the related work. Section 6 draws the concluding remarks of the paper and outlines future work.

II. OVERVIEW

This section outlines the operational overview of our proposed framework, the threat model it assumes, and a brief outline of Hadoop.

2.1 The SMaaS Operational Overview

We consider three main entities that comprise the cloud service models for this framework: the cloud analytics provider, a trusted party, and consumers running data analytic applications over the provided cluster/service.

There are different architectural deployment offerings for analytic technologies (e.g., Hadoop) in the cloud. These offerings range from basic services (e.g., IaaS, PaaS) to specifically tailored services (e.g., Data Analytics as a Service). Such offerings facilitate running analytic applications in the cloud, and the SMaaS design supports all of them.

We offer SMaaS as an advanced security monitoring feature from the cloud analytics provider. As depicted in Fig. 1, a provider of an analytic technology (e.g., Hadoop) offers its consumers the option to subscribe to the security service (1). For subscribed consumers, the provider enables collecting log data from the respective clusters of these consumers. The provider delegates the trusted party to further analyze the collected data (2). The trusted party, in turn, employs the proposed framework to detect anomalous activities indicating data breaches. Monitoring reports are published on the dashboard for the consumers, and email alerts with detailed analysis reports are sent upon detecting security violations (3).

Figure 2. The SMaaS framework components

2.2 Threat Model

Analytic applications (e.g., MapReduce jobs) can be misconfigured, malicious, or vulnerable and may breach the security of processed data. This can happen throughout their execution via multiple activities (e.g., modifying, copying, or deleting data) at different levels (i.e., input, intermediate results, output) in a way that violates data integrity and confidentiality. Our solution aims to detect five anomaly types: 1) data leakage; 2) data tampering; 3) access violation; 4) misconfigurations; and 5) insecure computation. We assume the correctness and integrity of the log files upon which we build our security analysis, as well as the security of the cluster, the underlying platform, and the infrastructure on which SMaaS is deployed.

2.3 Hadoop in a Nutshell

In this work, we are mostly interested in Hadoop's latest versions (the 2.x and 3.x series). The Hadoop stack consists of core modules: 1) YARN, responsible for job scheduling and cluster resource management; 2) MapReduce, built on YARN for parallel processing of large data sets; and 3) HDFS, a distributed file system for high-throughput access to application data. Each module consists of several daemons. Specifically, YARN comprises the resource manager, node manager, and job history server daemons. HDFS has many daemons, such as the name node and data node, among others.

An analytic application performing a MapReduce job breaks the input data into multiple splits, each equivalent in size to an HDFS block. The application then breaks the processing into two main phases: map and reduce. The map phase maps the input data into key/value pairs forming intermediate results; multiple map tasks are initiated on the cluster's nodes to process the input splits simultaneously. The reduce phase takes the intermediate key/value pairs and produces the final output; multiple reduce tasks are commenced to process all relevant intermediate pairs in parallel and perform the required processing task.

YARN executes a MapReduce application as follows: 1) the resource manager assigns a unique ID to the application and copies the resources required to run it; 2) the resource manager then starts an application master to coordinate the execution of the application's tasks; and 3) the node managers control containers on each individual node to run the tasks concurrently.

The infrastructure knowledge about Hadoop adds another obstacle to implementing security monitoring initiatives. Hadoop produces various types of log data from different sources (e.g., applications, daemons, audit actions). Such log data is a rich source of information for troubleshooting and performance debugging. However, it has a complex, confounding structure that precludes mining useful knowledge for security inspection. Deriving a comprehensive profile of the behavior of an analytic application from the emitted log data, with the goal of fostering the detection of security anomalies, is therefore a critical issue.

There is no direct means of relating information about an executed application (in YARN) to the processed data (in HDFS) from log data. In a typical Hadoop cluster, each individual node/daemon generates its own log data. Logging and auditing are further configured through complicated settings that expose data at various granularity levels (e.g., DEBUG, WARN, INFO), reside in different storage locations, and follow different retention and deletion policies. In this sense, logging and auditing in Hadoop can convey rapidly growing quantities of data of low quality in terms of redundancy, heterogeneity, and diversity. In addition, they fall short of providing cohesive insights about the user activities behind analytic operations. Thus, Hadoop log data can be burdensome to mine for meaningful information or tangible evidence of security anomalies.
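To make the correlation problem concrete, the following minimal sketch (ours, for illustration only; the sample log lines are hypothetical renderings, not verbatim Hadoop output) indexes raw log lines by the two identifier shapes that later allow an executed application to be tied to the data blocks it touched: YARN application IDs of the form application_<clusterTimestamp>_<sequence> and HDFS block IDs of the form blk_<id>.

import re
from collections import defaultdict

# Identifier shapes used by YARN and HDFS; the surrounding log formats vary
# per daemon, so only the identifiers themselves are matched.
APP_ID = re.compile(r"application_\d+_\d+")
BLOCK_ID = re.compile(r"blk_-?\d+")

def index_identifiers(log_lines):
    """Group raw log lines by the YARN/HDFS identifiers they mention."""
    index = defaultdict(list)
    for line in log_lines:
        for pattern in (APP_ID, BLOCK_ID):
            for ident in pattern.findall(line):
                index[ident].append(line)
    return index

# Hypothetical example lines (not verbatim Hadoop output).
sample = [
    "... RMAppImpl: application_1510000000000_0007 state change ...",
    "... DataNode: receiving blk_1073741830 src: /10.0.0.5 ...",
]
print(sorted(index_identifiers(sample)))

SMaaS performs this kind of correlation at a much richer level (Section III); the sketch merely shows why a unified, aggregated view of the logs is a prerequisite.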

Figure 3. The SMaaS data pipeline

III. THE FRAMEWORK

As illustrated in Fig. 2, we build the architecture of the SMaaS framework on a distributed model. Such an architecture supports the scalability required for monitoring distributed analytic applications, which may execute on a cluster spanning thousands of nodes.

The proposed framework comprises four main components: Data Operator, Data Aggregator, Security Analyzer, and Visualization Manager. According to their essential tasks, these components logically form two core engines: observation and inspection. To support our distributed architecture, observation operates transparently on the monitored cluster without introducing any changes or overhead, while inspecting the security of the monitored cluster takes place separately by performing the log data analysis on a different cluster. The observation engine involves the data operator component, which acts as a transparent agent to collect log data from each individual node in the monitored cluster. The inspection engine entails the data aggregator, security analyzer, and visualization manager components.

The data aggregator component consumes the data collected by the data operator. It consolidates the data to model the execution events in the context of the whole application. This component leverages novel techniques to profile the execution behavior in terms of information flow, including data and control dependencies.

Upon the consolidated profile, the security analyzer component detects security anomalies and conducts alerting and reporting actions. The profile is inspected against expected features that characterize benign applications. This inspection approach enables the component to detect deviations indicating anomalous and suspicious events.

Monitoring reports are displayed in graphical web-based frontend dashboards managed by the visualization manager component. The components are further detailed in light of the aforesaid engines in the following subsections.

A. Observation Engine

Data Operator

The data operator component represents the observation engine in SMaaS. It is responsible for observing the monitored cluster through auditing and logging. We devise an approach for log data collection that promotes transparency and overhead efficiency. Instead of setting up a custom log collection process, the data operator component leverages the log4j API (https://logging.apache.org/log4j/1.2/apidocs/index.html) and the Syslog protocol (http://www.rsyslog.com/) as its workhorses to collect log data from the cluster nodes.

We specifically leverage the log4j API to enable the extensive native logging capabilities in Hadoop. The log4j API is the heart of the data operator component for logging Hadoop applications during the course of their execution. We focus on collecting YARN application logs and HDFS datanode daemon logs in order to capture both the control and data flow activities relevant to an application's execution. We automate storing the application logs in an aggregated fashion in HDFS. In this respect, logs from all containers allocated to run the monitored application on the cluster's distributed nodes are unified in one location, ready for ingestion by the inspection engine. The benefit of our approach is twofold: 1) it facilitates serving and managing the collected logs directly by the YARN daemons (i.e., the ResourceManager or JobHistoryServer) in HDFS; and 2) it enables, in turn, retrieving the collected data in an easily managed way.

Users are allowed to run analytic applications and operations through the command line from any node within the cluster. The command line offers an option to run an application under another user (specified in the command). Hadoop log data falls short in recording this information: it records the specified user whose name appears in the command, but not the actual user who submits the command to be run under the specified user's environment. In this case, a user executing a malicious or vulnerable application can go undetected, and another user can incorrectly be held accountable for breaching security.

To cover this gap and achieve deeper monitoring, we augment our approach to log user activities from the host operating system (OS) of the Hadoop nodes. The data operator component leverages the Syslog protocol to configure the host OS to collect system logs of user command activities. Our system supports a transparent, distributed, hierarchical architecture to collect system logs from the cluster nodes. The cluster is forked into groups of nodes forming sub-clusters. A localized relay is configured to poll the system logs from each individual group of nodes (sub-cluster) and instantly forward the received logs to an integral remote collector server, ready for ingestion by the inspection engine.

In this sense, our approach exposes both Hadoop and system log data transparently without requiring any intrusive changes, installing custom agents, or introducing any overhead on the monitored cluster. It also supports scalability by efficiently managing log data collection.
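As a rough illustration of the per-node forwarding that the data operator relies on (a sketch under our own assumptions, not the SMaaS implementation; in the actual deployment the relay is handled by the host's Syslog facility, and the collector host name and port below are hypothetical), Python's standard syslog handler can ship a host-level audit record to a remote collector:

import logging
import logging.handlers

# Hypothetical relay/collector endpoint for this sub-cluster.
COLLECTOR_HOST = "smaas-relay.example.internal"
COLLECTOR_PORT = 514  # conventional syslog UDP port

def build_forwarder():
    """Return a logger that forwards records to the remote syslog collector."""
    handler = logging.handlers.SysLogHandler(
        address=(COLLECTOR_HOST, COLLECTOR_PORT),
        facility=logging.handlers.SysLogHandler.LOG_AUTH,
    )
    handler.setFormatter(logging.Formatter("hadoop-node %(message)s"))
    logger = logging.getLogger("smaas.host-audit")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger

# Example: record which OS user actually submitted a job command.
forwarder = build_forwarder()
forwarder.info("uid=alice cmd='yarn jar wordcount.jar' run_as=hdfs")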
B. Inspection Engine

The data aggregator, security analyzer, and visualization manager components embody the inspection engine in SMaaS. These components work together to automate the processing, analysis, and visualization of log data, making it valuable, comprehensive, and cohesive for security inspection. They employ a data pipeline as an advanced log data processing and analysis approach to reason about the security of analytic applications. This section starts by presenting the main idea of the data pipeline and then dives into the detailed steps conducted by each component in the following subsections.

Our approach leverages Apache NiFi (https://nifi.apache.org/), Hive (https://hive.apache.org/), and Zeppelin (https://zeppelin.apache.org/) to build the data pipeline, as outlined in Fig. 3. We build the data aggregator and security analyzer components on top of the Apache NiFi (Niagara Files) platform. These components are designed as groups of processors running in an integral workflow inside NiFi. This design gives our system the ability to stream log data from different sources (Hadoop and system); ingest high volumes of data in real time; consume and transform many log data formats; and process log data in a distributed, scalable manner. These features, in turn, empower our system with real-time security detection and decision-making capabilities. The aggregated data is streamed into Apache Hive to enable historical analysis. The visualization manager component provides, on a Zeppelin dashboard, relational views extracted from the aggregated data residing in Hive. These visualized views are valuable for determining effective responses and decisions to the detected security issues.

Data Aggregator

The data aggregator component is responsible for ingesting, at runtime, the log data collected by the data operator. It then consolidates the data to profile the execution of MapReduce applications as an information flow view.

This component utilizes a combination of NiFi standard processors. It starts by remotely retrieving information about the running/finished applications in the monitored cluster. We optimize the component to fetch log data only if the retrieved information has changed, indicating newly submitted applications. For each candidate application, its log data is fetched from the YARN daemons. Recall that the fetched data represents logs from all containers allocated to run the candidate application over the cluster's distributed nodes. The data aggregator component consumes and parses the data in order to obtain every piece of information relevant to the application's execution. Then, it transforms and consolidates these pieces of information into a single profile in a condensed JSON (JavaScript Object Notation) format.

{
  "Application": {
    "ID": "",
    "Name": "",
    "User": "",
    "SubmitTime": "",
    "StartTime": "",
    "FinishTime": "",
    "MapsCount": "",
    "ReducesCount": "",
    "avgMapTime": "",
    "avgReduceTime": "",
    "avgShuffleTime": "",
    "avgMergeTime": "",
    "mapClassName": "",
    "ControlFlow": {
      "Tasks": {
        "TaskId": "",
        "Type": "",
        "AssignedContainerId": "",
        "Node": "",
        "StartTime": "",
        "FinishTime": "",
        "ElapsedTime": "",
        "State": "",
        "DependentTaskIds": [],
        "DataFlowSplit": ""
      }
    },
    "DataFlow": {
      "Input": {
        "DirPath": "",
        "Permission": "",
        "Splits": {
          "SplitID": "",
          "BlkNum": "",
          "Checksum": ""
        }
      },
      "Output": {
        "DirPath": "",
        "Permission": "",
        "Splits": {
          "SplitID": "",
          "BlkNum": "",
          "Checksum": ""
        }
      },
      "Intermediate": {
        "DirPath": "",
        "Permission": ""
      },
      "Cache": {
        "DirPath": "",
        "Permission": ""
      }
    }
  }
}

Figure 4. The JSON-Schema of Information Flow Profile
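To illustrate how a profile shaped like Fig. 4 can be consumed downstream, the following minimal sketch (ours, for illustration; the field names follow the schema, but a populated profile would carry lists of tasks rather than the single placeholder object shown, so both forms are accepted) derives a simple task dependency view from the ControlFlow section:

import json

def as_list(node):
    """Accept either a single object (as in the schema) or a list of them."""
    if node is None:
        return []
    return node if isinstance(node, list) else [node]

def task_dependencies(profile_json):
    """Map each task ID to its control-flow dependencies and data split."""
    app = json.loads(profile_json)["Application"]
    deps = {}
    for task in as_list(app.get("ControlFlow", {}).get("Tasks")):
        deps[task.get("TaskId")] = {
            "type": task.get("Type"),
            "depends_on": task.get("DependentTaskIds", []),
            "split": task.get("DataFlowSplit"),
        }
    return deps

# Hypothetical, abbreviated profile instance.
example = json.dumps({"Application": {"ControlFlow": {"Tasks": [
    {"TaskId": "m_0", "Type": "MAP", "DependentTaskIds": [], "DataFlowSplit": "split_0"},
    {"TaskId": "r_0", "Type": "REDUCE", "DependentTaskIds": ["m_0"], "DataFlowSplit": "part-r-00000"},
]}}})
print(task_dependencies(example))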
This profile represents an information flow view that models the application's execution. The JSON schema of the profile is illustrated in Fig. 4. The profile models the application's execution from two angles: global and partial. The former captures global attributes of the application, such as its ID, name, start time, finish time, and average map time. Furthermore, attributes detailing the data flow processed by the application, broken down into input, output, and intermediate data, are also tracked. Such attributes are derived directly from HDFS and the nodes where the data is stored. The latter portrays the application's processing activities (map and reduce tasks) and the dependencies (data and control) between them. For each map task, the profile captures the data split that flows as input to the task. Similarly, for each reduce task, the profile shows the list of dependent map tasks that feed input data to the reduce task and the data split that flows as output from the reduce task.

The last step is streaming the constructed profile into Apache Hive. This step facilitates incrementally storing the profiles of applications that ran on the monitored cluster in a unified database. This database enables security administrators to perform historical analysis.

Security Analyzer

The security analyzer component utilizes the information flow profile received from the data aggregator component to detect anomalies.

This component is designed as a combination of NiFi standard processors. We implement the security analysis logic as scripts that execute automatically inside the processors. The analyzer reasons about five anomaly types: 1) data leakage; 2) data tampering; 3) access violation; 4) misconfigurations; and 5) insecure computation. The analyzer takes the information flow profile as the baseline for the security analysis. It checks the profile against expected features that govern benign applications. Any deviation from the expected features indicates a security anomaly. The analyzer also correlates information from daemon logs and Syslogs when needed to identify a compromise. These logs are collected by the data operator and ingested by the data aggregator to be ready for this component. The analyzer sends email alerts with detailed analysis reports to the security administrator upon detecting security violations. The analysis techniques to detect the anomalies are further explained in Section IV.

Visualization Manager

The visualization manager component provides a dashboard as an important feature of our monitoring solution. We leverage Apache Zeppelin for the visualization dashboard. The dashboard gives security administrators the ability to visualize the aggregated information flow profile as well as relational views of the aggregated data from the Hive database. This is important for determining effective responses to security issues. For example, an administrator can query for other applications executed by a malicious or victim user/node involved in a detected security violation. As a remedy, the administrator may block or isolate activities from this user/node until proper mitigation actions are conducted.

IV. EXPERIMENTAL EVALUATION

In this section, we present our experimental setup and evaluation results. Our evaluation targets two questions: 1) what is the effectiveness of SMaaS in detecting security anomalies in MapReduce applications, and 2) what is the performance efficiency of our system? The following sections describe the experiments conducted to assess each question, respectively.

We set up our experiments over a private cloud consisting of six VMs. Each VM has Ubuntu 14.04, 16 GB RAM, 4 CPUs, and 100 GB storage. We chose the Hortonworks data platform distribution (https://hortonworks.com/) to build the monitored Hadoop cluster. We built our system over a NiFi cluster using the Hortonworks data flow distribution. Each cluster has one master and two slave nodes. We managed both clusters via an Ambari (https://ambari.apache.org/) server.

The experiments are conducted over three popular MapReduce benchmark applications [18]: TeraSort, TeraGen, and WordCount. They are shipped with the Hadoop distribution. TeraSort sorts data stored in files. We used TeraGen to generate data as input to TeraSort with sizes of 1 GB, 5 GB, and 10 GB. WordCount counts word occurrences in files.

A. Anomaly Detection

We assess the detection effectiveness of our solution over the aforesaid five types of anomalies. We crafted the code of the WordCount application to implement the data leakage, data tampering, access violation, and misconfiguration types. For the insecure computation type, we changed the permission of the default locations to open access. In the end, we have six versions of the WordCount application, five of which are malicious, vulnerable, or misconfigured. We applied our solution, and it detected all anomalies with 100% accuracy. In what follows, the detection techniques employed by the security analyzer are explained for each anomaly type.

1) Data Leakage

A malicious application may copy input or output data to an unauthorized location. To detect data leakage activity, the analyzer reasons about the "BlkNum" attribute of all splits of the input and output dataflow in the application's profile. Then, it analyzes the datanode daemon logs looking for unauthorized write operations on any block of data that is not related to the expected MapReduce control flow. The information is correlated based on the "StartTime" and "BlkNum" attributes in the profile. An occurrence in the datanode logs indicates data leakage activity, since only MapReduce-related activities are expected from a benign application.

By knowing the compromised data split, the analyzer can further track the affected map and reduce tasks by traversing the profile. In case of input leakage, the "DataFlowSplit" attribute leads to the affected map task. From there, we can infer the reduce task that depends on it by linking its "TaskId" with the reduce task's "DependentTaskIds". In case of output leakage, the "DataFlowSplit" suffices to identify the affected map task.
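To make the data leakage check concrete, the following minimal sketch (ours, under stated assumptions: the profile layout follows Fig. 4, the "BlkNum" values are stored as full block identifiers, and the datanode log line is a simplified hypothetical rendering rather than verbatim Hadoop output) flags block writes that do not belong to the application's expected input/output splits:

import re

BLOCK_ID = re.compile(r"blk_-?\d+")

def expected_blocks(profile):
    """Collect the block identifiers of the application's input and output splits."""
    flows = profile["Application"]["DataFlow"]
    blocks = set()
    for side in ("Input", "Output"):
        splits = flows.get(side, {}).get("Splits", [])
        splits = splits if isinstance(splits, list) else [splits]
        for split in splits:
            if split.get("BlkNum"):
                blocks.add(split["BlkNum"])
    return blocks

def suspicious_writes(profile, datanode_log_lines):
    """Return log lines recording writes of blocks outside the expected set."""
    allowed = expected_blocks(profile)
    hits = []
    for line in datanode_log_lines:
        if "writ" not in line.lower():  # crude filter for write/writing operations
            continue
        for blk in BLOCK_ID.findall(line):
            if blk not in allowed:
                hits.append(line)
    return hits

# Hypothetical example: two expected blocks, one unexpected write.
profile = {"Application": {"DataFlow": {
    "Input": {"Splits": [{"SplitID": "s0", "BlkNum": "blk_1073741830"}]},
    "Output": {"Splits": [{"SplitID": "s1", "BlkNum": "blk_1073741831"}]},
}}}
log = ["2018-01-01 10:00:01 INFO DataNode: writing blk_1073741999 dst: /tmp/exfil"]
print(suspicious_writes(profile, log))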
Figure 5. I/O performance evaluation over various data sizes of the SMaaS components: (a) data aggregator; (b) security analyzer.

2) Data Tampering

A malicious application may replace input or output data with wrong data. To detect data tampering, the analyzer reasons about the "DirPath" attributes of the input and output dataflow in the application's profile. Then, it analyzes the Syslogs to determine the input and output paths that were configured when the user submitted the application. The information is precisely correlated based on the "SubmitTime" and "mapClassName" attributes from the profile. A mismatch between the configured and processed paths indicates data tampering activity, because a benign application is expected to process the input and output data as configured by the user.

3) Access Violation

A vulnerable application may be submitted to process input from, or produce output to, unsafe locations. In addition, a malicious application may change the access permissions of the input or output, making them prone to security compromise. Access violation is detected by reasoning about the "Permission" attributes of the input and output dataflow in the profile. A permission specifying that the input or output paths have open access indicates a violation, because a benign application is expected to process and produce data whose access is restricted to authorized users only.

4) Misconfigurations

Throughout the execution of an application, YARN stores the application's files inside a local cache. The cache specifically contains the work directory of each individual container assigned to execute the application's tasks. It also stores the intermediate results throughout the application's execution. On the other side, the input and output data processed by an application are typically stored in HDFS, and the datanode daemons store data blocks in local directories. The locations of the local cache and directories are normally configured at cluster set-up time.

A misconfigured application may change these configurations to locations other than the typically expected ones. A misconfiguration violation is detected by checking the "DirPath" attribute of the cache and intermediate dataflow in the profile and linking them with the values of the corresponding properties in Hadoop's configuration file.

5) Insecure Computation

A vulnerable application may execute while its intermediate data, application files, or data blocks are stored in locations that have open access. In this sense, the application is prone to insecure computations due to the risk of tampering with its processing in terms of data and control flows. The analyzer detects insecure computations by examining the "Permission" attribute of the intermediate and cache dataflow in the profile. Such data is expected to have restrictive access.
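As an illustration of the permission-based checks described above for access violations and insecure computations (a sketch under our own assumptions, not the SMaaS implementation: the "Permission" values are assumed to be POSIX-style octal strings, and "open access" is simplified to world-writable):

def open_access(permission):
    """Treat a POSIX-style octal permission string as open if 'others' can write."""
    perm = (permission or "").strip()
    return bool(perm) and perm[-1] in {"2", "3", "6", "7"}

def permission_violations(profile):
    """Return dataflow entries of the Fig. 4 profile whose locations have open access."""
    flows = profile["Application"]["DataFlow"]
    findings = []
    for name in ("Input", "Output"):          # access violation
        if open_access(flows.get(name, {}).get("Permission")):
            findings.append((name, "access violation"))
    for name in ("Intermediate", "Cache"):    # insecure computation
        if open_access(flows.get(name, {}).get("Permission")):
            findings.append((name, "insecure computation"))
    return findings

# Hypothetical profile fragment: only the output directory is world-writable.
profile = {"Application": {"DataFlow": {
    "Input": {"Permission": "750"}, "Output": {"Permission": "777"},
    "Intermediate": {"Permission": "700"}, "Cache": {"Permission": "770"},
}}}
print(permission_violations(profile))  # [('Output', 'access violation')]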
Figure 6. Execution time evaluation over various data sizes of the SMaaS components: (a) data aggregator; (b) security analyzer.

Figure 7. The SMaaS CPU utilization evaluation

Figure 8. The SMaaS memory consumption evaluation

B. Performance

We conducted several experiments to evaluate the performance of SMaaS as a streaming analytic solution. The experiments exercise different aspects, including I/O performance, execution time, and resource usage. They are designed to appraise the SMaaS performance at two levels: component and system. Component-level experiments focus mainly on the data aggregator and security analyzer, the two main components that process streaming data in SMaaS. System-level experiments reflect the overall performance of SMaaS, hosted on the NiFi cluster.

1) Component-level performance

This section presents the experiments employed to measure a) I/O performance and b) execution time.

We employ customized reporting tasks inside NiFi to send the collected metrics about each component to Grafana (https://grafana.com/). The metrics are measured over a five-minute rolling window. The I/O performance is measured in terms of the BytesRead and BytesWrite metrics. To estimate the execution time, we use the TotalTaskDurationSeconds metric. These metrics are defined and measured as follows:

o BytesRead: the total number of bytes that the component read from disk during the rolling window.
o BytesWrite: the total number of bytes written by the component to disk during the rolling window.
o TotalTaskDurationSeconds: the total time that the component used to complete its task during the rolling window.

We perform two different experiments to evaluate each component. First, we assess the aggregator component in analyzing different volumes of application logs. We execute the TeraSort and TeraGen benchmark applications over various data volumes (1 GB, 5 GB, 10 GB); these volumes consequently increase the volume of each application's logs that need to be analyzed. On the flip side, we exercise the analyzer component in processing various workloads (1.2 GB, 6.9 GB, 15 GB) of Syslog and datanode log data. We created the workloads by combining the logs over a period of one month.

a) I/O Performance

The BytesRead and BytesWrite for varying data quantities for both the aggregator and the analyzer are shown in Fig. 5 parts (a) and (b), respectively. As inferred from the figures, the performance grows proportionally with the increasing size of data. The data aggregator involves more I/O operations than the security analyzer: the latter performs the analysis on the fly, whereas the aggregator interacts with the disk while preparing the information flow profile. Thus, the performance of the analyzer is not affected by the size of the processed data, whereas the aggregator's performance depends on the data volume; even so, it stays within an efficient range between 127 and 224 MB.

b) Execution Time

Fig. 6 parts (a) and (b) illustrate the execution time of the aggregator and the analyzer over different data loads, respectively. The execution time increases non-linearly as the volume of data grows. We notice that the execution time of the analyzer is higher than that of the aggregator; the main reason is that the analyzer executes diverse algorithms for the security inspection that entail processing time. Both components are highly efficient, as their execution does not exceed 1 second, thanks to the proposed data pipeline that enables our system to efficiently support data-parallel computing.

2) System-level performance

This section highlights our experiments to gauge two aspects: a) CPU utilization and b) memory consumption of our system. Monitoring resource consumption is not supported through NiFi reporting tasks; thus, we implement our own monitoring component as a group of NiFi processors.

The monitoring component leverages the NiFi API to fetch a system diagnostics report about the SMaaS cluster. The report captures the heapUtilization and processorLoadAverage metrics. The monitoring component then publishes the metrics, after refinement, into Grafana through the AMS API (Ambari Metrics Service: https://cwiki.apache.org/confluence/display/AMBARI/Metrics+Collector+API+Specification).

Recall that SMaaS runs on top of NiFi. As NiFi executes within a Java Virtual Machine (JVM) on the host VMs, the SMaaS resources are limited to the CPU capacity and memory space afforded to NiFi. In this sense, the SMaaS components share the resources dedicated to the JVM. We perform the experiments in real-world settings: SMaaS is an online solution, so it runs continuously while we execute 12 versions of the benchmark applications over various data sizes and anomaly types.

a) CPU Utilization

Fig. 7 shows the average processor utilization of our system. The utilization varies throughout the experiment time; it does not exceed 15% and averages 8%. Running our system on NiFi to support data-parallel processing diminishes the impact on CPU usage.

b) Memory Consumption

Fig. 8 shows the percentage of heap memory consumed by SMaaS during the experiments. The average consumption is 37% and the maximum reached is 46%. The average used heap memory is 3.7 GB, the maximum is 4.6 GB, and the 95th percentile is 4.2 GB. As observed from the results, our system achieves an efficient CPU utilization and memory consumption footprint.

V. RELATED WORK

Several research efforts proposed to fortify the analytic world against security threats span different directions, ranging from differential privacy [2], integrity verification [3-6], policy enforcement [7-10], data provenance [11-14], and honeypot-based [15] to encryption-based [16] mechanisms, among others. Interested readers can find further details about intrusion detection systems in the cloud in a recent survey [19].

Airavat [2] applied differential privacy to protect data from malicious MapReduce jobs. Although differential privacy has recently attracted researchers as an effective solution in specific problem contexts, its effectiveness as a widespread solution has yet to be established. Integrity verification is a mechanism applied for decades, which has recently reappeared in the context of MapReduce to examine the security of results produced by MapReduce jobs.

SecureMR [3] and TrustMR [4] rely mainly on replication-based computations, while VIAF [5] extends the mechanism with a query-based approach to further detect colluding attacks. IntegrityMR [6] performs the integrity checks at the application layer along with the MapReduce task layer. These approaches differ in the scope of integrity assurance, the logical layer of operation, and the mode of checking. In general, they require intercepting the computation of MapReduce tasks to verify result integrity, which comes at the cost of a performance penalty.

Some approaches, such as GuardMR [7] and Vigiles [8], enforce security policies for access control at different granularities by modifying the underlying platform [8] or adding an extra access control layer [7]. Access control policies cannot prevent misuse activities that breach data security after access has been granted. Another approach [9] proposes an IFC-based access control model that supports multi-tenancy in SaaS systems. IFC offers an advancement over access control as it can provide end-to-end protection. As an alternative enhancement, accountability mechanisms have been proposed to harden access control policies. AccountableMR [10] incorporates such an enhancement into MapReduce: accountability is achieved by verifying that data access occurring after authorization complies with the security policies governing data security.

Data provenance (or lineage) mechanisms are typically used to keep a history about data for the purpose of reproducibility. Recently, a few approaches have embraced such mechanisms for big data security [11-14]. One approach [11] proposed a formal notion of provenance to enable forward and backward tracing of data during the execution of MapReduce tasks. Other approaches [12-14] perform data provenance by analyzing metadata and system log files to collect traces about data processing for the purpose of detecting anomalies. Such mechanisms face several challenges that may hinder their practicability, such as the volume of captured provenance data, the storage and integration required to effectively analyze these data, and, most importantly, the overhead incurred from collecting these data during the execution of distributed analytic tasks.

Another approach [15] adopts a honeypot-based mechanism to detect unauthorized access in MapReduce. A different approach [16] leverages encryption to protect data stored in Spark. Encryption mechanisms may disrupt the typical operations within the system when data is being processed. Furthermore, encryption and decryption are costly operations that may impose a performance burden and reduce system operations as well.

VI. CONCLUSION

In this paper, we introduce a novel Security Monitoring as a Service (SMaaS) framework for cloud analytic applications. Our system provides an advanced online defense in depth for analytic applications. The benefit is twofold: 1) it hardens the security of analytic clusters (e.g., Hadoop) by inspecting applications running over them, and 2) it protects big data processed by analytic applications by detecting anomalies that breach its security. The SMaaS framework is offered as a security monitoring service by the cloud analytic provider. The provider puts a trusted party in charge of analyzing log data collected from the monitored cluster. The trusted party, in turn, employs SMaaS to detect security anomalies.

By intelligently leveraging a streaming big data pipeline and cloud technologies, SMaaS eludes the challenges that originate in analytic applications. The SMaaS pipeline automates the collection, management, processing, analysis, and visualization of log data. The framework extracts an information flow profile to model the execution of the analyzed application. The profile captures both control and data dependencies across the distributed tasks/nodes running the analyzed application. Several techniques are employed for the detection of security anomalies based on analyzing the information flow profile. Our solution checks the profile against expected features that govern benign applications; any deviation from the expected features indicates a security anomaly. The system helps security administrators take mitigation actions when alerted about discovered anomalies.

In this sense, our system conceals the log processing for security inspection behind the analytic cluster's scene. It does not require any intrusive changes, installing custom agents, or introducing any overhead on the monitored cluster. It has the following implications: a) handling log data that is characterized by the 4Vs and collected across the cluster nodes; b) capturing the complex data and control flows spread among the cluster nodes that execute analytic applications; c) considering the different roles of the core daemons responsible for running such analytic applications; and d) mining cluster and system logs for tangible evidence of security anomalies.

We implement, deploy, and evaluate SMaaS in a private cloud. Our experiments validate the detection effectiveness and performance efficiency of our framework. We conduct our experiments over Hadoop-based benchmark applications. The results demonstrate that our system attains high detection accuracy for five different anomaly types while achieving high performance with a lightweight footprint on resource utilization.

ACKNOWLEDGMENT

This research is partially supported by the Natural Sciences & Engineering Research Council of Canada (NSERC) and Canada Research Chairs (CRC). Marwa Elsayed thanks the Schlumberger Foundation for supporting her Ph.D. study in Canada.

REFERENCES

[1] Cloud Security Alliance, "Big Data Security and Privacy Handbook: 100 Best Practices in Big Data Security and Privacy," 2016.
[2] I. Roy, S. Setty, A. Kilzer, V. Shmatikov, and E. Witchel, "Airavat: Security and Privacy for MapReduce," Proc. of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI'10), USENIX Association, Berkeley, CA, USA, pp. 297-312, 2010.
[3] W. Wei, J. Du, T. Yu, and X. Gu, "SecureMR: A Service Integrity Assurance Framework for MapReduce," Proc. of the 2009 Annual Computer Security Applications Conf. (ACSAC), 2009.
[4] H. Ulusoy, M. Kantarcioglu, and E. Pattuk, "TrustMR: Computation integrity assurance system for MapReduce," Proc. of IEEE Int. Conf. on Big Data (Big Data), pp. 441-450, IEEE, 2015.
[5] Y. Wang and J. Wei, "VIAF: Verification-based integrity assurance framework for MapReduce," Proc. of the IEEE Int. Conf. on Cloud Computing (CLOUD), pp. 300-307, 2011.
[6] Y. Wang, J. Wei, M. Srivatsa, Y. Duan, and W. Du, "IntegrityMR: Integrity assurance framework for big data analytics and management applications," Proc. of IEEE Int. Conf. on Big Data (BigData), pp. 33-40, IEEE, 2013.
[7] H. Ulusoy, P. Colombo, E. Ferrari, M. Kantarcioglu, and E. Pattuk, "GuardMR: Fine-grained security policy enforcement for MapReduce systems," Proc. of the 10th ACM Symposium on Information, Computer and Communications Security, pp. 285-296, ACM, 2015.
[8] H. Ulusoy, M. Kantarcioglu, K. Hamlen, and E. Pattuk, "Vigiles: Fine-grained access control for MapReduce systems," Proc. of IEEE Int. Conf. on Big Data (BigData), 2014.
[9] N. Solanki, W. Zhu, I. Yen, F. Bastani, and E. Rezvani, "Multi-tenant Access and Information Flow Control for SaaS," Proc. of IEEE Int. Conf. on Web Services (ICWS), pp. 99-106, 2016.
[10] H. Ulusoy, M. Kantarcioglu, E. Pattuk, and L. Kagal, "AccountableMR: Toward accountable MapReduce systems," Proc. of 2015 IEEE Int. Conf. on Big Data (Big Data), pp. 451-460, 2015.
[11] R. Ikeda, H. Park, and J. Widom, "Provenance for generalized map and reduce workflows," Proc. of the 5th Biennial Conf. on Innovative Data Systems Research (CIDR'11), California, USA, 2011.
[12] E. Yoon and A. Squicciarini, "Toward detecting compromised MapReduce workers through log analysis," Proc. of the 2014 14th IEEE/ACM Int. Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 41-50, 2014.
[13] C. Liao and A. Squicciarini, "Towards provenance-based anomaly detection in MapReduce," Proc. of the 2015 15th IEEE/ACM Int. Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 647-656, 2015.
[14] O. Alabi, J. Beckman, M. Dark, and J. Springer, "Toward a data spillage prevention process in Hadoop using data provenance," Proc. of the 2015 Workshop on Changing Landscapes in HPC Security, pp. 9-13, ACM, 2015.
[15] H. Ulusoy, M. Kantarcioglu, B. Thuraisingham, and L. Khan, "Honeypot based unauthorized data access detection in MapReduce systems," Proc. of the 2015 IEEE Int. Conf. on Intelligence and Security Informatics (ISI), pp. 126-131, 2015.
[16] S. Shah, B. Paulovicks, and P. Zerfos, "Data-at-rest security for Spark," Proc. of the 2016 IEEE Int. Conf. on Big Data (Big Data), pp. 1464-1473, 2016.
[17] M. Elsayed and M. Zulkernine, "IFCaaS: Information Flow Control as a Service for Cloud Security," Proc. of the 2016 11th Int. Conf. on Availability, Reliability and Security (ARES), Salzburg, Austria, pp. 211-216, 2016, doi: 10.1109/ARES.2016.27.
[18] D. Eadline, "Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem," Addison-Wesley Professional, 2015.
[19] M. Elsayed and M. Zulkernine, "A Classification of Intrusion Detection Systems in the Cloud," IPSJ Journal of Information Processing, vol. 23, no. 4, pp. 392-401, 2015.

