III. PROPOSED METHOD

In this section, the objectives and assumptions of the threat model are listed first. Then the two steps of our proposed method - local analysis and consensus - are explained in detail. The overall algorithmic flow of the proposed method can be seen in Figure 3.

A. Threat Model

A software-centric threat model, similar to our previous work [4], is used that dissects the big data platform design to understand its operational vulnerabilities. The main threats we focus on in this work are related to data degradation, data theft and configuration manipulation. These threats are a subset of insider attacks that are not typically handled in the big data world. The common link among these threats is that they are directly related to compromising the data hosted by the cluster. Hence, they have an obvious relation to the cluster infrastructure and the memory behavior of datanodes. Our threat model does not consider vulnerabilities originating outside the cluster, such as user- and network-level vulnerabilities.

Some assumptions about the system were made to fit this threat model:
• All datanodes use the same architecture, operating system and page size.
• Replica nodes host similar data, which is not necessarily the case in a production environment.
• The path to the framework installation is the same on all datanodes.
• The communication network will always be intact.
• The communication cost among replicas (slaves) is at most the communication cost between the namenode (master) and the datanodes (slaves).

B. Local Analysis

Proposing the use of machine learning techniques for prediction models is not new. For example, Pellegrini et al. [37] proposed the use of Support Vector Machines for predicting application failure. Murray et al. [38] used naive Bayesian classifiers for predicting failures in hard drives. Recently, researchers from Facebook [28] used stack-augmented recurrent networks to predict algorithm behavior. But to the best of our knowledge, LSTM networks have not previously been used to predict the memory behavior of datanodes in order to anticipate attacks.

Fig. 2: Sample LSTM cell design [36]

FG = σ(w_fg × [o_{t−1}, i_t] + b_fg)    (1a)
IG = σ(w_ig × [o_{t−1}, i_t] + b_ig)    (1b)
CG = tanh(w_cg × [o_{t−1}, i_t] + b_cg)    (1c)
c_t = c_{t−1} × FG + IG × CG    (1d)
OG = σ(w_og × [o_{t−1}, i_t] + b_og)    (1e)
o_t = OG × tanh(c_t)    (1f)

The first step in predicting or detecting attacks on a system is to understand how the system works. When a job is submitted to the big data cluster, a resource manager or scheduler such as YARN will schedule that job to run on datanodes that host the required data. Each replica datanode is supposed to execute the same job on its copy of the data to maintain data consistency. This replication property of a big data cluster helps in predicting an attack: by monitoring the memory behavior of replica datanodes and understanding their correlation, we can predict the memory requirements of a datanode in the near future with some certainty.

The proposed memory behavior prediction method involves running an LSTM network on the memory mapping data obtained locally and independently at each datanode of the big data cluster. For this work, we chose to closely monitor four different features of memory, as given below:
• the actual size of memory mappings;
• the resident set size (RSS), which varies depending on the caching technique;
• private pages, to characterize a particular workload;
• shared pages, to understand the datanode workload in the background, especially when the job is memory intensive.

Hence, each data point in the observed memory access sequence is a vector of size four. Since the readings are taken at regular time intervals, this data resembles a time series. From Equation 1, the input to our LSTM cell at time step t−1 is i_{t−1}, which is a data point with four values <a, b, c, d>, where a is the size of the mapping, b is the size of the pages in the resident set of the memory, c is the size of the private pages and d is the size of the shared pages. The recurrent LSTM network in our case is simply a single LSTM cell with a feedback loop. The decision reached at time step t−1 affects the decision it will reach at time step t. After a memory data point is fed to the LSTM cell as input, the output o_t at the next time step t is the predicted output. During the training phase, this output is compared with the actual observed value for correctness, and the weights and biases of the LSTM cell are adjusted accordingly. For this purpose, we used a standard metric called Root Mean Squared Error (RMSE). RMSE, as given in Equation 2, is a general-purpose error metric for numerical predictions. Here, n is the number of observations in the sequence, y_k represents a single observation and ŷ_k denotes the corresponding predicted value. It is sensitive to small errors and severely punishes large errors.
Fig. 3: Algorithmic flow of the proposed method.
When the RMSE falls above a predefined threshold T, we can be certain that the datanode behavior is abnormal. But an observed anomaly in the memory behavior at a datanode at this stage is only a necessary, not a sufficient, condition to predict an attack. For this reason, we use the second step in our proposed method - consensus. The data points representing the observed and the expected memory usage for a certain time window τ are shared with the other replica nodes using the secure communication protocol for consensus.

RMSE = sqrt((1/n) Σ_{k=1}^{n} (y_k − ŷ_k)²)    (2)

C. Consensus

We use the same framework given in our previous work [4] to host our proposed attack prediction method. This framework facilitates secure communication among datanodes to exchange their local analysis results, share the statistical test results for consensus and notify the master node, if necessary. The framework uses encryption/decryption techniques such as AES along with communication protocols such as SSH, and proposes the use of hardware security such as TPM and secure co-processors for key generation, storage and hashing.

When a datanode receives memory behavior data from other replica datanodes, it compares that data with its local data. Both datasets are two-dimensional, containing predicted values and actual values. Each replica datanode runs a variance test such as ANOVA to measure the correlation between the local data and the received data. If the null hypothesis is rejected, then a multiple comparison test such as the Tukey test is used to find the corrupt datanode. When a consensus is reached among datanodes about a possible attack, necessary prevention steps such as notifying the master node can be taken.

IV. EXPERIMENTAL RESULTS

This section presents our experiments and their results in three parts: the first part describes the setup of the Hadoop cluster and the LSTM network, and our choice of programs to test the proposed method; the next part explains the results observed from the conducted experiments; and the final part is an analysis of the results along with some observations about the proposed method.

A. Setup

We used Amazon EC2 to set up a small Hadoop cluster of 3 datanodes, 1 namenode and 1 secondary namenode. All programs used for testing are obtained from the Hadoop MapReduce examples that come with the basic installation of Hadoop and are run on this Hadoop cluster. Some details about these MapReduce programs are given in Table I. We used a standard quad-core desktop to run the Keras & TensorFlow frameworks required for implementing the LSTM network. Statistical tests were conducted using Matlab. The configurations of the Hadoop nodes and the LSTM machine are given in Table II.

First, each Hadoop MapReduce example is run on the cluster a hundred times. For every iteration, the memory readings are taken periodically at a time interval of five seconds from /proc/pid/smaps, the Linux interface for observing the memory usage of processes. Average values of the four memory features - size, resident set, private pages and shared pages - are calculated and stored in an array of four-dimensional vectors.
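The periodic readings described above can be gathered with a few lines of Python. This sketch assumes the standard Linux smaps field names (Size, Rss, Private_Clean/Private_Dirty, Shared_Clean/Shared_Dirty) and returns per-feature totals in kB, summed over all mappings of the process.

```python
def read_memory_features(pid):
    """Collect the four monitored features from /proc/<pid>/smaps.

    Returns totals in kB: mapping size, resident set, private pages
    (clean + dirty) and shared pages (clean + dirty).
    """
    totals = {"size": 0, "rss": 0, "private": 0, "shared": 0}
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2 or not parts[1].isdigit():
                continue  # skip mapping header lines and VmFlags
            key, kb = parts[0].rstrip(":"), int(parts[1])
            if key == "Size":
                totals["size"] += kb
            elif key == "Rss":
                totals["rss"] += kb
            elif key in ("Private_Clean", "Private_Dirty"):
                totals["private"] += kb
            elif key in ("Shared_Clean", "Shared_Dirty"):
                totals["shared"] += kb
    return totals
```

Sampling this function every five seconds for a monitored process id and averaging per run yields the four-dimensional data points fed to the LSTM network.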
By the end of the hundred iterations, we have a time-series history of process behavior where each data point tells us about the memory behavior of the process for a specific run. This is a part of local analysis, and it is done in parallel on every datanode.

TABLE I: List of Hadoop MapReduce Examples

E.No.  Name                Description
1      aggregatewordcount  An Aggregate-based map/reduce program that counts the words in the input files.
2      bbp                 A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
3      pentomino           A map/reduce tile laying program to find solutions to pentomino problems.
4      pi                  A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
5      randomtextwriter    A map/reduce program that writes 10GB of random textual data per node.
6      randomwriter        A map/reduce program that writes 10GB of random data per node.
7      sort                A map/reduce program that sorts the data written by the random writer.
8      wordcount           A map/reduce program that counts the words in the input files.

TABLE II: Machine configurations used for experiments

Attribute               Value                 Value
Purpose                 Hadoop                LSTM
EC2 Instance Model      t2.micro              N/A
Processor               Intel Xeon with       AMD A8 Quad Core
                        Turbo
Compute Units           1 (Burstable)         4
vCPU                    1                     N/A
Memory (GB)             1                     16
Storage                 Elastic Block Store   SATA
Networking Performance  low                   N/A
Operating System        Linux/UNIX            Ubuntu 16.04
Software distribution   Hadoop 2.7.1          Keras, TensorFlow

TABLE III: Insider Attacks on a Datanode

No.  Title                 Description                               Impact
1    Mail Attack           A system admin is sending out user       Data Privacy
                           data as attachments from a datanode
                           to a personal email account.
2    Configuration Attack  A system admin changes the HDFS           Datanode performance
                           configuration of a datanode he/she
                           has access to.
3    Log Deletion Attack   A system admin modifies user data and     System Analysis
                           deletes the log files on a datanode
                           he/she has access to.

Next, the time series data of a process, i.e., one hundred time steps of four-dimensional data points, is fed to a LSTM network. Our LSTM network has one hidden layer of four LSTM cells and a dense output layer of four neurons (fully connected) to predict the four memory features. It looks back at one previous time step when predicting a new value, and the batch size is also set to one. We used 80% of this data for training the LSTM network and tested it on the remaining 20% of the data. The number of epochs varied between 50 and 100. RMSE is used to calculate the accuracy of the LSTM network for both the training and testing phases. By the end of this step, the LSTM network of every datanode is trained and tested on the normal behavior of that datanode.

Finally, three different kinds of insider attacks are simulated in the Hadoop cluster, and the datanode memory behavior when running the Hadoop MapReduce examples is captured again in the same way as mentioned before. The details about these attacks are given in Table III. For demonstration, we infected the datanodes by increasing the thread count allocated for HDFS, allowing the datanode to cache, and increasing the cache report and cache revocation polling times (making the datanodes faster). Also, the datanodes were running two separate scripts in the background: one for sending emails of data and another to modify logs. This data is then used for testing the LSTM network that has been trained using the same 80% of data from the above experiments. The challenge that these attacks pose is that there is no error or change in program behavior with respect to correctness.

B. Results

There are three sets of results in our work.
• The first set of results shows a comparison between the program behavior when the datanodes are not yet compromised and when they are compromised, as shown in Figure 4. Here, the results are shown for one datanode, and each Hadoop MapReduce job is analyzed using four different features of memory.
• The next set of results shows the accuracy of the LSTM network in predicting datanode memory behavior. These results can be observed in Figure 5 and Table IV. Since the range of the input data is between 150 and 7500, RMSE in the range of 1-60 can be considered good prediction.
• The final set of results shows the efficiency of the prediction algorithm when tested on the Hadoop cluster with corrupt or compromised datanodes. These results can be observed in Figure 5 and Table IV. Since the range of the input data is between 150 and 20000, RMSE in the range of 1800-6700 can be considered poor prediction.

C. Analysis

The Hadoop MapReduce examples used in this work represent a variety of workloads. Programs such as Pi, BBP and Pentomino use MapReduce jobs that are compute intensive but use very little memory. The execution time of a single run of these programs is less than 5 seconds. Programs such as Random Write, Sort, Random Text Write and Aggregate Word Count are memory intensive and at the same time represent multi-tasking. While the Random Write and Random Text Write programs are limited to writing data to HDFS, the Sort and Aggregate Word Count programs use more memory because they are both memory and compute intensive tasks. Each of these jobs took less than 2 minutes to complete working on their 1GB of input data. The datanode alternates between these two programs for each run.
Fig. 4: Comparison of Datanode 1 Memory Behavior for MapReduce Examples before and after attack (all 3 attacks are simultaneously staged).

TABLE IV: RMSE Results from LSTM while Predicting Datanodes Memory Mappings

E.No.  Epochs  Stage         Samples  RMSE (DN1)  RMSE (DN2)  RMSE (DN3)
2      50      TRAIN         78       7.49        8.33        0.95
               TEST(normal)  18       4.92        6.69        1.27
               TEST(attack)  57       5586.74     5496.8      6590.52
3      50      TRAIN         78       7.83        5.77        2.58
               TEST(normal)  18       5.78        5.08        1.19
               TEST(attack)  57       5747.34     5578.56     6684.66
4      50      TRAIN         78       6.85        25.67       24.85
               TEST(normal)  18       6.42        56.28       8.74
               TEST(attack)  37       6007.43     5678.18     5977.24
8      50      TRAIN         78       11.87       17.74       5.7
               TEST(normal)  18       12.44       19.40       9.61
               TEST(attack)  22       2063.7      1870.45     1842.91
1+5    50      TRAIN         158      20.93       16.73       19.38
               TEST(normal)  38       15.33       17.12       9.57
               TEST(attack)  20       6180.51     5328.91     5852.19
6+7    100     TRAIN         238      12.49       12.26       13.37
               TEST(normal)  58       11.46       10.71       14.2
               TEST(attack)  27       5972.78     5742.77     5545.03

Finally, the Word Count program, which is very similar to the Sort and Aggregate Word Count programs, shows that the replica datanode behavior is not always similar to one another even if the Hadoop cluster is not under attack. The input for this job was three times that of the other jobs, and each job took almost 8 minutes to complete. This was the most memory intensive job of all the examples used, and the size of the input and output is 6GB combined.
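The gap between the Test(normal) and Test(attack) rows in Table IV is what the two-step decision exploits: a local RMSE threshold check (Section III-B) followed by a variance test across replicas (Section III-C). The sketch below uses an illustrative threshold value, with SciPy's one-way ANOVA standing in for the Matlab tests; the farthest-from-mean pick is a simplification of the Tukey multiple-comparison step.

```python
import numpy as np
from scipy import stats

def rmse(actual, predicted):
    """Equation 2: root mean squared error over a window."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

def local_anomaly(actual, predicted, threshold):
    """Step 1: flag the window when RMSE exceeds the
    application-specific threshold T."""
    return rmse(actual, predicted) > threshold

def consensus_suspect(errors_by_node, alpha=0.05):
    """Step 2 (sketch): one-way ANOVA across the replicas' prediction
    errors for the same window. If the null hypothesis of equal means
    is rejected, report the replica farthest from the overall mean
    (a stand-in for the Tukey multiple-comparison test)."""
    groups = list(errors_by_node.values())
    _, p_value = stats.f_oneway(*groups)
    if p_value >= alpha:
        return None  # replicas agree; no corrupt datanode indicated
    overall = np.mean(np.concatenate(groups))
    return max(errors_by_node,
               key=lambda n: abs(np.mean(errors_by_node[n]) - overall))
```

For example, with two replicas whose errors sit in the normal 1-60 band and one whose errors sit in the attack band, the ANOVA rejects the null hypothesis and the outlying replica is reported.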
Fig. 5: (a) Pi, (b) BBP, (e) Random Text Write + Aggregated Word Count, (f) Random Write + Sort.
Fig. 6: LSTM Root Mean Square Error Loss Convergence
While the LSTM network of every datanode is trained using the data from normal runs of the programs, the testing datasets were different. One testing dataset represents normal behavior and the other represents compromised behavior. As shown in Figure 6, the convergence in loss plateaus after 50/100 epochs. This shows the efficiency of our LSTM model during training. The values in the test results differ by a huge margin, as shown in the Test(normal) and Test(attack) rows of every MapReduce program in Table IV. It can be observed that the RMSE increased from a small value in the range of 1-60 to a much bigger value in the range of 1800-6700. That huge difference in the RMSE can be attributed to the type of attacks that were introduced: since the attacks improved the memory performance of the datanodes, they made the datanodes use more memory.

Though it is difficult to come up with a fixed number or range in RMSE difference to denote compromised behavior, the idea here is to have an application-specific threshold, T, that represents the acceptable range of deviation between the training and testing phases in order to predict an attack successfully. The attacks used in our experiments are designed to explicitly show such a huge impact.

Since the LSTM time-series prediction model is a statistical predictor, it might be insufficient for certain security techniques that demand 100% accuracy. Also, the proposed technique succumbs to the base-rate fallacy phenomenon.

V. CONCLUSIONS

In this paper, we proposed a method to predict (and subsequently detect) an internal data attack on a Hadoop cluster by analyzing individual datanode memory behavior using LSTM recurrent neural networks. The core of our idea is to model the memory mappings of a datanode using multiple features and represent that data as a time series such that a recurrent neural network such as LSTM can predict program-level memory requirements. When the actual memory usage differs from the predicted requirement by a huge margin, the datanodes in a big data cluster share that information and come to a consensus regarding the situation. We demonstrated the efficiency of our method by testing it on a Hadoop cluster using a set of open MapReduce benchmarks and infecting the cluster using predefined attacks.

For future work, we would like to use more variations of LSTM and other recurrent neural networks and test the methods on a broader set of industry datasets.

REFERENCES

[1] Ivanov, Anton, Denis Makrushin, Jornt Van Der Wiel, Maria Garnaeva, and Yuri Namestnikov. "Kaspersky Security Bulletin 2015. Overall statistics for 2015." Securelist. AO Kaspersky Lab, 15 Dec. 2015. Web. 22 Jan. 2017.
[2] White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
[3] Zaharia, Matei, et al. "Spark: cluster computing with working sets." Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. Vol. 10. 2010.
[4] Aditham, Santosh, and Nagarajan Ranganathan. "A novel framework for mitigating insider attacks in big data systems." Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2015.
[5] Debar, Herve, Monique Becker, and Didier Siboni. "A neural network component for an intrusion detection system." Research in Security and Privacy, 1992. Proceedings., 1992 IEEE Computer Society Symposium on. IEEE, 1992.
[6] Mukkamala, Srinivas, Guadalupe Janoski, and Andrew Sung. "Intrusion detection using neural networks and support vector machines." Neural Networks, 2002. IJCNN'02. Proceedings of the 2002 International Joint Conference on. Vol. 2. IEEE, 2002.
[7] Tuor, Aaron, et al. "Deep Learning for Unsupervised Insider Threat Detection in Structured Cybersecurity Data Streams." (2017).
[8] Schultz, E. Eugene. "A framework for understanding and predicting insider attacks." Computers & Security 21.6 (2002): 526-531.
[9] Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." ICML. 2015.
[10] Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. "Speech recognition with deep recurrent neural networks." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
[11] Malhotra, Pankaj, et al. "Long short term memory networks for anomaly detection in time series." Proceedings. Presses universitaires de Louvain, 2015.
[12] O'Malley, Owen, et al. "Hadoop security design." Yahoo, Inc., Tech. Rep. (2009).
[13] Neuman, B. Clifford, and Theodore Ts'o. "Kerberos: An authentication service for computer networks." Communications Magazine, IEEE 32.9 (1994): 33-38.
[14] Hu, Daming, et al. "Research on Hadoop Identity Authentication Based on Improved Kerberos Protocol." (2015).
[15] Gaddam, Ajit. "Data Security in Hadoop." (2015). 20 Feb. 2015. Web. 10 Jan. 2016.
[16] Barham, Paul, et al. "Magpie: Online Modelling and Performance-aware Systems." HotOS. 2003.
[17] Fonseca, Rodrigo, et al. "X-trace: A pervasive network tracing framework." Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation. USENIX Association, 2007.
[18] Mace, Jonathan, et al. "Retro: Targeted Resource Management in Multi-tenant Distributed Systems." NSDI. 2015.
[19] "Apache HTrace." The Apache Software Foundation, 13 July 2016. Web. 19 Jan. 2017.
[20] Guo, Zhenyu, et al. "G2: A Graph Processing System for Diagnosing Distributed Systems." USENIX Annual Technical Conference. 2011.
[21] Mace, Jonathan, Ryan Roelke, and Rodrigo Fonseca. "Pivot tracing: Dynamic causal monitoring for distributed systems." Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 2015.
[22] Erlingsson, Úlfar, et al. "Fay: extensible distributed tracing from kernels to clusters." ACM Transactions on Computer Systems (TOCS) 30.4 (2012): 13.
[23] Liao, Cong, and Anna Squicciarini. "Towards provenance-based anomaly detection in MapReduce." Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on. IEEE, 2015.
[24] Wei, Wei, et al. "Exploring opportunities for non-volatile memories in big data applications." Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware. Springer International Publishing, 2014.
[25] Dimitrov, Martin, et al. "Memory system characterization of big data workloads." Big Data, 2013 IEEE International Conference on. IEEE, 2013.
[26] Xu, Lijie, Jie Liu, and Jun Wei. "Fmem: A fine-grained memory estimator for MapReduce jobs." Proceedings of the 10th International Conference on Autonomic Computing (ICAC 13). 2013.
[27] Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989): 359-366.
[28] Joulin, Armand, and Tomas Mikolov. "Inferring algorithmic patterns with stack-augmented recurrent nets." Advances in Neural Information Processing Systems. 2015.
[29] Hecht-Nielsen, Robert. "Theory of the backpropagation neural network." Neural Networks, 1989. IJCNN., International Joint Conference on. IEEE, 1989.
[30] Schuster, Mike, and Kuldip K. Paliwal. "Bidirectional recurrent neural networks." IEEE Transactions on Signal Processing 45.11 (1997): 2673-2681.
[31] Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. "Learning to forget: Continual prediction with LSTM." Neural Computation 12.10 (2000): 2451-2471.
[32] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.
[33] Kingma, Diederik, and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
[34] Tieleman, Tijmen, and Geoffrey Hinton. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4.2 (2012).
[35] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." Journal of Machine Learning Research 15.1 (2014): 1929-1958.
[36] Olah, Christopher. "Understanding LSTM Networks." Colah's Blog. GitHub, 27 Aug. 2015. Web. 18 Jan. 2017.
[37] Pellegrini, Alessandro, Pierangelo Di Sanzo, and Dimiter R. Avresky. "A Machine Learning-based Framework for Building Application Failure Prediction Models." Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International. IEEE, 2015.
[38] Murray, Joseph F., Gordon F. Hughes, and Kenneth Kreutz-Delgado. "Machine learning methods for predicting failures in hard drives: A multiple-instance application." Journal of Machine Learning Research 6.May (2005): 783-816.