Professional Documents
Culture Documents
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.Doi Number
ABSTRACT The IoT (Internet of Things) connect systems, applications, data storage, and services that may
be a new gateway for cyber-attacks as they constantly offer services in the organization. Currently, software
piracy and malware attacks are high risks to compromise the security of IoT. These threats may steal
important information that causes economic and reputational damages. In this paper, we have proposed a
combined deep learning approach to detect the pirated software and malware infected files across the IoT
network. The TensorFlow deep neural network is proposed to identify pirated software using source code
plagiarism. The tokenization and weighting feature methods are used to filter the noisy data and further, to
zoom the importance of each token in terms of source code plagiarism. Then, the deep learning approach is
used to detect source code plagiarism. The dataset is collected from Google Code Jam (GCJ) to investigate
software piracy. Apart from this, the deep convolutional neural network is used to detect malicious infections
in IoT network through color image visualization. The malware samples are obtained from Maling dataset
for experimentation. The experimental results indicate that the classification performance of the proposed
solution to measure the cyber security threats in IoT are better than the state of the art methods.
INDEX TERMS Internet of Things, Data Mining, Cyber Security, Software Piracy, Malware detection
I. INTRODUCTION that may be shared using cloud infrastructure. The IoT enabled
The Internet of Things (IoT) is the interconnection of physical technologies can be used to develop smart cities, education
moving objects “Things” through internet embedded with an system, e-shopping, e-banking, maintain our health, manage
electronic chip, sensors, and other forms of hardware. Each industry, and to entertain and protect human beings [3].
device is uniquely identified globally by Radio Frequency
Identifier (RFID) tags. These smart objects communicate with The IoT devices can be used for an open attack due to always
available on the network. The industrial IoT-cloud can be easily
Other connected nodes and can be monitored and controlled targeted by malware infection and pirated software for harmful
remotely [1]. IoT offers pervasive connectivity to a usage and to compromise security [4, 5]. The software piracy is
widespread range of smart physical objects, service industries, the development of software by reusing source codes illegally
cloud computing services, and applications. IBM stated that from someone else’s work and disguise as the original version.
the number of connected devices through the internet is The cracker may copy the logic of the original software by
expected to increase up to 50 billion by 2020 [2]. It will reverse engineering procedures and then design the same logic
increase the number of communication networks with the in another type of source codes [6].. t is a severe threat to internet
connection of smart objects as well as the amount of big data security which gives access to unlimited downloads of pirated
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
software, open source codes and, promotes and advertises of string from executables. The byte sequence technique is a static
pirated versions. It rapidly increases each year and gives huge based analysis that can be used to extract n-bytes sequences
economic loss to the software industry [7]. The Business from these patterns. The functional call graph is a static analysis
Software Alliance (BSA) 2016 report stated that the public method used to extract the structural analysis of codes [16].
software piracy ratio is approximately 39%, which results in A. Challenges:
business damages up to 52.2$ billion every year. Many types of
Software Piracy: Currently, every third installed software
research have shown that every software contains plagiarized
application is pirated. The intellectual digital property and
source codes in the context of logic with a range of 5% to 20%
authorship rights of software are not physical due to globally
[8, 9]. The intelligent software plagiarism techniques are
accessible on the internet and thus, hard to sustain [17]. The
required to catch the plagiarized source code in pirated software.
attacker may crack the original software and re-design the logic
Several plagiarism detection techniques are proposed, i.e., clone
into another type of programming language. It is really difficult
detection, source code similarity identification, software bugs
to catch the crackers’ malicious activities in cross-domain
analysis, and software birthmark investigation [10, 11]. These
source codes because each programming language has different
techniques mainly are structure and text-based analysis. The
syntax and sematic structures. There are some tools available
structure-based technique examines the basic structure of source
that can translate one type of source code to another type, i.e.,
codes, syntax trees, graph behavior, and function call graph of
MoHCA-Java [6, 18]. The widely available open source
sub routines. Because of this, it is limited to a specific
projects made it easy for the cracker to copy the original idea
programming language structure. So, if a cracker reuses the
and design their software. One cannot get paid by his ideas, but
logic of the original software to another type of programming
for the ability to offer the solutions into real-world.
language, then, it is hard to catch due to different structure
behavior. The industrial IoT cloud services may be used to Malware detection: The conventional methods may solve
secure and protect smart devices by designing intelligent code obfuscation concerns, but high computational cost is
software plagiarism and malware detection techniques. needed regarding texture feature mining using malware
visualizations. These types of feature extraction procedures do
Currently, malicious attacks are more easily abrupt due to the not perform well with large malware data analysis. Currently,
growing number of IoT networks. The malware attacks are malware is constantly generating, update, and manipulate that
usually planned to infect the privacy of IoT nodes, computer makes the detection more challenging [19, 20]. The proposed
systems, and smartphones over the internet. Several scanning malware detection method tries to response to the following
techniques are proposed to detect windows based malware by queries:
using specific signatures. The malware identification analysis is • How to identify malware with reduced overhead?
divided into two main methods, i.e., static and dynamic based • How to extract malware features with less computational
analysis. The dynamic analysis learns the patterns of malware cost?
files while executing code in a real-time virtual environment. • How to process big malware datasets to get better
Several types of research have shown that malicious behavior accuracy?
can be observed by function calls, function parameters analysis, In this paper, we proposed a combined deep learning approach
information flow, instruction traces, and visual analysis of to identify the pirated and Malware attacks on industrial IoT
codes. There are automated online tools available that can be cloud. The TensorFlow deep neural network is designed to
used to investigate the dynamic behavior of malicious codes. capture the pirated software using source code plagiarism.
i.e., CW Sandbox, Anubis, and TT analyzer. This method is Further, the deep convolution neural network is designed to
more time consuming due to monitoring every dynamic capture the malicious patterns of malware through binary
behavior of source codes [12-14]. The static malware analysis visualization. The combined solutions of the proposed
methods do not need the real-time execution of source codes. It approach are much promising in terms of classification
may be used to capture the format information of malware performance. The remaining paper as organized follows.
executable binaries. The signature-based malware identification Section II contains a literature review, section III includes the
techniques are static based, i.e., control flow graph, opcode proposed methodology, the results and discussions are given
frequency, n-gram, and string signature. The disassembling in section IV, and finally, the conclusion is given in section V.
tools, i.e., IDA Pro and OllyDbg [15], are applied to uncover the
II. LITERATURE REVIEW
executables before implementing static based algorithms. These For efficient threat detection in the IoT environment, more
disassemblers are used to extract the hidden patterns from binary studies have been conducted on malware detection and
executables. Then, these patterns are used to retrieve encoded
2
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
software plagiarism detection to achieve high Identification them based on machine learning. They classified a dataset
performance, and reduce time cost. consists of 1444 worm files, and 1330 benign files and results
Several studies have observed the impact of different aspects showed that classification accuracy of 96% [28] proposed an
of software plagiarism detection. Most of the previous work is approach based on N-Opcode sequences and applied machine
complete on a single programming language. It means that, if learning using SVM classifier. They attained an accuracy of
a cracker alters the logic of the source code to other data 98% in malware detection. Opcodes are the machine code
structure in the similar programming language, then the mnemonics, whereas the cosine similarity and critical
present literature is valuable. In [21], the software benchmark instruction sequence techniques were used for the malware
was used to detect the threat in java source codes. It selected identification process. Results showed that every version of
the control flow information to capture the structural similar malware family share some common core signatures
characteristics from the running source codes. The and can be captured using core API calls sequences. It [29]
benchmarks of two source codes were compared to compute proposed an obfuscation scheme to check the limitations of
the similarity. In [22], the author selected a hybrid approach static analysis approaches. The experimental results showed
to extract the similarity features from intermediate code that the static analysis independently is not sufficient for
generation based on compiler infrastructure. effective analysis of malware samples. Static analysis
Additionally, the unsupervised learning approach was used to approaches can easily be avoided if the malware is packed or
measure similarity in source codes. The similar functionalities obfuscated. Therefore, some robust behavioral features are
of different source codes were used to detect plagiarism. In required in the analysis. Such types of malware detection
[23], the author proposed a Source Forager search engine that approaches are usually called dynamic analysis approaches.
was used to extract like functionalities between C and Dynamic analysis approaches are usually based on the
C++code in response to user questions. The developed search execution of binary samples in a controlled environment to
engine retrieved different properties from code and proceeds extract features within a virtual machine. In [30], the author
in the shape of k functionalities from the corpus. The logic was proposed a technique to automate the manual process of
the conceptual structure of the program, and it could detect malware analysis. They identify the malicious behavior within
software similarity. In [24], the logic-based approach was the program, which was not previously shown by a benign
used to compute the behavior of dissimilarity in two source application. This approach gives a short amount of
codes by the same inputs. If there was no dissimilarity, then it information regarding malicious behavior. Therefore, it can
originated in plagiarism problem. The symbolic execution and provide significant insights for understanding malware. In
preconditions reasoning was used to capture the semantics for [31], the author proposed a useful clustering based approach
dissimilarities from execution paths. In [25], the Latent which can automatically scale a huge amount of malware
Semantic Analysis (LSA) was used to identify plagiarism in samples into group/classes of clusters based on their execution
students' assignments. The LSA was combined with PlaGate behavior. They also extended the Anubis system for additional
to inspect the semantic resemblance between source codes network analysis and tainted tracking to generate automated
documents. Apart from this, the author described how truce report for the selected malware samples. Their extended
different code chunks were important concerning semantics. system was able to characterize different activities of the
A syntax tree from any source code was drawn based on the program more abstractly. To overcome the issues of
parse tree. It could be used to compare different source codes computation time and limitations of static approaches, the
based on their syntax tree. In [26], the parse tree kernel hybrid approaches are introduced to improve malware
method was used to extract similar text between java source detection systems. In [32] collectively used both static and
codes. This process did not provide improved outcome dynamic features extracted from malware samples to train a
because of the irregular variations of nodes in core malware classifier and named their approach OPEM. They
functionalities. The fingerprinting method was designed to used frequency of occurrence of operational codes as static
extract the resemblance between sources. features while execution traces of executable files, system
Several studies have conducted static, dynamic, hybrid, and calls, exceptions as dynamic features to train their
visualization analysis for malware detection in the past. The classification model. The results showed that the hybrid
detailed summary of each type of analysis is presented. Static approach performs much better as a combined approach rather
analysis techniques usually consist of features extraction by than running of static and dynamic approaches separately.
static means from binary files using binary data extraction Many studies have been directed for visualization of malware
tools. In [27], uses a sequence of variable length instructions features to improve the performance of classification results as
and classify worms from the binaries of benign files to classify well as a reduction in time, size, and resource overhead. For
3
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
example, [33] introduced a CNN and image-based malware because it generates exponential values using color images.
classification method. Their model achieved 98.52% The deep learning algorithms outperform with big malware
accuracy. They randomly separated only 10% samples in a datasets as these types of methods can use filters to decrease
family for testing. [34] designed a deep learning model for noise automatically. Thus, the use of color images generates
malware detection. Their model obtained 98% accuracy for better results using deep learning techniques.
9339 samples. In [35], the author proposed a CNN based The conversion of malware binary file to color image
malware classification method. Their approach acquired comprises of four phases. First, the hexadecimal strings (0-15)
94.5% accuracy. are produced from raw binary files. Second, a hexadecimal
stream is divided into a chunk of the 8-bit vector where each
III. PROPOSED METHODOLOGY FOR CYBER 8-bit segment is measured as an unsigned integer (0-255).
SECURITY THREATS AND PROTECTION MEASURES IN Third, the 8-bit vector is subsequently converted into a two-
INDUSTRIAL IOT USING BIG DATA
dimensional matrix space. Fourth, each 8-bit integer generates
from two-dimensional space is plotted with of red, green, and
In this study, we propose an architecture model for cyber
blue shaded colors. Fig.1 presents the complete phases of data
security threats and protection measures in industrial IoT, as
preprocessing section.
shown in Fig. 1. Four databases are deployed in cloud storage
of the proposed architecture model. The database one stores
2) DEEP CONVOLUTIONAL NEURAL NETWORK
the raw network traffic; the database two contains a list of
The Deep Convolutional Neural Network (DCNN) is
previous datasets, and the database three stores the new
proposed to conduct in-depth malware data analysis. The
signatures of detected malware attacks. The cracker stores
DCNN contains five modules, as shown in Figure 1. The input
pirated software in database 4 by IoT devices. A massive
layer is used to receive training images for the designed neural
amount of data processing needs a high time cost. The raw
network model. First, the convolution layer is used to decrease
data files in database one are passed through the pre-
the noise and give better signal characteristics. Second, the
processing data module. The pre-processing data is further
pooling layer is used to decrease the data overhead preserving
sent to the detection module. The detection module captures
useful information. Third, the fully connected layer is used to
malware and pirate software attacks by learning from the
convert the two-dimensional array into the one-dimensional
signatures in databases two and four. If any malicious activity
and then input it to the specific classifier. Fourth, the malware
is observed in the network, then warns the system
families from respective images are identified by using the
administrator to take suitable action.
classifier.
A. MALWARE THREAT DETECTION 3) CONVOLUTION LAYER
1) DATA PREPROCESSING The meaningful features are extracted by reducing the image
The color images are generated from raw binary files to of parameters using convolution layer. Convolution layer, i.e.,
transform the malware detection problem using image interpretation invariance, rotation invariance, and scale
classification problem. It differentiates the proposed research invariance. It decreases the over-fitting problem and gives the
from state of the art methods [19, 35, 36], in which malware generalization concept to the main architecture. The input of
binary files convert into a grayscale image with 256 colors. the convolutional layer is several maps, as shown in equation
This method does not depend on any reverse engineering tool 1 [37].
such as disassembler and decompile. The color images can
retrieve better features as compared to grayscale images with æ ö
only 256 colors. x lj = f ç å x lj-1 * kijl + blj ÷
ç iÎM ÷
Further, the better features of malware images can outperform è j ø (2)
in the classification of malware families. Before, numerous
malware detection methods based on machine learning
M kl
Where j denotes the cluster of given maps; ij defines
algorithms [19, 20] offered better outcomes using grayscale convolution kernel, used for joining the ith input feature map
images. The color images are transformed into grayscale
bl
visualization, and then feature extraction techniques were used with jth output feature map; j is the bias consistent to the ith
to classify malware type. The classification performance is feature map, and is the activation function.
improved using feature reduction methods to decrease the 4) POOLING LAYER
features’ set. The results showed that machine learning The sub-sampling layer term is generally used for pooling
algorithms are not a better choice for malware detection layer which offers two modes of pooling, i.e., maximum and
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
average pooling. It is not disturbed by backward propagation preprocessing methods are used to break the source codes in
and used to minimize the consequence of image distortion. It small pieces for deep analysis. It converts the codes into
also decreases the features’ factor and increases the proposed meaningful information and removes noisy data. The data is
DCNN functioning, as shown in equation 2. cleaned used from undesirable information, i.e., special
(
x lj = f down ( x lj-1 ) + blj
(2)
) symbol, constants, stop words. Then, the tokenization process
is used to transform the cleaned data into useful tokens. The
Where down (.) performs a pooling task, and b represents stemming, root words and frequency constraints are used to
bias value. mine more useful features in the preprocessing phase. Then,
Fully connected layer the weighting techniques are used to zoom the contribution of
each token. The TFIDF and Logarithm of Term Frequency
The fully connected layer is used to classify the output of
(LogTF) are used in the weighting phase [38, 39].
pooling layer. Every neuron links with the preceding neuron
Mathematically TFIDF is defined as in Equation 4.
with a corresponding fully connected layer. This layer is used
to improves the model generalization competency by
decreasing the overfitting issue. tfidf (t,d,D) = tf (t,d) × idf (t,D)
(4)
Learning
The malware samples are accurately categorized with the f
Where t denotes token, denotes the number of frequency,
respective family name. In this research, we use Softmax-
d denote every individual document, and D denotes all
Cross-Entropy loss to train the proposed DCNN model. The
documents used in the dataset.
Loss for the training data k is shown in equation 3.
2) DEEP LEARNING WITH TENSORFLOW
exp( f zt )
Loss = - log( ) FRAMEWOKR
å exp( f zt ) The TensorFlow is a machine learning system which is used
k (3) for high-level computations in a complex environment. We
Where fzt is the rank for the kth class; and fzt is the score for can implement different types of machine and deep learning
the correct family. The parameters of the model are learned algorithms by calling specific Application Programing
using Adam optimizer, which tries to minimize the loss Interface (API) of TensorFlow. It has different types of layers
incurred on the training data. which can be configured for complex computations, training
B. SOFTWARE PIRACY THREAT DETECTION the data, and supervise the state-run of each function [40, 41].
The main goal of the proposed deep learning methodology is The in-depth learning approach is designed to identify similar
to catch the pirated software from different types of source source codes in different types of programing languages using
codes. A deep learning methodology is designed to detect TensorFlow framework. Then, the extracted similar codes are
plagiarism in different types of source codes. The plagiarized used to identify the pirated software. The weighting values are
version of the software is the pirated copy in which cracker used as input to the deep learning model. The dense layer also
used the logic of the original software, as shown in Figure 1. called fully connected layer, which is configured for input and
First, the source codes are tokenized in preprocessing steps to output data. There are three dense layers configured with 100,
reduce the dimensions of the data and extract meaningful 50, and 30 neurons, respectively. The first layer is used to
features for next step. The TensorFlow framework with Keras receive the data with an input variable with an input shape
API deep learning algorithm is applied on extracted parameter. Every neuron is receiving information from
meaningful features to catch the plagiarism between different previously layers and thus densely connected. The 4th dense
types of source codes. The dataset is collected from GCJ, layer is used for the output variable to target the plagiarized
which includes 400 different source codes documents from code.
100 programmers. The dataset is collected from Google Code The deep learning approach is enhanced using drop out layer
Jam (GCJ) database. in the context of activation and loss function, optimizer, and
1) PRE-PROCESSING AND FEATURE learning error rate. The overfitting problem is also solved
EXTRACTION using dropout layer. The rectifier (ReLU) activation method is
The detection of pirated software is among different types of used for input variables to get the patterns of received data
source codes is a challenging task because each source code [42]. Mathematically, it is expressed as the positive chunk of
has different syntax and semantic structures. We used software its argument given in equation 5.
plagiarism methods to identify source code similarity. The f (x) = x + = max(0,x)
(5)
5
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
the discrete adaptive learning rates for each limitation [44, 45].
Where x denotes the input to the equivalent neurons. The The decaying means of pas squared gradients are shown in
sigmoid is an effective logistic method used to grip the multi- Equation 7 and 8.
class problem [43]. It is defined mathematically as in equation mt = β1mt−1 + (1− β1 )gt (7)
6.
1 vt = β 2 vt−1 + (1− β 2 )g 2 (8)
S(x) =
1+ e − x (6) m v
where t and t are the predictable means of the first and
Where S represents the sigmoid function. The Adam second instant gradients, respectively.
optimizer also called stochastic descent gradient, is used in
compiling and optimizing deep learning model. It calculates
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
1
https://sites.google.com/site/nckuikm/home
7
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
V. CONCLUSION
The industrial IoT based network is rapidly growing in the
coming future. The detection of software piracy and
malware threats are the main challenges in the field of
cybersecurity using IoT-based big data. We proposed a
combined deep learning based approach for the
identification of pirated and malware files. First, the
TensorFlow neural network is proposed to detect the
pirated feature of original software using software
plagiarism. We collected 100 programmers’ source codes
files from GCJ to investigate the proposed approach. The
source code is preprocessed to clean from noise and to
capture further the high-quality features which include
useful tokens. Then, TFIDF and LogTF weighting
techniques are used to zoom the contribution of each
token in terms of source code similarity. The weighting
values are then used as input to the designed deep learning
approach. Secondly, we proposed a novel methodology
based on convolution neural network and color image
visualization to detect malware using IoT. We have
converted the malware files into color images to get better
malware visualized features. Then, we passed these
visualized features of malware into deep convolution
neural network. The experimental results show that the
combined approach retrieve maximum classification
results as compared to the state of the art techniques.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
FIGURE 3. The dynamic plot of Loss, Validation Loss, Accuracy and Validation accuracy for source codes
TABLE 2. COMPARISON OF CLASSIFICATION PERFORMANCE OF THE PROPOSED MALWARE DETECTION METHOD FOR DIFFERENT
IMAGE RATIO
Image Ratio Precision (%) Recall (%) F-measure(%) Accuracy (%) Time (s)
FIGURE 4. Dynamic graph of accuracy validated accuracy and loss and validated loss (Image Ratio: 224*224)
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
FIGURE 5. Dynamic graph of accuracy validated accuracy and loss and validated loss (Image Ratio: 229*229)
TABLE 3. THE COMPARISION OF CLASSIFICATION PERFORMANCE AMONG FORMER METHODS AND PROPOSED MODEL
10
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access
12
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.