You are on page 1of 12

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.Doi Number

Cyber Security Threats detection in Internet of


Things using Deep Learning approach
1,2 1 3 4
FARHAN ULLAH , HAMAD NAEEM , SOHAIL JABBAR , SHEHZAD KHALID , MUHAMMAD AHSAN
5 6 7
LATIF , FADI AL-TURJMAN , AND LEONARDO MOSTARDA .
1
College of Computer Science, Sichuan University, Chengdu 610065, China (e-mail: farhankhan.cs@yahoo.com, hamadnaeemh@yahoo.com)
2
Department of Computer Science, COMSATS University Islamabad, Sahiwal Campus, Pakistan
3
Department of Computer Science, National Textile University, Faisalabad, Pakistan (e-mail: sjabbar.research@gmail.com)
4
Department of Computer Engineering, Bahria University, Islamabad, Pakistan (e-mail: shehzad_khalid@hotmail.com)
5
Department of Computer Science, University of Agriculture, Faisalabad, Pakistan (e-mail: mahsanlatif@uaf.edu.pk)
6
Antalya Bilim University, Turkey (e-mail: alturjman@outlook.com)
7
Computer Science Department, Camerino University, Italy (e-mail: leonardo.mostarda@unicam.it)

Corresponding author: Sohail Jabbar (sjabbar.reserarch@gmail.com)

ABSTRACT The IoT (Internet of Things) connect systems, applications, data storage, and services that may
be a new gateway for cyber-attacks as they constantly offer services in the organization. Currently, software
piracy and malware attacks are high risks to compromise the security of IoT. These threats may steal
important information that causes economic and reputational damages. In this paper, we have proposed a
combined deep learning approach to detect the pirated software and malware infected files across the IoT
network. The TensorFlow deep neural network is proposed to identify pirated software using source code
plagiarism. The tokenization and weighting feature methods are used to filter the noisy data and further, to
zoom the importance of each token in terms of source code plagiarism. Then, the deep learning approach is
used to detect source code plagiarism. The dataset is collected from Google Code Jam (GCJ) to investigate
software piracy. Apart from this, the deep convolutional neural network is used to detect malicious infections
in IoT network through color image visualization. The malware samples are obtained from Maling dataset
for experimentation. The experimental results indicate that the classification performance of the proposed
solution to measure the cyber security threats in IoT are better than the state of the art methods.

INDEX TERMS Internet of Things, Data Mining, Cyber Security, Software Piracy, Malware detection

I. INTRODUCTION that may be shared using cloud infrastructure. The IoT enabled
The Internet of Things (IoT) is the interconnection of physical technologies can be used to develop smart cities, education
moving objects “Things” through internet embedded with an system, e-shopping, e-banking, maintain our health, manage
electronic chip, sensors, and other forms of hardware. Each industry, and to entertain and protect human beings [3].
device is uniquely identified globally by Radio Frequency
Identifier (RFID) tags. These smart objects communicate with The IoT devices can be used for an open attack due to always
available on the network. The industrial IoT-cloud can be easily
Other connected nodes and can be monitored and controlled targeted by malware infection and pirated software for harmful
remotely [1]. IoT offers pervasive connectivity to a usage and to compromise security [4, 5]. The software piracy is
widespread range of smart physical objects, service industries, the development of software by reusing source codes illegally
cloud computing services, and applications. IBM stated that from someone else’s work and disguise as the original version.
the number of connected devices through the internet is The cracker may copy the logic of the original software by
expected to increase up to 50 billion by 2020 [2]. It will reverse engineering procedures and then design the same logic
increase the number of communication networks with the in another type of source codes [6].. t is a severe threat to internet
connection of smart objects as well as the amount of big data security which gives access to unlimited downloads of pirated

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

software, open source codes and, promotes and advertises of string from executables. The byte sequence technique is a static
pirated versions. It rapidly increases each year and gives huge based analysis that can be used to extract n-bytes sequences
economic loss to the software industry [7]. The Business from these patterns. The functional call graph is a static analysis
Software Alliance (BSA) 2016 report stated that the public method used to extract the structural analysis of codes [16].
software piracy ratio is approximately 39%, which results in A. Challenges:
business damages up to 52.2$ billion every year. Many types of
Software Piracy: Currently, every third installed software
research have shown that every software contains plagiarized
application is pirated. The intellectual digital property and
source codes in the context of logic with a range of 5% to 20%
authorship rights of software are not physical due to globally
[8, 9]. The intelligent software plagiarism techniques are
accessible on the internet and thus, hard to sustain [17]. The
required to catch the plagiarized source code in pirated software.
attacker may crack the original software and re-design the logic
Several plagiarism detection techniques are proposed, i.e., clone
into another type of programming language. It is really difficult
detection, source code similarity identification, software bugs
to catch the crackers’ malicious activities in cross-domain
analysis, and software birthmark investigation [10, 11]. These
source codes because each programming language has different
techniques mainly are structure and text-based analysis. The
syntax and sematic structures. There are some tools available
structure-based technique examines the basic structure of source
that can translate one type of source code to another type, i.e.,
codes, syntax trees, graph behavior, and function call graph of
MoHCA-Java [6, 18]. The widely available open source
sub routines. Because of this, it is limited to a specific
projects made it easy for the cracker to copy the original idea
programming language structure. So, if a cracker reuses the
and design their software. One cannot get paid by his ideas, but
logic of the original software to another type of programming
for the ability to offer the solutions into real-world.
language, then, it is hard to catch due to different structure
behavior. The industrial IoT cloud services may be used to Malware detection: The conventional methods may solve
secure and protect smart devices by designing intelligent code obfuscation concerns, but high computational cost is
software plagiarism and malware detection techniques. needed regarding texture feature mining using malware
visualizations. These types of feature extraction procedures do
Currently, malicious attacks are more easily abrupt due to the not perform well with large malware data analysis. Currently,
growing number of IoT networks. The malware attacks are malware is constantly generating, update, and manipulate that
usually planned to infect the privacy of IoT nodes, computer makes the detection more challenging [19, 20]. The proposed
systems, and smartphones over the internet. Several scanning malware detection method tries to response to the following
techniques are proposed to detect windows based malware by queries:
using specific signatures. The malware identification analysis is • How to identify malware with reduced overhead?
divided into two main methods, i.e., static and dynamic based • How to extract malware features with less computational
analysis. The dynamic analysis learns the patterns of malware cost?
files while executing code in a real-time virtual environment. • How to process big malware datasets to get better
Several types of research have shown that malicious behavior accuracy?
can be observed by function calls, function parameters analysis, In this paper, we proposed a combined deep learning approach
information flow, instruction traces, and visual analysis of to identify the pirated and Malware attacks on industrial IoT
codes. There are automated online tools available that can be cloud. The TensorFlow deep neural network is designed to
used to investigate the dynamic behavior of malicious codes. capture the pirated software using source code plagiarism.
i.e., CW Sandbox, Anubis, and TT analyzer. This method is Further, the deep convolution neural network is designed to
more time consuming due to monitoring every dynamic capture the malicious patterns of malware through binary
behavior of source codes [12-14]. The static malware analysis visualization. The combined solutions of the proposed
methods do not need the real-time execution of source codes. It approach are much promising in terms of classification
may be used to capture the format information of malware performance. The remaining paper as organized follows.
executable binaries. The signature-based malware identification Section II contains a literature review, section III includes the
techniques are static based, i.e., control flow graph, opcode proposed methodology, the results and discussions are given
frequency, n-gram, and string signature. The disassembling in section IV, and finally, the conclusion is given in section V.
tools, i.e., IDA Pro and OllyDbg [15], are applied to uncover the
II. LITERATURE REVIEW
executables before implementing static based algorithms. These For efficient threat detection in the IoT environment, more
disassemblers are used to extract the hidden patterns from binary studies have been conducted on malware detection and
executables. Then, these patterns are used to retrieve encoded
2

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

software plagiarism detection to achieve high Identification them based on machine learning. They classified a dataset
performance, and reduce time cost. consists of 1444 worm files, and 1330 benign files and results
Several studies have observed the impact of different aspects showed that classification accuracy of 96% [28] proposed an
of software plagiarism detection. Most of the previous work is approach based on N-Opcode sequences and applied machine
complete on a single programming language. It means that, if learning using SVM classifier. They attained an accuracy of
a cracker alters the logic of the source code to other data 98% in malware detection. Opcodes are the machine code
structure in the similar programming language, then the mnemonics, whereas the cosine similarity and critical
present literature is valuable. In [21], the software benchmark instruction sequence techniques were used for the malware
was used to detect the threat in java source codes. It selected identification process. Results showed that every version of
the control flow information to capture the structural similar malware family share some common core signatures
characteristics from the running source codes. The and can be captured using core API calls sequences. It [29]
benchmarks of two source codes were compared to compute proposed an obfuscation scheme to check the limitations of
the similarity. In [22], the author selected a hybrid approach static analysis approaches. The experimental results showed
to extract the similarity features from intermediate code that the static analysis independently is not sufficient for
generation based on compiler infrastructure. effective analysis of malware samples. Static analysis
Additionally, the unsupervised learning approach was used to approaches can easily be avoided if the malware is packed or
measure similarity in source codes. The similar functionalities obfuscated. Therefore, some robust behavioral features are
of different source codes were used to detect plagiarism. In required in the analysis. Such types of malware detection
[23], the author proposed a Source Forager search engine that approaches are usually called dynamic analysis approaches.
was used to extract like functionalities between C and Dynamic analysis approaches are usually based on the
C++code in response to user questions. The developed search execution of binary samples in a controlled environment to
engine retrieved different properties from code and proceeds extract features within a virtual machine. In [30], the author
in the shape of k functionalities from the corpus. The logic was proposed a technique to automate the manual process of
the conceptual structure of the program, and it could detect malware analysis. They identify the malicious behavior within
software similarity. In [24], the logic-based approach was the program, which was not previously shown by a benign
used to compute the behavior of dissimilarity in two source application. This approach gives a short amount of
codes by the same inputs. If there was no dissimilarity, then it information regarding malicious behavior. Therefore, it can
originated in plagiarism problem. The symbolic execution and provide significant insights for understanding malware. In
preconditions reasoning was used to capture the semantics for [31], the author proposed a useful clustering based approach
dissimilarities from execution paths. In [25], the Latent which can automatically scale a huge amount of malware
Semantic Analysis (LSA) was used to identify plagiarism in samples into group/classes of clusters based on their execution
students' assignments. The LSA was combined with PlaGate behavior. They also extended the Anubis system for additional
to inspect the semantic resemblance between source codes network analysis and tainted tracking to generate automated
documents. Apart from this, the author described how truce report for the selected malware samples. Their extended
different code chunks were important concerning semantics. system was able to characterize different activities of the
A syntax tree from any source code was drawn based on the program more abstractly. To overcome the issues of
parse tree. It could be used to compare different source codes computation time and limitations of static approaches, the
based on their syntax tree. In [26], the parse tree kernel hybrid approaches are introduced to improve malware
method was used to extract similar text between java source detection systems. In [32] collectively used both static and
codes. This process did not provide improved outcome dynamic features extracted from malware samples to train a
because of the irregular variations of nodes in core malware classifier and named their approach OPEM. They
functionalities. The fingerprinting method was designed to used frequency of occurrence of operational codes as static
extract the resemblance between sources. features while execution traces of executable files, system
Several studies have conducted static, dynamic, hybrid, and calls, exceptions as dynamic features to train their
visualization analysis for malware detection in the past. The classification model. The results showed that the hybrid
detailed summary of each type of analysis is presented. Static approach performs much better as a combined approach rather
analysis techniques usually consist of features extraction by than running of static and dynamic approaches separately.
static means from binary files using binary data extraction Many studies have been directed for visualization of malware
tools. In [27], uses a sequence of variable length instructions features to improve the performance of classification results as
and classify worms from the binaries of benign files to classify well as a reduction in time, size, and resource overhead. For
3

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

example, [33] introduced a CNN and image-based malware because it generates exponential values using color images.
classification method. Their model achieved 98.52% The deep learning algorithms outperform with big malware
accuracy. They randomly separated only 10% samples in a datasets as these types of methods can use filters to decrease
family for testing. [34] designed a deep learning model for noise automatically. Thus, the use of color images generates
malware detection. Their model obtained 98% accuracy for better results using deep learning techniques.
9339 samples. In [35], the author proposed a CNN based The conversion of malware binary file to color image
malware classification method. Their approach acquired comprises of four phases. First, the hexadecimal strings (0-15)
94.5% accuracy. are produced from raw binary files. Second, a hexadecimal
stream is divided into a chunk of the 8-bit vector where each
III. PROPOSED METHODOLOGY FOR CYBER 8-bit segment is measured as an unsigned integer (0-255).
SECURITY THREATS AND PROTECTION MEASURES IN Third, the 8-bit vector is subsequently converted into a two-
INDUSTRIAL IOT USING BIG DATA
dimensional matrix space. Fourth, each 8-bit integer generates
from two-dimensional space is plotted with of red, green, and
In this study, we propose an architecture model for cyber
blue shaded colors. Fig.1 presents the complete phases of data
security threats and protection measures in industrial IoT, as
preprocessing section.
shown in Fig. 1. Four databases are deployed in cloud storage
of the proposed architecture model. The database one stores
2) DEEP CONVOLUTIONAL NEURAL NETWORK
the raw network traffic; the database two contains a list of
The Deep Convolutional Neural Network (DCNN) is
previous datasets, and the database three stores the new
proposed to conduct in-depth malware data analysis. The
signatures of detected malware attacks. The cracker stores
DCNN contains five modules, as shown in Figure 1. The input
pirated software in database 4 by IoT devices. A massive
layer is used to receive training images for the designed neural
amount of data processing needs a high time cost. The raw
network model. First, the convolution layer is used to decrease
data files in database one are passed through the pre-
the noise and give better signal characteristics. Second, the
processing data module. The pre-processing data is further
pooling layer is used to decrease the data overhead preserving
sent to the detection module. The detection module captures
useful information. Third, the fully connected layer is used to
malware and pirate software attacks by learning from the
convert the two-dimensional array into the one-dimensional
signatures in databases two and four. If any malicious activity
and then input it to the specific classifier. Fourth, the malware
is observed in the network, then warns the system
families from respective images are identified by using the
administrator to take suitable action.
classifier.
A. MALWARE THREAT DETECTION 3) CONVOLUTION LAYER
1) DATA PREPROCESSING The meaningful features are extracted by reducing the image
The color images are generated from raw binary files to of parameters using convolution layer. Convolution layer, i.e.,
transform the malware detection problem using image interpretation invariance, rotation invariance, and scale
classification problem. It differentiates the proposed research invariance. It decreases the over-fitting problem and gives the
from state of the art methods [19, 35, 36], in which malware generalization concept to the main architecture. The input of
binary files convert into a grayscale image with 256 colors. the convolutional layer is several maps, as shown in equation
This method does not depend on any reverse engineering tool 1 [37].
such as disassembler and decompile. The color images can
retrieve better features as compared to grayscale images with æ ö
only 256 colors. x lj = f ç å x lj-1 * kijl + blj ÷
ç iÎM ÷
Further, the better features of malware images can outperform è j ø (2)
in the classification of malware families. Before, numerous
malware detection methods based on machine learning
M kl
Where j denotes the cluster of given maps; ij defines
algorithms [19, 20] offered better outcomes using grayscale convolution kernel, used for joining the ith input feature map
images. The color images are transformed into grayscale
bl
visualization, and then feature extraction techniques were used with jth output feature map; j is the bias consistent to the ith
to classify malware type. The classification performance is feature map, and is the activation function.
improved using feature reduction methods to decrease the 4) POOLING LAYER
features’ set. The results showed that machine learning The sub-sampling layer term is generally used for pooling
algorithms are not a better choice for malware detection layer which offers two modes of pooling, i.e., maximum and

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

average pooling. It is not disturbed by backward propagation preprocessing methods are used to break the source codes in
and used to minimize the consequence of image distortion. It small pieces for deep analysis. It converts the codes into
also decreases the features’ factor and increases the proposed meaningful information and removes noisy data. The data is
DCNN functioning, as shown in equation 2. cleaned used from undesirable information, i.e., special

(
x lj = f down ( x lj-1 ) + blj
(2)
) symbol, constants, stop words. Then, the tokenization process
is used to transform the cleaned data into useful tokens. The
Where down (.) performs a pooling task, and b represents stemming, root words and frequency constraints are used to
bias value. mine more useful features in the preprocessing phase. Then,
Fully connected layer the weighting techniques are used to zoom the contribution of
each token. The TFIDF and Logarithm of Term Frequency
The fully connected layer is used to classify the output of
(LogTF) are used in the weighting phase [38, 39].
pooling layer. Every neuron links with the preceding neuron
Mathematically TFIDF is defined as in Equation 4.
with a corresponding fully connected layer. This layer is used
to improves the model generalization competency by
decreasing the overfitting issue. tfidf (t,d,D) = tf (t,d) × idf (t,D)
(4)
Learning
The malware samples are accurately categorized with the f
Where t denotes token, denotes the number of frequency,
respective family name. In this research, we use Softmax-
d denote every individual document, and D denotes all
Cross-Entropy loss to train the proposed DCNN model. The
documents used in the dataset.
Loss for the training data k is shown in equation 3.
2) DEEP LEARNING WITH TENSORFLOW
exp( f zt )
Loss = - log( ) FRAMEWOKR
å exp( f zt ) The TensorFlow is a machine learning system which is used
k (3) for high-level computations in a complex environment. We
Where fzt is the rank for the kth class; and fzt is the score for can implement different types of machine and deep learning
the correct family. The parameters of the model are learned algorithms by calling specific Application Programing
using Adam optimizer, which tries to minimize the loss Interface (API) of TensorFlow. It has different types of layers
incurred on the training data. which can be configured for complex computations, training
B. SOFTWARE PIRACY THREAT DETECTION the data, and supervise the state-run of each function [40, 41].
The main goal of the proposed deep learning methodology is The in-depth learning approach is designed to identify similar
to catch the pirated software from different types of source source codes in different types of programing languages using
codes. A deep learning methodology is designed to detect TensorFlow framework. Then, the extracted similar codes are
plagiarism in different types of source codes. The plagiarized used to identify the pirated software. The weighting values are
version of the software is the pirated copy in which cracker used as input to the deep learning model. The dense layer also
used the logic of the original software, as shown in Figure 1. called fully connected layer, which is configured for input and
First, the source codes are tokenized in preprocessing steps to output data. There are three dense layers configured with 100,
reduce the dimensions of the data and extract meaningful 50, and 30 neurons, respectively. The first layer is used to
features for next step. The TensorFlow framework with Keras receive the data with an input variable with an input shape
API deep learning algorithm is applied on extracted parameter. Every neuron is receiving information from
meaningful features to catch the plagiarism between different previously layers and thus densely connected. The 4th dense
types of source codes. The dataset is collected from GCJ, layer is used for the output variable to target the plagiarized
which includes 400 different source codes documents from code.
100 programmers. The dataset is collected from Google Code The deep learning approach is enhanced using drop out layer
Jam (GCJ) database. in the context of activation and loss function, optimizer, and
1) PRE-PROCESSING AND FEATURE learning error rate. The overfitting problem is also solved
EXTRACTION using dropout layer. The rectifier (ReLU) activation method is
The detection of pirated software is among different types of used for input variables to get the patterns of received data
source codes is a challenging task because each source code [42]. Mathematically, it is expressed as the positive chunk of
has different syntax and semantic structures. We used software its argument given in equation 5.
plagiarism methods to identify source code similarity. The f (x) = x + = max(0,x)
(5)
5

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

the discrete adaptive learning rates for each limitation [44, 45].
Where x denotes the input to the equivalent neurons. The The decaying means of pas squared gradients are shown in
sigmoid is an effective logistic method used to grip the multi- Equation 7 and 8.
class problem [43]. It is defined mathematically as in equation mt = β1mt−1 + (1− β1 )gt (7)
6.
1 vt = β 2 vt−1 + (1− β 2 )g 2 (8)
S(x) =
1+ e − x (6) m v
where t and t are the predictable means of the first and
Where S represents the sigmoid function. The Adam second instant gradients, respectively.
optimizer also called stochastic descent gradient, is used in
compiling and optimizing deep learning model. It calculates

FIGURE 1: Architecture model for Cyber Security Threats prediction in IoT


The software plagiarism measure may be used to
investigate the code similarity in pirated software. We
took source code dataset from Google Code Jam (GCJ) to
IV. RESULTS AND DISCUSSIONS analyze the proposed methodology for software piracy.
We investigate the proposed research in case study 1 and case First, the dataset is preprocessed to get the useful tokens
study 2. for each source with frequency details. The preprocessing
A. PERFORMANCE EVALUATION OF CASE STUDY process includes stemming, root word, maximum and
1 minimum token’s length, maximum and minimum
token’s frequency, etc. Second, features selection and

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

extraction techniques, i.e., Term Frequency and Inverse


Document Frequency (TFIDF) and Logarithm Term
Frequency (LogTF) are used to retrieve the tokens’
weighting. The weighting methods are used to zoom the
contribution of each token in the document or across all
documents. The box plot of weighting values is shown in
figure 2. The P1, P2, P3, and P4 on x-axis denote the
programming solutions of four solved questions,
respectively. The weighting values of each source code
are given on the y-axis. Each programmer solved four
different programming problems, so; we have four input
variables in the first dense layer. The second and third are
configured as hidden layers to improve accuracy. The

dropout layer is configured with input and
each hidden layer to overcome the overfitting problem. FIGURE 2: Box plot of source codes weighting values

The dynamic visualization of accuracy, validation


accuracy, loss, and validation loss in percentage measure B. PERFORMANCE EVALUATION OF CASE STUDY
are given in Figure 2. In the upper panel, the blue curve 2
shows the loss, and the green curve shows validation loss.
Initially, both curves start from same 0.7 points, and up to We measured the effect of varying malware image ratios
35 epochs behave the same, but after the blue curve on the classification performance of the proposed
fluctuating on the green curve. As both curves decreasing malware detection method. These image ratios were
so, it has fewer chances of overfitting problem. If these 224×224 and 229×229. We selected 14,733 malware and
curves are increasing or behave opposite, then overfitting 2486 benign samples from the Leopard Mobile dataset1.
occurs. The comparison is made of the proposed approach After experimentation, we decided that 229×229 ratio was
with another state of the art techniques, as shown in Table superior to 224×224 ratio regarding classification
1. performance. However, a significant difference between
229×229 and 224×224 image sizes was classification
TABLE 1: COMPARISON WITH STATE OF THE ART PIRACY accuracy. Thus, we concluded that 229×229 image ratio
DETECTION TECHNIQUES is a more appropriate choice for the proposed malware
detection method. For 224×224 and 229×229 image
Proposed Source Accurac
Technique ratios, the dynamic graph for accuracy, validated
by Code y (%)
Bandara, Java KNN 86% accuracy, loss, and validated loss are shown in Figure 3
Wijayrathna
[46]
and 4. To demonstrate the role of image ratio in
Cosma, G. Java LSA 99% classification performance, Table 2 shows the brief
and M. Joy comparison of classification results of different image
[25]
Son, J.-W Java Parse Tree 90% ratios. 229×229 image ratio obtained 97.46% testing
[26] accuracy with 36s computational time for the Leopard
Ullah, F [47] C++, Java LSA 80%
Mobile dataset. The confusion matrix for 229×229 image
Ullah, F [6] C++, Java, MLR 86% ratio is shown in Figure 5.
Python
Our C++, Java, Deep Neural 96% We also compared the performance of proposed deep
Proposed Python Network learning-based malware detection with existing machine
Approach
learning-based malware detection researches
[35],[48]and[47] on the Leopard Mobile dataset. These
approaches used traditional image feature extraction
descriptors, namely, GIST, LBP, and CLGM descriptors
with SVM classifier. The maximum classification

1
https://sites.google.com/site/nckuikm/home
7

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

accuracy obtained by the proposed approach was 97.46%,


GIST + SVM was 86.1%, LBP + SVM

was 78.05%, and CLGM + SVM was 92.06%


respectively. The F-measure is a weighted average of
precision and recall. It is the harmonic mean of precision
and recall. The highest F-measure obtained by the
proposed method was 97.44%, GIST + SVM was 85.82%,
LBP + SVM was 77.49%, and CLGM + SVM was
91.98% respectively.
Table 3 shows that the proposed malware detection
method achieves better results against three existing
machine learning-based works in terms of classification
performance.

V. CONCLUSION
The industrial IoT based network is rapidly growing in the
coming future. The detection of software piracy and
malware threats are the main challenges in the field of
cybersecurity using IoT-based big data. We proposed a
combined deep learning based approach for the
identification of pirated and malware files. First, the
TensorFlow neural network is proposed to detect the
pirated feature of original software using software
plagiarism. We collected 100 programmers’ source codes
files from GCJ to investigate the proposed approach. The
source code is preprocessed to clean from noise and to
capture further the high-quality features which include
useful tokens. Then, TFIDF and LogTF weighting
techniques are used to zoom the contribution of each
token in terms of source code similarity. The weighting
values are then used as input to the designed deep learning
approach. Secondly, we proposed a novel methodology
based on convolution neural network and color image
visualization to detect malware using IoT. We have
converted the malware files into color images to get better
malware visualized features. Then, we passed these
visualized features of malware into deep convolution
neural network. The experimental results show that the
combined approach retrieve maximum classification
results as compared to the state of the art techniques.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access


FIGURE 3. The dynamic plot of Loss, Validation Loss, Accuracy and Validation accuracy for source codes

TABLE 2. COMPARISON OF CLASSIFICATION PERFORMANCE OF THE PROPOSED MALWARE DETECTION METHOD FOR DIFFERENT
IMAGE RATIO

Image Ratio Precision (%) Recall (%) F-measure(%) Accuracy (%) Time (s)

224×224 95.16 95.10 95.13 95.10 20s

229×229 97.43 97.46 97.44 97.46 36s

Note: NS=Not specified, CA=Classification accuracy, PR=Precision rate, RR=Recall rate


FIGURE 4. Dynamic graph of accuracy validated accuracy and loss and validated loss (Image Ratio: 224*224)

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

FIGURE 5. Dynamic graph of accuracy validated accuracy and loss and validated loss (Image Ratio: 229*229)

FIGURE 6. Confusion matrix for 229*229 Image Ratio

TABLE 3. THE COMPARISION OF CLASSIFICATION PERFORMANCE AMONG FORMER METHODS AND PROPOSED MODEL

Works Year Samples Technique F-measure (%) CA (%)


(Malware/Benign)
GIST+SVM [17] 2011 4000/2000 ML 85.82 86.1

LBP+SVM [46] 2018 4000/2000 ML 77.49 78.05


CLGM+SVM [47] 2019 4000/2000 ML 91.98 92.06
Our Proposed - 14,733/2486 DL 97.44 97.46
Approach

10

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

Proceedings of the 10th Innovations in Software Engineering


Conference. 2017. ACM.
REFERENCES
23. Kashyap, V., et al., Source forager: a search engine for similar
source code. arXiv preprint arXiv:1706.02769, 2017.
1. Srinivasan, C., et al. A review on the different types of internet of
24. Zhang, F., et al. Program logic based software plagiarism
things (IoT). Journal of Advanced Research in Dynamical and
detection. in 2014 IEEE 25th International Symposium on
Control Systems, 2019. 11(1): p. 154-158.
Software Reliability Engineering. 2014. IEEE.
2. Joyia, G.J., et al., Internet of Medical Things (IOMT):
25. Cosma, G. and M. Joy, An approach to source-code plagiarism
applications, benefits and future challenges in healthcare
detection and investigation using latent semantic analysis. IEEE
domain. J Commun, 2017. 12(4): p. 240-7.
transactions on computers, 2012. 61(3): p. 379-394.
3. Zanella, A., et al., Internet of things for smart cities. IEEE
26. Son, J.-W., et al., An application for plagiarized source code
Internet of Things journal, 2014. 1(1): p. 22-32.
detection based on a parse tree kernel. Engineering Applications
4. Karbab, E.B., et al., Android malware detection using deep
of Artificial Intelligence, 2013. 26(8): p. 1911-1918.
learning on api method sequences. arXiv preprint
27. Siddiqui, M., M.C. Wang, and J. Lee, Detecting internet worms
arXiv:1712.08996, 2017.
using data mining techniques. Journal of Systemics, Cybernetics
5. Jabbar, S., et al., A methodology of real-time data fusion for
and Informatics, 2009. 6(6): p. 48-53.
localized big data analytics. IEEE Access, 2018. 6: p. 24510-
28. Kang, B., et al. N-opcode analysis for android malware
24520.
classification and categorization. in 2016 International
6. Ullah, F., et al., Software plagiarism detection in
Conference On Cyber Security And Protection Of Digital
multiprogramming languages using machine learning
Services (Cyber Security). 2016. IEEE.
approach. Concurrency and Computation: Practice and
29. Moser, A., C. Kruegel, and E. Kirda. Limits of static analysis for
Experience, 2018: p. e5000.
malware detection. in Twenty-Third Annual Computer Security
7. Chae, D.-K., et al. Software plagiarism detection: a graph-based
Applications Conference (ACSAC 2007). 2007. IEEE.
approach. in Proceedings of the 22nd ACM international
30. Christodorescu, M., S. Jha, and C. Kruegel. Mining
conference on Information & Knowledge Management. 2013.
specifications of malicious behavior. in Proceedings of the the
ACM.
6th joint meeting of the European software engineering
8. Akbulut, Y. and O. Dönmez, Predictors of digital piracy among
conference and the ACM SIGSOFT symposium on The
Turkish undergraduate students. Telematics and Informatics,
foundations of software engineering. 2007. ACM.
2018. 35(5): p. 1324-1334.
31. Bayer, U., et al. Scalable, behavior-based malware clustering. in
9. ShanmughaSundaram, M. and S. Subramani, A measurement of
NDSS. 2009. Citeseer.
similarity to identify identical code clones. Int Arab J Inform
32. Santos, I., J. Nieves, and P.G. Bringas. Semi-supervised learning
Technol, 2015. 12: p. 735-740.
for unknown malware detection. in International Symposium on
10. Ragkhitwetsagul, C. Measuring code similarity in large-scaled
Distributed Computing and Artificial Intelligence. 2011.
code Corpora. in 2016 IEEE International Conference on
Springer.
Software Maintenance and Evolution (ICSME). 2016. IEEE.
33. Kalash, M., et al. Malware classification with deep
11. Imran, S., et al., An enhanced framework for extrinsic plagiarism
convolutional neural networks. in 2018 9th IFIP International
avoidance for research article. Technical Journal, 2018. 23(01):
Conference on New Technologies, Mobility and Security
p. 84-92.
(NTMS). 2018. IEEE.
12. Shabtai, A., et al., Detection of malicious code by applying
34. Kumar, R., et al. Malicious Code Detection based on Image
machine learning classifiers on static features: A state-of-the-art
Processing Using Deep Learning. in Proceedings of the 2018
survey. information security technical report, 2009. 14(1): p. 16-
International Conference on Computing and Artificial
29.
Intelligence. 2018. ACM.
13. Egele, M., et al., A survey on automated dynamic malware-
35. Cui, Z., et al., Detection of malicious code variants based on
analysis techniques and tools. ACM computing surveys
deep learning. IEEE Transactions on Industrial Informatics,
(CSUR), 2012. 44(2): p. 6.
2018. 14(7): p. 3187-3196.
14. Ghafir, I., et al., Security threats to critical infrastructure: the
36. Xiaofang, B., et al. Malware variant detection using similarity
human factor. The Journal of Supercomputing, 2018: p. 1-17.
search over content fingerprint. in The 26th Chinese Control and
15. Raz, I., Introduction to Reverse Engineering. Cs. tau. ac. il, 2011.
Decision Conference (2014 CCDC). 2014. IEEE.
16. Gandotra, E., D. Bansal, and S. Sofat, Malware analysis and
37. Bouvrie, J., Notes on convolutional neural networks. 2006.
classification: A survey. Journal of Information Security, 2014.
38. Paik, J.H. A novel TF-IDF weighting scheme for effective
5(02): p. 56.
ranking. in Proceedings of the 36th international ACM SIGIR
17. Moore, A., Intellectual property and information control:
conference on Research and development in information
philosophic foundations and contemporary issues. 2017:
retrieval. 2013. ACM.
Routledge.
39. Haddi, E., X. Liu, and Y. Shi, The role of text pre-processing in
18. Malabarba, S., P. Devanbu, and A. Stearns. MoHCA-Java: a tool
sentiment analysis. Procedia Computer Science, 2013. 17: p. 26-
for C++ to Java conversion support. in Proceedings of the 1999
32.
International Conference on Software Engineering (IEEE Cat.
40. Abadi, M., et al. Tensorflow: A system for large-scale machine
No. 99CB37002). 1999. IEEE.
learning. in 12th {USENIX} Symposium on Operating Systems
19. Nataraj, L., et al. Malware images: visualization and automatic
Design and Implementation ({OSDI} 16). 2016.
classification. in Proceedings of the 8th international symposium
41. Baylor, D., et al. Tfx: A tensorflow-based production-scale
on visualization for cyber security. 2011. ACM.
machine learning platform. in Proceedings of the 23rd ACM
20. Makandar, A. and A. Patrot. Malware class recognition using
SIGKDD International Conference on Knowledge Discovery
image processing techniques. in 2017 International Conference
and Data Mining. 2017. ACM.
on Data Management, Analytics and Innovation (ICDMAI).
42. Agostinelli, F., et al., Learning activation functions to improve
2017. IEEE.
deep neural networks. arXiv preprint arXiv:1412.6830, 2014.
21. Lim, H.-i., et al., A method for detecting the theft of Java
43. Elfwing, S., E. Uchibe, and K. Doya, Sigmoid-weighted linear
programs through analysis of the control flow information.
units for neural network function approximation in
Information and Software Technology, 2009. 51(9): p. 1338-
reinforcement learning. Neural Networks, 2018. 107: p. 3-11.
1350.
44. Tato, A. and R. Nkambou, Improving adam optimizer. 2018.
22. Yasaswi, J., et al. Unsupervised learning based approach for
45. Zhang, Z. Improved Adam Optimizer for Deep Neural Networks.
plagiarism detection in programming assignments. in
in 2018 IEEE/ACM 26th International Symposium on Quality of
Service (IWQoS). 2018. IEEE.
11

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2937347, IEEE
Access

46. Bandara, U. and G. Wijayrathna, Detection of source code


plagiarism using machine learning approach. Int J Comput
Theory Eng, 2012. 4(5): p. 674.
47. Ullah, F., et al., Plagiarism detection in students’ programming
assignments based on semantics: multimedia e-learning based
smart assessment methodology. Multimedia Tools and
Applications, 2018: p. 1-18.
48. Cui, Z., et al., Malicious code detection based on CNNs and
multi-objective algorithm. Journal of Parallel and Distributed
Computing, 2019.

12

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.

You might also like