
International Journal of Machine Learning and Cybernetics

https://doi.org/10.1007/s13042-020-01246-9

ORIGINAL ARTICLE

Clone detection in 5G‑enabled social IoT system using graph semantics and deep learning model
Farhan Ullah1 · Muhammad Rashid Naeem2 · Leonardo Mostarda3 · Syed Aziz Shah4

Received: 29 July 2020 / Accepted: 19 November 2020


© Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
The protection and privacy of the 5G-IoT framework is a major challenge due to the vast number of mobile devices. Specialized applications running on these 5G-IoT systems may be vulnerable to clone attacks. Applications can be cloned by stealing and redistributing commercial Android apps, which harms the advanced services of the 5G-IoT framework. Meanwhile, most Android app stores run and manage the Android apps that developers have submitted separately, without any central verification system. Android scammers sell pirated versions of commercial software to other app stores under different names. Android applications are typically stored on cloud servers, and API access services may be used to detect cloned applications and prevent them from being released. In this paper, we propose a hybrid approach that combines the Control Flow Graph (CFG) with a deep learning model to secure the smart services of the 5G-IoT framework. First, the newly submitted APK file is extracted and the Jdex decompiler is used to retrieve Java source files from possibly original and cloned applications. Second, the source files are broken down into various android-based components. After generating Control-Flow Graphs (CFGs), the weighted features are extracted from each component. Finally, a Recurrent Neural Network (RNN) is designed to predict potential cloned applications by training on features from different components of android applications. Experimental results show that the proposed approach achieves an average accuracy of 96.24% for cloned applications selected from different android application stores.

Keywords 5G IoT · Clone detection · Control flow graph · Deep learning · Security · Privacy

* Farhan Ullah
  farhankhan.cs@yahoo.com
  Muhammad Rashid Naeem
  rashidnaeem717@yahoo.com
  Leonardo Mostarda
  leonardo.mostarda@unicam.it
  Syed Aziz Shah
  ad5190@coventry.ac.uk

1 School of Software, Northwestern Polytechnical University, Xi’an 710072, Shaanxi, People’s Republic of China
2 School of Artificial Intelligence, Leshan Normal University, Leshan 614000, People’s Republic of China
3 Computer Science Department, Camerino University, Camerino 62032, Italy
4 Mobile Health Centre for Intelligent Healthcare, Coventry University, Coventry CV15FB, UK

1 Introduction

There is an overwhelming number of 5G-IoT networks due to recent developments in electronics and communication methods (e.g., portable electronics, IoT devices, and 5G telecommunications services solutions). These developments have improved the quality of public services, including healthcare, shipping, transport, etc. In recent years, the protection of these systems has become increasingly important due to the rapid emergence of 5G-IoT systems and the fact that the advancement of AI has made them more autonomous and intelligent, with growing amounts of personal data being produced and communicated using these modern 5G-IoT systems. Many of these emerging security problems cannot be resolved by existing security frameworks to protect privacy. This has created significant challenges for existing 5G-IoT network architectures, which must protect the privacy of a growing number of mobile devices, servers, and the vast volume of data transmitted in real-time. As a result, the development of a secured 5G-IoT network is


Fig. 1  A chunk of cloned applications. a–c APK Pure (com.jiuzhangtech.hangman). d Google Market (br.com.passeionaweb)

required with new security approaches to handle the security and privacy of mobile applications. Android applications can be used over 5G-IoT devices [1–3]. Code components are usually reused by copying and pasting them into a different section of a program with minimal or no changes. This reproduced code is called a code clone, and this procedure is called code cloning [4]. Roy et al. [5] state that cloning is valuable but, in certain ways, it can also be detrimental. For instance, if a bug is found in one block of code, then all fragments similar to it must be examined for the same bug. Cloning may also raise maintenance costs for applications. Given these high maintenance costs, the identification of software clones is an important area of study. Many approaches and tools have been developed for the detection of code clones. In particular, text-based, token-based [6], tree-based [7, 8], semantic [9], and hybrid approaches [10] are used to detect code clones. In the last few years, the revenue of mobile devices has grown considerably. Android has a growing market share in smartphones, and recently its mobile devices have been activated at a rate of 850,000 per day. While Android OS offers the primary mobile experience, a large part of the user experience depends on applications from third parties.

Android has a range of platforms where users can download applications from third-party networks that allow easy access. Users need to be protected from scammers who seek to take advantage of a real developer's personal effort. Android applications are made available on the authorized app store or on other third-party marketplaces [11, 12]. A stable business climate for developers must be established in order to sustain application development. A paid application can be cracked and republished, and a free application can be altered so that its advertising revenue goes to the scammer. A scammer can substitute the project's user details, modify an existing software catalogue, and insert a new revenue catalogue. Open source Android assists scammers to clone and resubmit apps to the market place. In contrast to the Android mobile market, apps in the Apple App Store must undergo an approval process. A system that checks the latest applications (see footnotes 1 and 2) has recently been launched for the Android store.

Figure 1 illustrates four projects that are closely related in menu system and code, but that different developers have published in various phone marketplaces [13]. The result shows that the code of all apps overlaps, indicating that at least three of them are clones. As the prominence of Android apps grows, the number of applications has increased rapidly. Consequently, it is necessary to protect developers' intellectual property and revenue streams [14, 15].

The logic flow of the programming code can be used to investigate clone attacks in modified versions. In this paper, we use CFG features from source code to capture the logic flow of the code. These features are then given to the designed RNN model for efficient classification of code clones. The main contributions of the paper are:

1. An approach for clone detection in Android-based 5G-IoT systems is proposed.
2. Component-based CFG features are mined from source code using the ANTLR parser.
3. The Term Frequency Inverse Document Frequency (TFIDF) method is used to extract the importance of each feature.
4. A CFG-based RNN model is proposed for component-based clone detection on a large scale.

1 http://stackoverflow.com/questions/5600143/android-game-keeps-getting-hacked.
2 http://googlemobile.blogspot.com/2012/02/android-and-security.html.


The remainder of the paper is organized as follows: Sect. 2 discusses the recent literature, Sect. 3 explains the proposed methodology, Sect. 4 presents the experimental results with discussion, and finally, Sect. 5 concludes the paper.

• Literature review

We characterize the pros and cons of a variety of approaches for clone detection and conclude with the proposed scheme. A number of approaches are used for clone detection, such as tokenization, Abstract Syntax Tree (AST), semantic, and hybrid approaches. Every approach has specific weaknesses and strengths in different circumstances, which are discussed below.

String matching: Techniques based on string matching are used to find similar strings between source codes. A string-based matching algorithm is applied to two programs to identify the Longest Common Sequence (LCS) [16, 17]. This method is robust in string comparisons but fragile against identifier renaming, code reordering, and changes to structural and semantic information. The fingerprinting-based approach generates a fingerprint of each source code document containing statistical information about it, such as the average number of words in a line, number of similar words, number of keywords in the entire document, number of single operands and operators, number of operand and operator occurrences, etc. Two programs are considered similar if they share the same fingerprints. Several studies [18–20] estimate the closeness using a distance formula. Such approaches are fragile to changes in code sequence, structure, and logic flow.

The tokenization method divides the source code into tokens, and these tokens are later compared to compute the similarity between program pairs. Several methods for identifying similar tokens are popular in academia, such as JPlag [21], Measure of Software Similarity (MOSS) [22, 23], Yet Another Plague (YAP) [24], etc. These methods transform the source code into tokens and then use different distance techniques to compare two token-based programs. This approach is more effective than a simple string-based matching system but fails when the order of statements is changed. The tokenization-based plagiarism detection method is robust against identifier renaming, spacing, and alignment, but it is fragile against changes to code sequence and to structural and semantic information. Tree matching based methods compare the abstract view of structural information based on the Abstract Syntax Tree (AST) [25–27]. The AST holds abstract view information that represents variables and function statements. This method converts source code to hierarchical features known as ASTs, which are then used to compare two source codes. The AST captures the abstract syntactic structure but does not provide full execution paths. As a consequence, this approach is fragile to control replacement, data variations, and code insertions.

Semantic-based features are the most effective for accurate clone detection. These are high-quality features composed of graph structures. These graph features capture the logic flow of the source code instead of only the renaming or abstract view of statements. By doing this, the graph-based semantic features can catch any type of clone attack made by a scammer [28–30]. Therefore, the semantic-based approach is significant, as it can extract high-quality features for clone detection. We used the CFG-based RNN approach for the identification of code clones in android applications.

2 Proposed scheme: clone detection in android‑based 5G‑IoT system

The software clone is a serious threat to the android-based 5G-IoT system. An android application can be decompiled to extract the Java source code. After that, some components can be cloned and the application published again under new account credentials. This can cause the original developer a large economic loss.

Figure 2 shows the complete framework of the component-based android clone detection system. 5G-IoT devices can be used to interconnect different android markets via a cloud architecture. Users are free to use 5G-IoT devices to upload and download android applications among different android stores. Because of this, a genuine application can easily be downloaded, cloned, and uploaded to a store. Therefore, an intelligent android clone detection system is required within the cloud architecture. The cloud infrastructure initiates the clone detection system to evaluate whether the submitted android app has been cloned. An instance of the clone detection model is started. The software called “apkExtractor” generates android Dalvik Executable (DEX) files from the packaged application. In this scenario, the scammer is submitting an APK to the app store for publication.

2.1 Component‑based source codes extraction

The Jdex decompiler places the source files in different types of directories, which reflect the android internal component architecture. Sometimes, a scammer targets only specific components for cloning, such as the User Interface (UI), view model, and model source. Android offers a collection of pre-built UI components that enable users to create a GUI for the app, such as structured layout objects and UI controls. Other UI modules, such as dialogues, alerts, or menus, are also supported by Android. The UI contains the libraries and classes which support the GUI (Graphical User


[Figure 2 blocks: APK Extractor; Jdex; Software Clone Detection System; Android component-based source codes (View/UI: Activity/Fragment; ViewModel: LiveData, ViewModel; Model source: Entity, RoomDatabase, DAO); Android app market; developer/cracker; pre-processing (maximum and minimum frequencies, stemming, bag of words); control flow graph; linear CFG features extraction; TFIDF features' weighting extraction (local features + global features); recurrent neural network; component-based clone classification]

Fig. 2  Clone detection in android-based 5G IoT system using CFG and deep learning

Interface) to view, input, and output. Architecture Components support the UI controller with a ViewModel helper class which is responsible for preparing UI data. During configuration changes, ViewModel objects are automatically retained so that the data they contain is immediately available to the next instance of the activity. For instance, when we need to list the application's users, we make a ViewModel responsible for acquiring and maintaining the user list. These classes can be cloned to change the functionalities of the app so that the android store considers it a genuine application. We select component-based source code files instead of processing the whole application at once. The proposed method selects different components of the same directory to perform clone detection in parallel. By doing this, we can focus only on a specific portion of the cloned application and thus save time and resources [28, 31].

2.2 Control flow graph features extraction

The control-flow graph (CFG) contains graph-based features used to convert the source code into logic paths. These paths are then used to show the semantics of the programming code. A CFG uses graph notation to represent all paths that may be traversed during the execution of a program. In the CFG, each node denotes a basic block, such as a statement of code, which may or may not end with a jump. Jumps mark the start and end of a block of statements. Directed edges show the control flow of the programming code. So, the combination of nodes and edges represents the source code structure. Generally, there are two special blocks: the entry block, through which control enters the flow graph, and the exit block, through which control leaves [32, 33]. Figure 3 shows the program execution paths of the corresponding pseudocode. P and S indicate predicate and statement nodes, respectively. T and F indicate the true and false outcomes of a predicate. We used a random graph walker strategy to mine the control and data flow execution paths of each program. A predicate node executes on the basis of its pre-condition statement. For example, S2 is the block of statements controlled by P1: if P1 is true, then S2 executes; otherwise S4 does. Similarly, S3 depends on P2 in the if statement. These are semantic features that show the actual programming flow. It can capture the unique programming style features


Fig. 3  Program flow paths with pseudocode
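The branch structure described for Fig. 3 can be illustrated with a minimal Python sketch. The graph below is a hypothetical reconstruction: only the relations stated in the text (P1 selects S2 or S4, P2 guards S3) come from the paper, while the node `S1`, the `entry` and `exit` labels, and the function names `random_walk`, `all_paths`, and `node_kind` are assumptions made for illustration. It is not the authors' ANTLR-based extractor.

```python
import random

# Hypothetical CFG: each node maps to a list of (successor, edge_label) pairs.
# P1 selects S2 (true) or S4 (false); P2 guards S3. Everything else is assumed.
CFG = {
    "entry": [("S1", None)],
    "S1": [("P1", None)],
    "P1": [("S2", "T"), ("S4", "F")],
    "S2": [("P2", None)],
    "P2": [("S3", "T"), ("exit", "F")],
    "S3": [("exit", None)],
    "S4": [("exit", None)],
    "exit": [],
}


def random_walk(cfg, rng, start="entry", end="exit"):
    """Follow one execution path, choosing a successor at random at each node."""
    path, node = [start], start
    while node != end:
        node, _label = rng.choice(cfg[node])
        path.append(node)
    return path


def all_paths(cfg, node="entry", end="exit"):
    """Enumerate every entry-to-exit path (valid here because the graph is acyclic)."""
    if node == end:
        return [[end]]
    return [[node] + rest
            for succ, _label in cfg[node]
            for rest in all_paths(cfg, succ, end)]


def node_kind(cfg, node):
    """Statement nodes have one exit and predicate nodes have two."""
    exits = len(cfg[node])
    if exits == 2:
        return "predicate"
    return "statement" if exits == 1 else "terminal"


paths = all_paths(CFG)
walk = random_walk(CFG, random.Random(0))
```

Each sampled walk is one of the enumerated paths, so repeated walks with different seeds approximate the set of execution paths without running the program.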

of each author. The CFG deals with three types of nodes in random walking, as follows:

• Statement node: It refers to a node in the control-flow graph of the program which has only one exit.
• Predicate node: It represents a Boolean expression that can be used inside if, while, or for, etc. It refers to a vertex that has two exits in the control-flow graph of the particular programming code.
• Region node: It does not relate to programming statements. It is included to summarize the dependencies of a vertex, and it can be interpreted as a point of entry to a series of statements or a block of code.

We extract component-based CFG features from android source code. Figure 4 shows an example of the CFG of each method used in merge sort. The component-based CFG features are extracted for deep analysis of cloned files. There are three sub-components which together implement merge sort. The node numbers represent the order of execution of each statement or block. The edges indicate the control flow of the code, which is used to extract the semantic features from each component. Due to this, we can analyze the order of statements through CFG features without executing the program. After extracting the component-based features, we analyzed the significance of each feature using the weighting method. These high-quality features can address the following types of clone attacks.

• Modifying the data flow in source code, such as replacing the variables' values.
• Changing the control flow of statements, such as replacing function calls or if statements.
• Code insertion at different places.
• Changing the order of statements within code blocks with semantics modification.
• Addition of redundant statements or variables.

2.3 Features weighting

The CFG features are hierarchical in nature and need to be transformed into linear form for efficient processing. We used a preprocessing method to extract nodes and edges while preserving the order of execution. Noisy data is removed during preprocessing, as it may affect the clone detection process. For instance, stop words, special symbols, etc. do not carry meaningful information for semantic features. We extract nodes and edges together with their frequencies for each component. Next, we used a local and global weighting method to compute the importance of each feature. For instance, some nodes and edges may be more important for clone classification, while some may be noisy. These noisy features can affect classification accuracy. Due to this, we filtered the data for valuable features. The Term Frequency Inverse Document Frequency (TFIDF) is a statistical model that reflects how important a feature is to a single file in a group of files. It has two parts, that is TF for local weight and IDF


Fig. 4  An example: component-based CFG extraction from merge-sort
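The TFIDF weighting of Sect. 2.3 (Eqs. 1–3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each CFG is represented as a flat list of edge strings, n(T) is taken as the total number of features in that list, c(s) is taken as the number of CFGs containing feature s, and the function name `tfidf_weights` and the sample edges are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {feature: weight} dict per CFG, with weight = TF * IDF.

    documents: list of feature lists, one list per linearized CFG.
    """
    n_docs = len(documents)            # N in Eq. 3
    doc_freq = Counter()               # c(s): number of CFGs that contain s (assumed reading)
    for feats in documents:
        doc_freq.update(set(feats))

    weights = []
    for feats in documents:
        counts = Counter(feats)        # cnt(s, T) in Eq. 2
        total = len(feats)             # n(T) in Eq. 2 (assumed: all features in T)
        weights.append({
            s: (cnt / total) * math.log(n_docs / doc_freq[s])
            for s, cnt in counts.items()
        })
    return weights


# Hypothetical linearized CFG edges for three source files.
docs = [
    ["P1->S2", "P1->S4", "S2->P2"],
    ["P1->S2", "S2->P2", "S2->P2", "P2->S3"],
    ["P1->S2", "S4->exit"],
]
w = tfidf_weights(docs)
```

In this toy corpus the edge `P1->S2` occurs in every CFG, so its IDF is log(3/3) = 0 and it receives zero weight, while an edge that occurs in only one CFG keeps a positive weight.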

for global weight. Let T indicate the AST, and for each subtree s there is a weight $w_{s,T}$. The overall estimated weight of a node is the product of the TF and IDF terms, as defined in Eq. 1 [34].

$$w_{s,T} = \mathrm{TF}(s,T) \cdot \mathrm{IDF}(s,T) \tag{1}$$

Mathematically, TF(s, T) and IDF(s, T) are defined in Eqs. 2 and 3. For each subtree s and AST tree T, the weight is calculated by multiplying TF and IDF. The TF is given in Eq. 2.

$$\mathrm{TF}(s,T) = \frac{\mathrm{cnt}(s,T)}{n(T)} \tag{2}$$

where cnt(s, T) is the count of edges or nodes, such that $s \in S_T$, and n(T) is the number of subtrees in the selected portion. $S_T$ is the set of all subtrees used in the corpus. The mathematical form of IDF is given in Eq. 3.

$$\mathrm{IDF}(s,T) = \log\frac{N}{c(s)} \tag{3}$$

where c(s) is the number of CFGs that contain s, and N denotes the total number of CFGs generated from a collection of source code documents. Nodes with equivalent TFIDF values can play an important role in the clone detection process. On the other hand, if a node has a higher weight in one file but a null or lower weight in other files, then it has a small impact on the efficacy of the proposed method.

2.4 Deep learning model

TensorFlow is an open-source library that supports heterogeneous implementations of deep learning algorithms. Several studies [35–37] reported that TensorFlow is in strong demand in the research community for online machine learning and deep learning analysis tools. It has a powerful visualization facility through which we can

Table 1  RNN configuration with input, output, dropout layers, shapes, and number of parameters

Layer       Type      Shape   Parameters
RNN_1       RNN       100     400
dropout_1   Dropout   100     0
RNN_2       RNN       80      8080
dropout_2   Dropout   80      0
RNN_3       RNN       60      4860
dropout_3   Dropout   60      0
RNN_4       RNN       40      2440
dropout_4   Dropout   40      0
RNN_5       RNN       30      1230
dropout_5   Dropout   30      0
RNN_6       RNN       20      620
dropout_6   Dropout   20      0
RNN_7       RNN       10      210
dropout_7   Dropout   10      0
dense_8     RNN       3       33
Total parameters              17,873
Trainable parameters          17,873
Non-trainable parameters      0

demonstrate the effectiveness of a deep learning algorithm, such as an epoch graph for loss, accuracy and error estimation, etc. It delivers fast and efficient performance, and it receives frequent updates with new features as it is supported by Google. It can be deployed on a wide range of scales, from mobile devices to complex installations. We configured a Recurrent Neural Network (RNN) model with seven RNN layers, seven dropout layers, and one dense layer, as shown in Table 1. The first RNN layer is used as the input layer and the last dense layer as the output layer. The intermediate RNN layers are used as hidden layers to train and process the parameters for in-depth analysis. The shape shows the number of neurons in each layer. The Rectified Linear Unit (ReLU) [38] activation function is used in the input and hidden layers. The number of layers, neurons, and dropout layers helps us to fine-tune the deep learning model and to tackle the overfitting problem. The parameters of a deep learning model are usually the connection weights. These parameters are learned during the training stage, so they are tuned by the model itself, and the model needs them when making predictions. A total of 17,873 parameters are trained for the designed approach.

We used the fit function with the number of epochs, x_train, y_train, and a validation split to train the designed RNN model. The training and testing ratio is 80% to 20%. The accuracy and loss of the train and test data points are reported at each epoch. A larger number of epochs gives better training to the deep learning model. Softmax is used to convert K numbers into values within [0,1]. In multiclass neural networks, it is often used to transform the non-normalized output into a predicted class. The softmax function [39] is used in the output layer as we are dealing with a multiclass problem. The standard softmax function $\sigma : \mathbb{R}^K \to \mathbb{R}^K$ is defined in Eq. 4.

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \dots, K \text{ and } z = (z_1, \dots, z_K) \in \mathbb{R}^K \tag{4}$$

It applies the standard exponential function to each input and normalizes the outputs by dividing by the sum of these exponentials. The key goal of the training process is to learn the structure of the data so as to allow correct predictions on unseen data. The ReLU activation function is used in the input and hidden layers of the RNN model. Mathematically, it is defined as the positive part of its argument, as shown in Eq. 5.

$$f(x) = x^{+} = \max(0, x) \tag{5}$$

where x represents the input to the corresponding neuron. This is also known as a ramp function, a unary function on the real numbers whose graph is shaped like a ramp. Model overfitting [40] happens when the model learns detail and noise in the training data to an extent that harms the performance of the deep learning model. To detect overfitting in the experiment, we employed validation metrics such as validation loss and validation accuracy on train and test results. After many epochs, the validation metrics usually stop improving while the training metrics continue to improve. We used three methods to handle the overfitting problem. First, we select an appropriate number of neurons and hidden layers to control learning capacity. Second, we used regularization, which adds a cost to the loss function for large weights. Third, dropout layers are used to drop some features and prevent overfitting of the deep learning model.

3 Experiments and discussions

3.1 Dataset preparation

The dataset is prepared from five cloned applications, namely ES file downloader, hangman, hangman 2, one cleaner, and VPN master, collected from different android markets. Five android stores (Apk Pure, Google Play Store, AppBrain, Aptoide, and Mobile1 Market) are analyzed for cloned applications to prepare the dataset. A varying number of source files is collected from each application: ES file has 400 files and VPN master has 300 files. Similarly, the hangman and hangman 2


Fig. 5  Number of component-based CFG features to train the model
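As a small numerical companion to Sect. 2.4, the sketch below evaluates the ReLU and softmax definitions (Eqs. 4 and 5) and recomputes the parameter counts listed in Table 1. Two points are assumptions rather than statements from the paper: the counts in Table 1 happen to match the fully connected formula inputs × units + units, and the input width of 3 is inferred from the first row (400 = 3 × 100 + 100). The function names are hypothetical, and subtracting the maximum inside `softmax` is a standard numerical-stability step that does not change the result.

```python
import math


def relu(x):
    """Eq. 5: the positive part of the argument."""
    return max(0.0, x)


def softmax(z):
    """Eq. 4: exponentiate each input and divide by the sum of exponentials."""
    m = max(z)                                  # stability shift, result unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def layer_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (assumed formula)."""
    return n_in * n_out + n_out


LAYER_UNITS = [100, 80, 60, 40, 30, 20, 10, 3]  # "Shape" column of Table 1
INPUT_WIDTH = 3                                 # assumption inferred from the first row

widths = [INPUT_WIDTH] + LAYER_UNITS
params = [layer_params(a, b) for a, b in zip(widths, widths[1:])]
total_params = sum(params)

probs = softmax([2.0, 1.0, 0.1])                # three scores, one per clone set
```

Under these assumptions the per-layer counts and their total agree with the Parameters column and the 17,873 total reported in Table 1; dropout layers contribute no parameters.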

have 46 and 84 files, respectively. The last is the one cleaner application, which has 137 source files. After a thorough analysis, we collected 250, 43, 57, 65, and 190 clone files from the five applications. These cloned files are further divided into different components, such as GUI, view model, and room database, according to the android architecture. The hierarchical CFG features are generated from these components for a detailed analysis of sub-cloned files. Figure 5 shows the percentage ratios of component-based CFG features for training the RNN model. The Clone_Sets show the cloned components in each application, and the curve shows the percentage contribution of each component. The difference in percentage indicates that some components have more source code than others and therefore generate more features to train the model. For instance, Clone_set3 and Clone_Set1 are the major components of the two cloned applications, contributing 73.08% and 43.73% of the features, respectively. Thus, these components can affect the classification more than the others. By doing this, we can analyze the impact of each component on the proposed approach.

3.2 Experimental results

Figures 6 and 7 compare the normalized values of different clone sets selected from two android applications. Each boxplot shows the upper quartile and lower quartile values of the clone sets along with the minimum and maximum normalized values. In Fig. 6, the median values of all clones are between 0.80 and 0.81, whereas the lower quartile values are between 0.15 and 0.17. The upper quartile values of the first two clones are equal to 1.3, and the upper quartile value of the third clone is 1.5. The comparison of boxplots shows that the first clone and second clone are very similar to each other. In other words, Clone_set1 and Clone_set2 have similar Java files without any difference. Therefore, the normalized values of both clone sets are the same. The upper quartile of the third clone is slightly higher than that of the first two clones, which shows a minor difference between the Java files of the third clone and the first two clone sets. Thus, by training on the normalized values of the original android application, it is possible to effectively predict possible clone applications on different android markets. It should be noted that some android application scammers add noisy data to original files in order to bypass clone detection systems. For instance, in Fig. 7, the median of the clone sets is between 0.60 and 0.75. However, there is a 3–6% difference between the upper and lower quartile values of each clone. In this paper, the CFG path of each Java file is measured using static analysis to extract cloned information while avoiding the selection of noisy data.

The component-based CFG features are then given to the designed deep learning model. Figure 8 shows the training and testing accuracy for two cloned applications. The performance is visualized over 50 epochs. The blue and orange curves represent the train and test results. The epoch is plotted horizontally, while accuracy is shown vertically for each data point. In ES file downloader, the train and test curves start from 70% and 83%, respectively. Both curves rise to 94% and then run roughly parallel to the horizontal axis with minor changes. The test curve goes down at the 10th epoch, but it soon performs better than the training curve. A larger number of epochs helps to train the designed deep learning model effectively, but after a certain point the model behaves more or less constantly. The overall classification accuracy for the ES file downloader is 94.6%. In VPN Master, the train and test curves start from 65% and 78%, respectively. Both curves rise smoothly to 95% accuracy and then bend towards the horizontal at the 8th epoch. The test curve is slightly above the training curve, but at the 30th epoch it drops to 94.8%. Both curves then remain stable with minor changes at each epoch. The overall classification accuracy for VPN master is 96.8%. This visualization of


Fig. 6  Visualization of three clone sets selected from “ES File Downloader” android application

Fig. 7  Visualization of three clone sets selected from “VPN Master” android application

training and testing curves at each epoch provides a detailed analysis of the effectiveness of the proposed approach.

Figure 9 shows the loss for training and testing data points against each epoch. Again, the blue and orange curves represent the train and test loss data points retrieved from the designed experiment. In ES file downloader, the train curve ranges between 60% and 10%, while the test curve ranges between 55% and 20%. The test curve moves above the training curve after the 5th epoch. The cumulative loss of train and test data points is


Fig. 8  Visualization of accuracy for training and testing data points on different epochs

Fig. 9  Visualization of loss for training and testing data points on different epochs

10% approximately. In VPN master, the train and test curves range between 80–10% and 55–8%, respectively. The cumulative loss for VPN master is 8%, which is less than that of the ES file downloader. These values validate the performance of the proposed approach. The progression of the train and test curves for accuracy and loss reflects the output of clone classification.

For a detailed analysis of the proposed approach, the confusion matrices are extracted for each cloned application, as shown in Fig. 10. Clone_Set denotes the cloned components captured from android applications. The vertical values show the true labels, and the horizontal values denote the predicted labels. The diagonal shows the correct classification for each Clone_Set, and the remaining entries show misclassification. For instance, in ES File downloader, Clone_Set1, Clone_Set2 and Clone_Set3 are classified correctly at 100%, 97%, and 88%, respectively. Similarly, in VPN master, the classification performance of Clone_Set1, Clone_Set2 and


Fig. 10  Confusion matrices for two cloned applications
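
As a minimal sketch of how per-class rates such as those quoted for Fig. 10 can be read off a confusion matrix, the Python snippet below divides each diagonal entry by its row total, assuming rows hold true labels and columns hold predicted labels as described in the text. The function name `per_class_rates` and the counts are hypothetical and are not taken from the paper's experiments.

```python
def per_class_rates(matrix):
    """Return the fraction of each true class that was predicted correctly.

    Rows are true labels and columns are predicted labels, so the
    diagonal entry of each row counts the correct predictions.
    """
    rates = []
    for i, row in enumerate(matrix):
        total = sum(row)
        # Guard against an empty class to avoid division by zero.
        rates.append(row[i] / total if total else 0.0)
    return rates


# Hypothetical counts for three clone sets (NOT the values behind Fig. 10).
example = [
    [50, 0, 0],
    [2, 45, 3],
    [1, 4, 45],
]

rates = per_class_rates(example)
for name, rate in zip(["Clone_Set1", "Clone_Set2", "Clone_Set3"], rates):
    print(f"{name}: {rate:.0%}")
```

Off-diagonal entries in a row show which other clone set absorbed the misclassified samples, which is the information the remaining cells of Fig. 10 convey.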

Table 2  Comparison of the proposed approach with state-of-the-art methods

Method                      Dataset             Precision (%)  Recall (%)  F-Measure (%)  CA (%)
Multi-layer perceptron      ES file downloader  92.1           92.1        91.9           92.1
                            File downloader     95.4           95.5        95.4           95.4
                            Hangman             93.2           93.4        93.2           93.2
                            Hangman 2           95.2           95.4        95             95
                            VPN master          92.8           93          93             93
Random forest               ES file downloader  91.6           90.7        90.5           90.6
                            File downloader     93.2           93.2        93.1           93.1
                            Hangman             92             92.2        92.2           92
                            Hangman 2           93.8           93.6        94             93.8
                            VPN master          90.6           90.8        91.2           90.6
Convolution neural network  ES file downloader  90.4           88.6        87.1           88.5
                            File downloader     91.9           91.6        91.3           91.6
                            Hangman             90.2           89.6        89             89.8
                            Hangman 2           91.4           92          92.4           92
                            VPN master          88.8           89.2        89.2           89.2
Our proposed method         ES file downloader  94.2           94.1        94.6           94.6
                            VPN master          96.4           96.2        96.8           96.8
                            File downloader     96.8           97.2        97.4           97.2
                            Hangman             95.4           94.8        95             95.2
                            Hangman 2           97.2           96.6        97.2           97.4
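
The measures in Table 2 are assumed here to follow their standard definitions; the paper does not spell out the formulas or how they are averaged across clone sets. The sketch below, with hypothetical helper names, computes precision, recall and F-measure from true-positive, false-positive and false-negative counts, and then averages the CA column of the proposed method as listed in Table 2, which gives 96.24%.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F-measure for one class.

    tp, fp, fn are counts of true positives, false positives and
    false negatives; zero denominators yield 0.0 instead of an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# CA (%) of the proposed method, copied from Table 2.
proposed_ca = {
    "ES file downloader": 94.6,
    "VPN master": 96.8,
    "File downloader": 97.2,
    "Hangman": 95.2,
    "Hangman 2": 97.4,
}

# Unweighted mean over the five cloned applications.
mean_ca = sum(proposed_ca.values()) / len(proposed_ca)
print(f"Average CA of the proposed method: {mean_ca:.2f}%")
```

The unweighted mean is used because every application contributes one CA value in Table 2; a sample-weighted mean would need the per-application sample counts, which are not listed in this section.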

Clone_Set3 are 100%, 91%, and 94%, respectively. Clone_Set1 has a higher classification rate than the other components because it has a larger number of features extracted from weighted CFGs.

The proposed approach is compared with state-of-the-art methods, such as Multi-Layer Perceptron (MLP), Random Forest (RF), and Convolution Neural Network (CNN), as shown in Table 2. Each method is analyzed using five cloned applications with four different performance measures: precision, recall, F-measure, and classification accuracy. It can be seen that the proposed approach outperforms the other methods on all classification measures, with an average of 96.24%. The next best approach is MLP, which gives good classification results as compared to RF and CNN. CNN provides


the lowest classification measures because this model was originally developed for image-based classification, while the proposed research mainly targets source code data in textual form. For a comprehensive analysis, we prepared a comparison with previously published works that use abstract syntax structured features, as shown in Table 3. Son et al. [41] designed a parse tree kernel approach to classify Java source code. They used a single type of programming code and obtained an accuracy of 90%. Next, Fu et al. [34] used a weighted AST method to estimate the importance of each hierarchical feature. These weighted values are then used to detect source code plagiarism. They obtained 87.89% classification accuracy on Java files. Our approach achieves a classification accuracy of 96.24%, which indicates that it is more accurate and effective.

Table 3  Comparison of classification performance among our approach and previously published works

Published work    Year  Method                       Classification accuracy (%)
Son et al. [41]   2013  Parse Tree Kernel            90.00
Fu et al. [34]    2017  Weighted AST                 87.89
Guo et al. [42]   2018  Abstract Structured Diagram  84.80
Wang et al. [43]  2019  Graph embedding              89.60
Wang et al. [44]  2020  Abstract Syntax Tree         95.00
Our approach      -     Component-based CFG          96.24

4 Conclusion

Enforcement against software piracy is one of the biggest challenges in the Android community, which impacts android developers in terms of revenue. Using API access services, cloud-hosted android apps can be assessed for clones. Clone detection is automated by choosing multiple components from Android source files. These components are further processed by means of CFG analysis and function weighting. Training and testing of cloned applications are conducted using the RNN deep learning model. The proposed method is investigated with five different android applications that are cloned in various android stores. Empirical findings have shown that the proposed model could effectively predict cloned applications selected from different android stores with an average accuracy of more than 96%. The findings indicate that most of the cloned applications are successfully detected by the proposed method with only a limited overhead of accuracy and loss. Thus, semantic-based graph features effectively contribute to clone detection. Automatic login is a feature commonly used in android applications that allows users to share information from social media applications (Facebook, Google, Twitter, etc.) for sign-up purposes. Conversely, an android scammer can use a clone application to carry out thefts against android users. In the future, we plan to extend our clone detection techniques to identify user-credential cloning attacks in android applications.

References

1. Wang D et al (2018) From IoT to 5G I-IoT: the next generation IoT-based intelligent algorithms and 5G technologies. IEEE Commun Mag 56(10):114–120
2. Al-Turjman F (2019) 5G-enabled devices and smart-spaces in social-IoT: an overview. Fut Gen Comput Syst 92:732–744
3. Al-Turjman F (2019) 5G-enabled devices and smart-spaces in social-IoT: an overview. Fut Gen Comput Syst 92:732–744
4. Ul Ain Q et al (2019) A model-driven approach for token based code clone detection techniques-an introduction to UMLCCD. In: Proceedings of the 2019 8th International Conference on Educational and Information Technology
5. Roy CK, Cordy JR (2007) A survey on software clone detection research. Queen’s School Comput TR 541(115):64–68
6. Basit HA, Jarzabek S (2007) Efficient token based clone detection with flexible tokenization. In: Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering
7. Yu H et al (2019) Neural detection of semantic code clones via tree-based convolution. In: 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE
8. Ullah F, Al-Turjman F, Nayyar A (2020) IoT-based green city architecture using secured and sustainable android services. Environ Technol Innovat 20:101091
9. Gautam P, Saini H (2017) Non-trivial software clone detection using program dependency graph. IJOSSP 8(2):1–24
10. Patil SS et al (2017) Code clone detection using hybrid approach. In: International Journal of Innovative Research and Creative Technology. IJIRCT
11. Zarpelao BB et al (2017) A survey of intrusion detection in Internet of Things. Journal of Network Computer Applications 84:25–37
12. Chahid Y, Benabdellah M, Azizi A (2017) Internet of things security. In: 2017 International Conference on Wireless Technologies, Embedded and Intelligent Systems (WITS). IEEE
13. Zarpelao BB et al (2017) A survey of intrusion detection in Internet of Things. J Netw Comput Appl 84:25–37
14. Su X, Chuah M, Tan G (2012) Smartphone dual defense protection framework: detecting malicious applications in android markets. In: 2012 8th International Conference on Mobile Ad-hoc and Sensor Networks (MSN). IEEE
15. Zhou Y et al (2012) Hey, you, get off of my market: detecting malicious apps in official and alternative android markets. In: NDSS
16. Baker BS (1997) Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J Comput 26(5):1343–1362
17. Ducasse S, Nierstrasz O, Rieger M (2006) On the effectiveness of clone detection by string matching. Journal of Software Maintenance Evolution: Research Practice 18(1):37–58
18. Smith R, Horwitz S (2009) Detecting and measuring similarity in code clones. In: Proceedings of the International workshop on Software Clones (IWSC)
19. Van Rysselberghe F, Demeyer S (2003) Evaluating clone detection techniques. In: Proceedings of the international workshop on evolution of large scale industrial software applications


20. Jan B et al (2019) Deep learning in big data analytics: a comparative study. Comput Electr Eng 75:275–287
21. Rattan D, Bhatia R, Singh M (2013) Software clone detection: A systematic review. Inf Softw Technol 55(7):1165–1199
22. Bowyer KW, Hall LO (1999) Experience using "MOSS" to detect cheating on programming assignments. In: FIE’99 Frontiers in Education. 29th Annual Frontiers in Education Conference. Designing the Future of Science and Engineering Education. Conference Proceedings (IEEE Cat. No. 99CH37011). IEEE
23. Burd E, Bailey J (2002) Evaluating clone detection tools for use during preventative maintenance. In: Proceedings. Second IEEE International Workshop on Source Code Analysis and Manipulation. IEEE
24. Deokate B, Hanchate DB (2016) Software source code plagiarism detection: a survey. Journal of Multidisciplinary Engineering Science Technology 3(1):3747–3750
25. Li L et al (2017) CCLearner: A deep learning-based clone detection approach. In: 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE
26. Lazar F-M, Banias O (2014) Clone detection algorithm based on the abstract syntax tree approach. In: 2014 IEEE 9th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI). IEEE
27. Rahman W et al (2020) Clone Detection on Large Scala Codebases. In: 2020 IEEE 14th International Workshop on Software Clones (IWSC). IEEE
28. Sun X et al (2014) Detecting code reuse in android applications using component-based control flow graph. In: IFIP international information security conference. Springer
29. White M et al (2016) Deep learning code fragments for code clone detection. In: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE
30. Falcón R et al (2009) Rough clustering with partial supervision. In: Rough Set Theory: A True Landmark in Data Analysis. Springer, pp 137–161
31. Wijesiriwardana C, Wimalaratne P (2017) Component-based experimental testbed to facilitate code clone detection research. In: 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS). IEEE
32. Gabel M, Jiang L, Su Z (2008) Scalable detection of semantic clones. In: Proceedings of the 30th international conference on Software engineering
33. Svacina J, Simmons J, Cerny T (2020) Semantic code clone detection for enterprise applications. In: Proceedings of the 35th Annual ACM Symposium on Applied Computing
34. Fu D et al (2017) WASTK: A weighted abstract syntax tree kernel method for source code plagiarism detection. Scientific Programming 2017
35. Abadi M et al (2016) TensorFlow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)
36. Baylor D et al (2017) TFX: A TensorFlow-based production-scale machine learning platform. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM
37. Gulli A, Pal S (2017) Deep Learning with Keras. Packt Publishing Ltd
38. Agostinelli F et al (2014) Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830
39. Sharma S (2017) Activation functions in neural networks. Towards Data Science 6
40. Rice L, Wong E, Kolter JZ (2020) Overfitting in adversarially robust deep learning. arXiv preprint arXiv:2002.11569
41. Son J-W et al (2013) An application for plagiarized source code detection based on a parse tree kernel. Eng Appl Artif Intell 26(8):1911–1918
42. Guo S, Liu J (2018) An Approach to Source Code Plagiarism Detection Based on Abstract Implementation Structure Diagram. In: MATEC Web of Conferences. EDP Sciences
43. Wang C et al (2019) Go-clone: graph-embedding based clone detector for Golang. In: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis
44. Wang W et al (2020) Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
