
FEATURE ARTICLE: IT SECURITY-VULNERABILITY DETECTION

Malware Detection and Classification Based on Graph Convolutional Networks and Function Call Graphs

Hsiang-Yu Chuang, Jiann-Liang Chen, and Yi-Wei Ma, National Taiwan University of Science and Technology, Taipei, 106335, Taiwan

New types and variants of malware are constantly and rapidly being developed. Identifying malware effectively and quickly has become a primary goal of information security analysts. This study proposes a malware detection and classification model that is based on graph convolutional networks and function call graphs. Analyzing the behavior of malware executions in a sandbox yields the call relationships among functions, enabling a graph that represents the behavior of the malware to be constructed. Using the application programming interfaces (APIs) that are called by the software as nodes, the call relationships between APIs as edges, and the underlying semantics of the APIs as node features, the behavior of malware is obtained by subgraph integration. The results show that the accuracy and precision of the detection model are 0.945 and 0.95, respectively, and the accuracy and precision of the classification model are 0.926 and 0.93, respectively. These results are better than those for previously developed methods.

In recent years, following the development of information technology, digital transformation has become necessary for governments and major enterprises. Information technology is indispensable in the development of innovative products and services, establishing new business models, or digitalizing the production and supply chains of chemical plants. However, new cyberthreats have invaded the Internet of Things (IoT) domain from the traditional information environment and become an omnipresent threat. According to the Global Information Security Threat Report published by Fortinet, the number and sophistication of attacks against individuals, enterprise organizations, and critical infrastructure globally increased significantly in the first half of 2021.1 Therefore, enterprises of all sizes are actively addressing cyberattacks in all instances of risk, and they need to be proactive in preventing damage by attackers.

Owing to the impact of COVID-19, many cybercriminals have exploited COVID-19 to spread malicious files and send phishing e-mails that contain COVID-19-related information. Due to the severity of COVID-19, the average user often does not pay attention to the source of the message or the files that are attached to the message, allowing attackers to gain access to their victims’ systems.2 In addition, the digitalization of factories and businesses is a major global trend, and companies are introducing big data and artificial intelligence to enhance their competitiveness. As a result, many devices will transmit more data over the Internet, including, of course, commercially valuable and confidential information. These devices and information will become the target of attackers. Many attacks in recent years have been launched by state-level organizations, often on critical national facilities. According to antivirus test (AV-TEST) statistics that were published in 2022, the number of types of malware that target Windows operating systems has grown steadily since 2018 at 10–25% annually.3

Most current strategies of defense against malware are based on static analysis methods, such as signature matching. The advantage of static analysis is that it covers the entire code of a piece of malware and can be conducted without executing it.


However, some defense strategies that are based on dynamic analysis have also been developed; these use such features as sensitive behavior, critical access to systems, network traffic analysis, and key program monitoring.4 Since dynamic analysis is performed on the actual behavior of malware, it can effectively detect and classify malware variants. Most of these approaches focus on specific types of malware, so developing new and generalized detection methods is essential.

This study proposes an efficient way to analyze the behavior of a Windows portable executable (PE). Graph convolutional networks (GCNs) and function call graphs are used to detect and classify malware. This study identifies and classifies software types using neural networks and sandbox analysis results. Researchers will be able to use this method to classify software types quickly.

The contributions of this article to its field are as follows:

› An approach to classify software types that is based on dynamic analysis is developed. Malware is run in a sandbox, and the function call relationships are analyzed to generate call graphs.
› The Weisfeiler–Lehman graph kernel-based GCN (WL-GCN) model is used for general malware detection. A malware multiclassification task is carried out so that the samples can be identified as malware by model analysis.
› The accuracy that is achieved in this work is around 7% higher than that achieved in previous studies. By enhancing the node features, the call graph can be made to better represent the behavior and patterns of malware.

RELATED WORKS

Malware is an all-embracing term. Any software that acts maliciously or is intended to damage and exploit a programmable device can be called malware. Cyberattackers often use malware to obtain sensitive information with the aim of gaining financial benefit at the expense of the victim. Stolen data are usually credit card information and personal account passwords. In recent years, due to the popularity of cryptocurrencies, attackers have also planted malware on victims’ devices to “mine” them for cryptocurrency. Each type of malware has different objectives, such as stealing information from victims’ accounts, running forced advertisements, and spamming and extorting money. This section introduces recent malware-related research, including signature-based techniques, heuristic techniques, and recent work on machine learning and deep learning techniques.

Signature-Based Mechanism

Much antivirus software relies on file signatures to detect malware. This technique involves reading or scanning a file to determine whether it matches a signature in the signature database. The signature information includes the file size, the function that is used in the sample, data bytes at certain positions, the printable string, and the hash value of the entire file. Signatures can be generated automatically using appropriate tools.4 The advantage of this method is that it can detect malware quickly, is simple and lightweight, and has a low false-positive rate. However, it is susceptible to misclassifying new software, whether the sample is malicious or benign. Since these new software signatures do not appear in the database, analysts must maintain the signature database regularly to ensure the accurate detection of malicious software and reduce the false-positive rate.

Naik et al.5 proposed a fuzzy–import hashing technique, which integrates two methods: fuzzy hashing and import hashing. The general hashing algorithm identifies whether files are identical. The principle of fuzzy hashing is to divide a file into several blocks and perform hashing operations separately on them; the similarity is then calculated using these hash values. Many factors affect fuzzy hashing, including the file size, block size, and type of hash function. Associated methods include SSDEEP,6 SDHASH,7 and mvHASH-B.8 Their study compensates for the shortcomings of both fuzzy hashing and import hashing by integrating them. The results thus obtained show that this integrated approach improves the rate of analysis and the detection of malware and is much more effective than fuzzy hashing or import hashing alone. However, it is based on structural or syntactic similarity and does not detect behavioral and semantic similarities, disfavoring the detection of new malware.

Heuristic-Based Mechanism

Signature-based detection techniques are not effective in detecting unknown malware, but heuristic-based detection techniques can infer sample behavior through a series of features to identify unknown malware. Heuristic detection techniques can be divided into static and dynamic depending on the types of features used.9 Static heuristic detection enables analysis without executing suspicious files, using, for example, printable strings, OpCode, and control flow diagrams, which are features that are normally available through decompilation techniques. These features can be compared to malware features to recover the behavior of the program and to detect new types of malicious samples.

The advantages of this approach are the lack of need to run malware samples during the analysis as well as its speed and low cost. However, malware developers can make static heuristic detection less efficient through code obfuscation and the use of packer techniques.10

Dynamic heuristic detection is also often referred to as a behavior-based approach, wherein an isolated system environment actually simulates and records malware behavior and determines whether a sample is malicious or benign. This type of approach usually uses API calls, file system operations, and registry operations to record the actual interaction of malware with the system and to infer whether its behavior is malicious or not.

Pektaş and Acarman11 used mining and searching n-gram algorithms on API sequences to represent the behavioral features of malware, and they thereby extracted malicious API patterns using a proposed voting algorithm to identify the malware type. Cabau et al.12 proposed an automatic classification system that classifies suspicious samples as malicious or benign based on the characteristics of the file system, registry operations, and network access operations. Their study was the first to extract behavioral features from an isolated environment; quantify those features using a proposed algorithm; and, finally, use the support vector machine classifier for identification.

Dynamic heuristic detection is less affected than static heuristic detection by code obfuscation and packer technologies. However, dynamic heuristic detection also has shortcomings. First, the isolated environment and the associated dynamic analysis tools are easily detected by malware, which will then evade those tools, resulting in poor analysis efficiency. Another disadvantage is that the method is costlier and slower than static heuristic detection. Therefore, both methods can be combined to compensate for the shortcomings of each and, thus, effectively deal with reverse-analysis countermeasures. For example, the use of antidebugging or code obfuscation techniques by malware developers seriously affects the results of static analysis, but not those of dynamic analysis. Antidebugging techniques that target more than two types of analysis are not standard in common malware.10

Artificial Intelligence-Based Mechanisms

Machine learning requires a manual selection of the features of data, which are then learned and classified using a model. The selected features significantly affect the classification results of the model; deep learning can support the automatic learning of features from data by simulating the learning and operation process of human neurons. Although the computational and resource costs are higher than those of machine learning, the predictions are usually better.

Sharma et al.13 proposed a detection method that is based on OpCode execution frequency; it uses the Fisher score, information gain, and chi-square for feature selection and uses the random forest, logistic model tree, and Named Binary Tags (NBT) classifiers in the Waikato Environment for Knowledge Analysis (WEKA) for malware detection. Their results show that this method can effectively detect malware. Nguyen et al.14 proposed an IoT botnet detection approach that is based on a printable string information graph and a deep graph convolutional neural network classifier. They used the IDA Pro tool to generate control flow graphs, printable string information to build call graphs, and convolutional neural networks (CNNs) for classification.

Yuan et al.15 proposed a byte-level malware classification method that is based on Markov images and deep learning. The results thus obtained showed that this method is more efficient than gray-scale image-based methods. Various authors16–18 have performed the image transformation of executable files and used CNN models (e.g., VGG16 and ResNet-50) to classify malware.

Li et al.19 proposed a classifier that is based on a GCN. Their method uses malware API sequences to generate a directed cyclic graph. The results show that the method is effective in detecting malware and outperforms previously studied methods. Hung et al.20 proposed a malware detection method that is based on a multiedge dataflow graph representation and a CNN. They also proposed a model framework called MalGCN for malware detection.

PROPOSED METHODS

This section provides an overview of the proposed methods and describes the dataset and model in detail. Figure 1 shows the proposed system, which has four main parts: data collection, dataset creation, detection model and prediction, and classification model and prediction.

Data Collection

HatchingTriage (https://tria.ge/) is an online sandbox service that is provided by Hatching International B.V. in The Netherlands; it can perform dynamic software analysis on various platforms. HatchingTriage stores many malware samples, providing the family name, tags, and Secure Hash Algorithm 256 (SHA256) hash of each malware sample.


FIGURE 1. System overview. API: application programming interface; PE: portable executable.

PortableFreeware (www.portablefreeware.com) provides many samples of benign software. This study crawls these samples from PortableFreeware using Python crawler technology. The sample files that are collected in this study are of the Windows PE type. One thousand samples of each of eight types of malware were downloaded from Triage. A total of 8000 malicious samples and 2000 benign samples were thus collected. The eight categories of malicious samples are adware, backdoor, downloader, dropper, miner, ransomware, spyware, and worm.

Data Preprocessing

Figure 2 presents the steps of dataset creation. The first step is to submit the samples to the Cuckoo Sandbox for dynamic analysis and, thus, to obtain the execution analysis report and API execution sequence. The second step is to draw the call graph based on the API execution sequence. Then, the API names in the API execution sequence are used to train the word2vec model to convert the API information into a vector, which is used as the feature of each node of the graph.

Sandbox Analysis

To collect information about the behavior of malware during execution, Cuckoo Sandbox is used to perform a dynamic analysis. The sandbox provides a secure execution environment to prevent malware from infecting an entire system. After the samples are run, the sandbox provides information about the file system, registry values, network traffic, and function calls; it generates analysis reports in JavaScript Object Notation format, as shown in Figure 3.

By tracking the execution of each sample with Cuckoo Sandbox, the API call sequences of all processes during the execution period are obtained, and the respective call graph for each process is drawn, as shown in Figure 4. After the API execution sequence is received, the sequence is converted into a graph, which is defined as follows: G = {V, E}, where V is the set of vertices, V = {v1, v2, v3, ..., vn}, and E is the set of edges, E = {(x, y) | (x, y) ∈ V²}. Each vertex represents an API function.

When API A is executed and then API B is executed, an edge is created between nodes A and B. The direction of the edge depends on the order of execution of the two APIs. During execution, two identical APIs may be executed consecutively, so the graph that is drawn in this way is a directed cyclic graph. The benefits of employing a call graph include the following: 1) the quick comprehension of code, such as the ability to locate subfunctions that are not called by other programs; 2) the ability to monitor the change of variable values in each subfunction for subsequent analysis; and 3) the identification of irregular program execution, such as a code injection attack.
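As a concrete illustration of this construction, the following minimal sketch (not the authors' implementation) builds one directed call graph per process from the API call sequences in a sandbox report. It assumes the networkx library is available; the Cuckoo report field names used here (behavior, processes, calls, api) are an assumption about the report layout, included only for illustration.

```python
# Sketch only: build a directed call graph from an API call sequence, as
# described above. Assumes networkx; the Cuckoo JSON field names are an
# illustrative assumption, not a documented guarantee.
import json
import networkx as nx

def api_sequences_from_report(report_path):
    """Yield one API-name sequence per process from a sandbox JSON report."""
    with open(report_path, "r", encoding="utf-8") as fh:
        report = json.load(fh)
    for proc in report.get("behavior", {}).get("processes", []):
        yield [call["api"] for call in proc.get("calls", []) if "api" in call]

def build_call_graph(api_sequence):
    """Directed graph: one node per API, one edge per consecutive call pair."""
    graph = nx.DiGraph()
    graph.add_nodes_from(api_sequence)
    for caller, callee in zip(api_sequence, api_sequence[1:]):
        graph.add_edge(caller, callee)   # edge direction follows execution order
    return graph

# Example: per-process call graphs for one analyzed sample.
graphs = [build_call_graph(seq) for seq in api_sequences_from_report("report.json")]
```

Two identical consecutive calls simply produce a self-loop, which is consistent with the directed cyclic graphs described above.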

FIGURE 2. Dataset creation steps.


FIGURE 3. Sandbox analysis reports.

API Embedding

The call graph that is generated in these steps captures only the associations between APIs, without any of the characteristics of the APIs themselves, so this section explains how API embedding is performed. This study is based on the use of API pairs proposed by Shamsi et al.,21 in which the upward and downward call relationships of APIs are learned through Word2Vec and vectorized. The category of each API, such as file, system, or resource, is added to the vectors to represent their behavioral features. This study uses the Gensim Word2Vec22 open source library for API embedding. The corpus of the model was built using API sequences from sandbox analysis reports, and a total of 6000 software samples were used to generate 301 API functions. The training model uses skip-gram to obtain the semantics of the current API from the preceding and following APIs; these word vectors are used as node features of the call graph.

Call Graph Integration

Many non-PE software samples were found during sandbox analysis. The file command in Linux was used to filter and classify the files. After sandboxing and the generation of the call graphs, it was found that, among the dataset with a total of 10,000 samples (8000 malicious and 2000 benign), more than 1000 samples contained more than 200 nodes, and some of those contained even more than 1000 nodes, as shown in Figure 5. This is because the samples generated many behaviors during execution, such as creating other processes, copying themselves, and operating through other system programs, causing many API calls and producing a large number of nodes in the call graph. Therefore, in this study, the nodes in the call graph are pruned by the following method:

G = Sum(gi, gj) ∪ {gi − Sum(gi, gj)} ∪ {gj − Sum(gi, gj)}

where gi = (Ni, Ei) and gj = (Nj, Ej) are the call graphs of the processes generated during the execution of the sample. By overlaying the subcall graphs (gi and gj), the behavioral process of each process is added to G. The “Sum” term indicates that the overlapping output is summed. In this way, the behavioral information associated with each process can be maintained while the number of graph nodes is reduced. After pruning, the number of call graph nodes is below 200.

In Figure 5, “Trim” indicates that the redundant part has been removed, and “unTrim” indicates that it has not; the figure presents the statistics after pruning. Some of the samples that were collected during the data collection phase could not be executed directly in the sandbox. Some of those samples are of types, such as Microsoft Excel spreadsheet and Visual Basic Script, that are not included in this study; only the executable type is considered herein.
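To illustrate the API embedding step described above, the sketch below trains a skip-gram Word2Vec model on API call sequences with Gensim and attaches the resulting vectors to the call graph nodes. This is a hedged example rather than the authors' code: sg=1 selects skip-gram as stated in the text, while the vector size and context window are assumed values, since the article does not report them.

```python
# Sketch only: embed API names with Gensim Word2Vec (skip-gram) and store the
# vectors as node features, as described in the API Embedding section.
from gensim.models import Word2Vec

def train_api_embedding(api_sequences, dim=32):
    """api_sequences: list of API-name lists, one per analyzed process."""
    return Word2Vec(
        sentences=api_sequences,
        vector_size=dim,   # embedding dimension (assumed value)
        window=5,          # context window of preceding/following APIs (assumed)
        sg=1,              # 1 = skip-gram, as stated in the article
        min_count=1,
        workers=4,
    )

def attach_node_features(graph, w2v_model, dim=32):
    """Attach each API node's embedding vector as a node attribute."""
    for api in graph.nodes:
        if api in w2v_model.wv:
            graph.nodes[api]["x"] = w2v_model.wv[api].tolist()
        else:
            graph.nodes[api]["x"] = [0.0] * dim  # unseen API: zero vector
```

If the graphs are later fed to a GCN, a converter such as torch_geometric.utils.from_networkx can carry these node attributes over as the node feature matrix.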

FIGURE 4. Call graph.

FIGURE 5. Nodes in call graph.
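One possible reading of the integration formula above is that the behavior shared by two process graphs is kept once while each graph's remaining behavior is added to the result. The sketch below follows that reading using networkx's compose, which returns the union of nodes and edges of two graphs; it is an interpretation of the formula, not the authors' implementation, and the subsequent pruning to fewer than 200 nodes is not shown.

```python
# Sketch: integrate per-process call graphs into a single graph G, as in the
# Call Graph Integration section. nx.compose(g1, g2) unions nodes and edges,
# so shared behavior appears once and process-specific parts are preserved.
import networkx as nx
from functools import reduce

def integrate_call_graphs(process_graphs):
    """Merge the call graphs of all processes spawned by one sample."""
    if not process_graphs:
        return nx.DiGraph()
    return reduce(nx.compose, process_graphs)

# merged = integrate_call_graphs(graphs)   # 'graphs' from the earlier sketch
# assert merged.number_of_nodes() <= sum(g.number_of_nodes() for g in graphs)
```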


FIGURE 6. Model architecture. GCN: graph convolutional network.

Model Architecture

Figure 6 shows the architecture of the classification model. First, GCNConv, which implements an aggregate-and-combine strategy for learning graph node and structural information, is used. The GCN convolutional layer is then applied for three iterations, and the features of each node are stacked. Second, the global sort pool and a 1-D CNN, which form the readout strategy, are used to sort features based on the last layer of the GCN; the top 30 values are then selected and fed into the 1-D CNN and the fully connected layer to transform any simple graph into an embedding vector. Third, a linear transformation is applied to the embedding vector to convert it into the output dimension: two dimensions for the detection task and eight dimensions for the multiclassification task. Finally, the embedding vector is input to the LogSoftMax function and transformed into the probability distribution over the sample types. Table 1 shows the parameters of the model. In this classification model, the learning rate is 10⁻³, the batch size is 100, the optimization function is Adam, and the loss function is CrossEntropyLoss.

TABLE 1. Parameters of the model.*

Layer             (Input, Output)   Activation    Remarks
GCNConv           (N, 32)           Tanh          Convolution; N is the number of node features
GCNConv           (32, 32)          Tanh          Convolution
GCNConv           (32, 32)          Tanh          Convolution
GCNConv           (32, 1)           Tanh          Convolution
Global sort pool                                  K = 30
Conv1d            (1, 16)           ReLU          Convolution
MaxPool1d                                         Kernel size = 2, stride = 2
Conv1d            (16, 32)          ReLU          Convolution
Linear            (352, 128)        ReLU
Linear            (128, X)          LogSoftMax    X is the number of classes

*ReLU: rectified linear unit; Tanh: hyperbolic tangent.

PERFORMANCE ANALYSIS

The following section describes the experimental environment and compares the system performance with that of previously developed systems.

Experimental Environment

Table 2 lists the hardware devices used in this study.
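A hedged sketch of a classifier with the layer shapes of Table 1 is given below, written with PyTorch Geometric. The layer dimensions, activations, and training hyperparameters follow the table and the text; the DGCNN-style sort-pool readout, the 1-D convolution kernel sizes (chosen so that the flattened size works out to the 352 of Table 1), and the concatenation of all GCN layer outputs before pooling are assumptions about details the article does not spell out.

```python
# Hedged sketch (not the authors' released code). Assumes torch and
# torch_geometric; global_sort_pool is importable from torch_geometric.nn
# (newer releases expose the same operation as aggr.SortAggregation).
import torch
import torch.nn.functional as F
from torch import nn
from torch_geometric.nn import GCNConv, global_sort_pool

class WLGCNClassifier(nn.Module):
    def __init__(self, num_node_features, num_classes, k=30):
        super().__init__()
        self.k = k
        self.conv1 = GCNConv(num_node_features, 32)
        self.conv2 = GCNConv(32, 32)
        self.conv3 = GCNConv(32, 32)
        self.conv4 = GCNConv(32, 1)          # last layer used to sort the nodes
        total_dim = 32 + 32 + 32 + 1         # stacked per-node features (97)
        self.cnn1 = nn.Conv1d(1, 16, kernel_size=total_dim, stride=total_dim)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.cnn2 = nn.Conv1d(16, 32, kernel_size=5)
        self.fc1 = nn.Linear(352, 128)       # 32 channels * 11 positions = 352
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x, edge_index, batch):
        x1 = torch.tanh(self.conv1(x, edge_index))
        x2 = torch.tanh(self.conv2(x1, edge_index))
        x3 = torch.tanh(self.conv3(x2, edge_index))
        x4 = torch.tanh(self.conv4(x3, edge_index))
        x = torch.cat([x1, x2, x3, x4], dim=-1)      # stack node features
        x = global_sort_pool(x, batch, self.k)       # keep top-k sorted nodes
        x = x.view(x.size(0), 1, -1)                 # (batch, 1, k * 97)
        x = F.relu(self.cnn1(x))
        x = self.pool(x)
        x = F.relu(self.cnn2(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return F.log_softmax(self.fc2(x), dim=-1)

# Training setup reported in the article: Adam, learning rate 1e-3, batch size
# 100, CrossEntropyLoss (with log_softmax outputs, NLLLoss is the equivalent
# pairing in PyTorch).
```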


TABLE 2. Hardware specifications.*

Item   Specifications
CPU    AMD Ryzen 7 3700x eight-core CPU at 3.6 GHz
RAM    DDR4 2 × 32-GB RAM
HDD    2-TB HDD
GPU    NVIDIA GeForce GTX 1080 8 GB

*RAM: random-access memory; HDD: hard disk drive.

The hardware includes an AMD Ryzen 7 3700x eight-core CPU at 3.6 GHz, DDR4 2 × 32-GB random-access memory, a 2-TB HDD, and an NVIDIA GeForce GTX 1080 8-GB graphics card.

Since it is not certain whether the behavior of a suspect sample will cause system damage, dynamic analysis requires that the sample be executed in an isolated environment so that its dynamic behavior can be analyzed safely. This study uses the Ubuntu operating system as the Cuckoo host and Windows 7 64 bit as the guest system. The guest system is installed on VirtualBox, and Microsoft Office, Adobe Acrobat Reader DC, and other common user applications are installed beforehand to match real-world usage conditions closely. Table 3 provides the host and guest system information.

TABLE 3. Host and guest system information.*

Item             Specifications
Cuckoo Sandbox   Version 2.0.7
Host OS          Ubuntu 16.04 LTS
Host RAM         16 GB
Guest OS         Windows 7 64 bit
Guest RAM        4 GB

*OS: operating system.

Dataset Distribution

Table 4 shows the distribution of data in the dataset. The type name, the number of downloaded samples, and the number of samples that were finally used in training are shown for each sample type. The dataset contains adware, backdoor, downloader, dropper, miner, ransomware, spyware, worm, and benign samples, for a total of 3731 trainable malicious samples and 1784 benign samples.

TABLE 4. Dataset distribution.

Number  Type         Downloaded  Trainable data
1       Adware       1000        223
2       Backdoor     1000        415
3       Downloader   1000        364
4       Dropper      1000        757
5       Miner        1000        784
6       Ransomware   1000        633
7       Spyware      1000        368
8       Worm         1000        187
9       Benign       2000        1784

To allow the classification model to be trained and validated correctly in each epoch, test data that are not included in the training set are used for a reasonable evaluation of model performance. In this study, 100 randomly selected samples of each sample type in the dataset were used as the test set, and the remaining data were divided into 90% for the training dataset and 10% for the validation dataset.

Results of the Analysis

Detection Task
In this study, 150 epochs were used for the detection part of the training. This number was chosen because the model would be overfitted after more than 150 epochs. Figure 7 shows the training history of the model for the detection task. As seen in the figure, the model improves and converges quickly after about 10 epochs. The imbalance of the dataset causes the model to prefer the classes with more samples in the initial training phase. After 150 epochs, the training and validation accuracies of the model were around 0.97.

After the model was trained, a test dataset was used to make predictions, and the predictions were converted into a confusion matrix. According to the confusion matrix, the predictions of the model are biased toward malware. The imbalance in the dataset causes the predictions of the model to be biased because there is about twice as much malware as benign software. However, the prediction accuracy for benign software is still 89%. Figure 8 shows these indicators. The predictions on the test dataset had an accuracy of 0.945, a precision of 0.95, a recall of 0.945, an F1 score of 0.945, and a macro area under the curve (AUC) of 0.94.

Multiclassification Task
The model was trained for 200 epochs in the multiclassification task; as in the detection task, this number was chosen to prevent overfitting. Figure 9 shows the training history of the model in the multiclassification task. It shows that the model improves and converges rapidly after about 10 epochs. After 200 epochs, the training accuracy of the model is 96%, and the validation accuracy is 90%.
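The split described above (100 test samples per class, then a 90%/10% training/validation split of the remainder) can be written as in the following sketch. The list-of-(graph, label) sample format and the use of scikit-learn's train_test_split are assumptions made only for illustration.

```python
# Sketch of the per-class split described above: 100 random samples of every
# class are held out for testing; the rest are split 90% / 10% into training
# and validation sets. 'samples' is assumed to be a list of (graph, label) pairs.
import random
from collections import defaultdict
from sklearn.model_selection import train_test_split

def split_dataset(samples, test_per_class=100, val_ratio=0.1, seed=42):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)

    test, rest = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        test.extend(items[:test_per_class])
        rest.extend(items[test_per_class:])

    train, val = train_test_split(rest, test_size=val_ratio, random_state=seed)
    return train, val, test
```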


FIGURE 7. Training history for the detection task. AUC: area under the curve; ROC: receiver operating characteristic.

FIGURE 8. Performance for the detection task. acc: accuracy.

FIGURE 9. Training history for the multiclassification task.


FIGURE 10. Performance for the multiclassification task.
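The metrics quoted for both tasks (accuracy, precision, recall, F1 score, and macro AUC, as visualized in Figures 8 and 10) can be computed from a trained model's test-set predictions roughly as follows. This is a generic evaluation sketch rather than the authors' script; it assumes scikit-learn and a torch_geometric DataLoader, and the weighted averaging of precision, recall, and F1 is an assumption, since the article does not state which averaging was used.

```python
# Sketch: evaluate a trained model on the held-out test set and report the
# metrics used in the article (accuracy, precision, recall, F1, macro AUC).
import torch
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

@torch.no_grad()
def evaluate(model, test_loader, device="cpu"):
    model.eval()
    y_true, y_pred, y_prob = [], [], []
    for data in test_loader:                      # torch_geometric DataLoader
        data = data.to(device)
        log_probs = model(data.x, data.edge_index, data.batch)
        y_true.extend(data.y.tolist())
        y_pred.extend(log_probs.argmax(dim=-1).tolist())
        y_prob.extend(log_probs.exp().tolist())   # class probabilities

    if len(set(y_true)) == 2:                     # detection task (binary)
        auc = roc_auc_score(y_true, [p[1] for p in y_prob])
    else:                                         # multiclassification task
        auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")

    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
        "macro_auc": auc,
    }
```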

After the model has been trained, a test dataset is used to make predictions, and the predictions are plotted as a confusion matrix. After training with benign software, the accuracy of the detection of adware and backdoor improves because some of the behavioral characteristics of these samples are similar to those of benign software. However, the accuracy for ransomware and benign samples is relatively low because some of the ransomware samples use antisandboxing techniques and do not perform any malicious operations after entering the sandbox, making them similar to benign samples. Figure 10 shows the performance metrics of the model in the multiclassification task, including the confusion matrix, the receiver operating characteristic curve, and other metrics. The predictions on the test dataset had an accuracy of 0.926, a precision of 0.930, a recall of 0.926, an F1 score of 0.926, and a macro AUC of 0.96.

Performance Comparison

This section evaluates performance by comparing the proposed model with similar previous studies, using the test dataset for all comparisons to ensure a consistent benchmark.

Li et al.19 proposed a classifier that was based on a GCN. The method uses malware API sequences to generate a directed cyclic graph. However, they did not provide the details of their experiments. In particular, they failed to explain 1) how the APIs were embedded, 2) how the call graph of each program was integrated in the dynamic analysis, and 3) the model hyperparameters. Therefore, the version of their approach that is implemented in this study for comparison differs from that of Li et al. but is conceptually similar to it. For the GCN model to perform learning and graph classification tasks on a call graph that represents the behavior of the whole graph, the nodes must be defined in advance; in addition, the nodes of the various graphs differ. This reimplementation uses one-hot encoding for API embedding, potentially generating very large node representations, but the encoding suffices to represent the behavior of and information about each API. Global_Max_Pool is the readout strategy, and, finally, SoftMax is used to output the final result.

Table 5 compares the performance of the graph method and model that are developed in this study with that achieved in other studies.

TABLE 5. Performance metrics of comparison with previous studies.*

                Detection task                     Multiclassification task
Graph method    Markov chain    Proposed method    Markov chain    Proposed method
Model           GCN             WL-GCN             WL-GCN          WL-GCN
Precision       0.896           0.950              0.856           0.930
Recall          0.875           0.945              0.856           0.926
F1 score        0.873           0.945              0.854           0.926
Accuracy        0.875           0.945              0.856           0.926

*GCN: graph convolutional network; WL-GCN: Weisfeiler–Lehman graph kernel-based graph convolutional network.


The graph method of Li et al.19 uses Markov chain-based methods to generate call graphs and perform graph classification. In the model part, they use a traditional GCN to detect malware. As the model that was used by Li et al. did not perform well in the multiclassification task, the graph method is compared with the Markov chain-based approach only in the multiclassification task.

CONCLUSION

This study proposes and successfully applies a method for malware detection and classification that is based on GCNs. Samples of software are analyzed using a sandbox to obtain the actual behaviors and functions that are executed by the malware and to build a function call graph. Then, the behaviors and features of each API are represented by embedding the API call sequences into the call graph. Finally, the API representation vector is used as the feature of each call graph node, and the function call graph can represent the malware behavior. GCNs are used for malware detection and classification. Our future work will focus on antidetection, including antisandboxing technology, which makes the analysis more complicated; research on antidetection must be conducted to reduce malware evasion. In addition, by mining the APIs that are common to malware and incorporating them into the model, for example a graph attention network model, the call graph can become more discriminating.

REFERENCES

1. “Global threat landscape report,” Fortinet, Sunnyvale, CA, USA, Aug. 2021. Accessed: Apr. 20, 2022. [Online]. Available: https://www.fortinet.com/content/dam/maindam/PUBLIC/02_MARKETING/08_Report/report-2021-threat%20landscape.pdf
2. “SonicWall cyber threat report,” SonicWall, Milpitas, CA, USA, 2021. Accessed: Apr. 21, 2022. [Online]. Available: https://www.sonicwall.com/medialibrary/en/white-paper/2021-cyber-threat-report.pdf
3. “AV-ATLAS analyzes.” AV-TEST. Accessed: Mar. 5, 2022. [Online]. Available: https://portal.av-atlas.org/
4. N. Naik, P. Jenkins, R. Cooke, J. Gillett, and Y. Jin, “Evaluating automatically generated YARA rules and enhancing their effectiveness,” in Proc. IEEE Symp. Ser. Comput. Intell., 2020, pp. 1146–1153, doi: 10.1109/SSCI47803.2020.9308179.
5. N. Naik, P. Jenkins, N. Savage, L. Yang, T. Boongoen, and N. Iam-On, “Fuzzy-import hashing: A static analysis technique for malware detection,” Forensic Sci. Int., Digit. Investigation, vol. 37, Jun. 2021, Art. no. 301139, doi: 10.1016/j.fsidi.2021.301139.
6. J. Kornblum, “Identifying almost identical files using context triggered piecewise hashing,” Digit. Investigation, vol. 3, pp. 91–97, Sep. 2006, doi: 10.1016/j.diin.2006.06.015.
7. V. Roussev, “Data fingerprinting with similarity digests,” in Proc. Adv. Digit. Forensics VI, 2010, pp. 207–226, doi: 10.1007/978-3-642-15506-2_15.
8. F. Breitinger, K. P. Astebøl, H. Baier, and C. Busch, “mvHash-B – A new approach for similarity preserving hashing,” in Proc. Int. Conf. IT Secur. Incident Manage. IT Forensics, 2013, pp. 33–44, doi: 10.1109/IMF.2013.18.
9. Ö. A. Aslan and R. Samet, “A comprehensive review on malware detection approaches,” IEEE Access, vol. 8, pp. 6249–6271, Jan. 2020, doi: 10.1109/ACCESS.2019.2963724.
10. Z. Guo, W. Zhang, W. Yang, X. Che, Z. Zhang, and M. Li, “A survey on feature extraction methods of heuristic backdoor detection,” in Proc. Int. Conf. Frontiers Electron., Inf. Comput. Technol., 2021, pp. 1–7, doi: 10.1145/3474198.3478137.
11. A. Pektaş and T. Acarman, “Malware classification based on API calls and behaviour analysis,” IET Inf. Secur., vol. 12, no. 2, pp. 107–117, Mar. 2018, doi: 10.1049/iet-ifs.2017.0430.
12. G. Cabau, M. Buhu, and C. P. Oprisa, “Malware classification based on dynamic behavior,” in Proc. 18th Int. Symp. Symbolic Numer. Algorithms Scientific Comput. (SYNASC), 2016, pp. 315–318, doi: 10.1109/SYNASC.2016.057.
13. S. Sharma, C. R. Krishna, and S. K. Sahay, “Detection of advanced malware by machine learning techniques,” in Proc. Adv. Intell. Syst. Comput., 2019, pp. 333–342, doi: 10.1007/978-981-13-0589-4_31.
14. H. T. Nguyen, Q. D. Ngo, and V. H. Le, “IoT botnet detection approach based on PSI graph and DGCNN classifier,” in Proc. IEEE Int. Conf. Inf. Commun. Signal Process. (ICICSP), 2018, pp. 118–122, doi: 10.1109/ICICSP.2018.8549713.
15. B. Yuan, J. Wang, D. Liu, W. Guo, P. Wu, and X. Bao, “Byte-level malware classification based on Markov images and deep learning,” Comput. Secur., vol. 92, May 2020, Art. no. 101740, doi: 10.1016/j.cose.2020.101740.
16. F. Zhong, Z. Chen, M. Xu, G. Zhang, D. Yu, and X. Cheng, “Malware-on-the-brain: Illuminating malware byte codes with images for malware classification,” IEEE Trans. Comput., vol. 72, no. 2, pp. 438–451, Feb. 2022, doi: 10.1109/TC.2022.3160357.
17. D. Gibert, C. Mateu, J. Planes, and R. Vicens, “Using convolutional neural networks for classification of malware represented as images,” J. Comput. Virol. Hacking Techn., vol. 15, no. 1, pp. 15–28, Mar. 2019, doi: 10.1007/s11416-018-0323-0.


18. D. Vasan, M. Alazab, S. Wassan, B. Safaei, and Q. Zheng, “Image-based malware classification using ensemble of CNN architectures (IMCEC),” Comput. Secur., vol. 92, May 2020, Art. no. 101748, doi: 10.1016/j.cose.2020.101748.
19. S. Li, Q. Zhou, R. Zhou, and Q. Lv, “Intelligent malware detection based on graph convolutional network,” J. Supercomputing, vol. 78, no. 3, pp. 4182–4198, Feb. 2022, doi: 10.1007/s11227-021-04020-y.
20. N. V. Hung, P. Ngoc Dung, T. N. Ngoc, V. Dinh Phai, and Q. Shi, “Malware detection based on directed multi-edge dataflow graph representation and convolutional neural network,” in Proc. 11th Int. Conf. Knowl. Syst. Eng. (KSE), 2019, pp. 1–5, doi: 10.1109/KSE.2019.8919284.
21. F. Al Shamsi, W. L. Woon, and Z. Aung, “Discovering similarities in malware behaviors by clustering of API call sequences,” in Proc. Int. Conf. Neural Inf. Process., 2018, pp. 122–133, doi: 10.1007/978-3-030-04212-7_11.
22. R. Řehůřek and P. Sojka, “Software framework for topic modelling with large corpora,” in Proc. LREC Workshop New Challenges NLP Frameworks, 2010, pp. 45–50.

HSIANG-YU CHUANG is a graduate student with the Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, 106335, Taiwan. His research interests include machine learning, deep learning, and malware detection. Chuang received his M.S. degree in electrical engineering from National Taiwan University of Science and Technology. Contact him at m10907504@gapps.ntust.edu.tw.

JIANN-LIANG CHEN is a distinguished professor and dean of the Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, 106335, Taiwan. His research interests include cellular mobility management, cybersecurity, personal communication systems, and the Internet of Things. Chen received his Ph.D. degree in electrical engineering from National Taiwan University. He is a Senior Member of IEEE. Contact him at lchen@mail.ntust.edu.tw.

YI-WEI MA is an assistant professor with the Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, 106335, Taiwan. His research interests include the Internet of Things, cloud computing, and future networks. Ma received his Ph.D. degree in engineering science from National Cheng Kung University. He is the corresponding author of this article. Contact him at ywma@mail.ntust.edu.tw.

