
Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 176 (2020) 3436–3445
www.elsevier.com/locate/procedia

24th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems

Identification of library functions statically linked to Linux malware without symbols

Shu Akabane, Takeshi Okamoto*
Kanagawa Institute of Technology, 1030 Shimo-ogino, Atsugi, Kanagawa 243-0292, JAPAN

Abstract

Many Linux malware samples have been found to have statically linked library functions. Much of this malware is
stripped of function names and addresses, hindering function-level analysis. For function-level analysis, we identified
library functions statically linked to 2,256 malware samples with the Intel 80386 architecture by matching patterns.
The pattern matching identified more than 90% of the library functions for 97.7% of the samples. Thus, pattern
matching can be effective for library identification. Only 12 toolchains had been used to build 99.8% of samples, and
11 of the toolchains are available on the Internet. The C library used by the malware was uClibc in 96.5% of the
samples, musl in 1.3% and GLIBC in 2.0%.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the KES International.
Keywords: Library function identification, library identification, toolchain identification, Linux malware analysis, pattern matching

* Corresponding author. Tel.: +81-46-291-3264; fax: +81-46-291-3272.
E-mail address: take4@nw.kanagawa-it.ac.jp
ISSN: 1877-0509. DOI: 10.1016/j.procs.2020.09.053

1. Introduction

Cyberattacks from malware-infected IoT devices have been observed around the world since 2016, with Kaspersky
reporting 100 million attacks detected in the first half of 2019 alone [1]. Since many IoT devices run Linux for
embedded devices, the amount of Linux malware is increasing rapidly, creating demand for analysis of Linux
malware and for more sophisticated analysis techniques.
Malware analysis can be divided into surface, dynamic and static analyses. Surface analysis examines the similarity
to previously analyzed malware by checking character strings and function names contained in the malware file to
obtain information crucial for dynamic and static analyses. Dynamic analysis examines the behavior of malware by
executing it in a sandbox based on the information obtained from the surface analysis. Static analysis finds unknown
functions that were not observed in the dynamic analysis, including functions that were hidden by an anti-sandbox
technique.
A large number of Linux malware samples with statically linked library functions have been found [2]. Many of them
are stripped of all symbols such as function names and addresses, hindering function-level analysis. For example,
surface analysis cannot identify malware based on function names because the names of functions statically
linked to the malware cannot be obtained; dynamic analysis cannot trace library calls made by a program with
statically linked functions; and static analysis requires a long time to find unknown functions using a disassembler,
for lack of information about the library functions. Even the most recent comprehensive analysis of various Linux
malware did not analyze statically linked library functions [2].
Various methods of identifying functions in malware have been proposed. Function identification by pattern
matching is difficult because compiler-generated machine code differs for each toolchain, which is a set of
programming tools such as a compiler, C library, assembler, etc. Moreover, function identification based on the
similarity of machine code and/or its control flow graph gives false results if two different functions have the same
sequence of instructions but different registers or different immediate data. In fact, many C library functions result in
false positives for this reason.
Many Linux malware samples use cross-compilers and libraries for embedded devices [2]. In addition, since much Linux
malware is not equipped with anti-analysis features such as packing, sandbox detection and anti-execution [2], it is
likely to be built with a well-known toolchain rather than a toolchain customized for anti-analysis. Indeed, our pilot
study confirmed that many malware samples were built with the toolchain of Firmware Linux 0.9.6, which is described
in the Mirai installation guide [3].
Because of the high diversity of C libraries for desktop computers and servers, pattern matching for identification
of these library functions requires generation of various patterns from a large number of libraries. This requires a lot
of time and effort. In contrast, there are few C libraries for embedded devices, and fewer versions. Furthermore,
malware developers prefer toolchains built with the Firmware Linux series [4]. Therefore, pattern matching should be
effective for identifying library functions.
As a first step towards the analysis of Linux malware with statically linked library functions, we aim to identify
library functions by matching patterns generated from the libraries in toolchains built with the Firmware Linux series
and other build tools ranked highly in search results. To confirm that the library functions could be identified using
pattern matching, we generated patterns for library functions and attempted to identify the library functions and
toolchains used by 2,256 malware samples with the Intel 80386 architecture collected by a honeypot in our laboratory.
We also investigated the libraries and programming languages of the malware, and the toolchains frequently used by
malware families.
Our findings are:
• Pattern matching identified the toolchains used to build 97.7% of samples, and hence pattern matching can be
  effective for identification of library functions statically linked to Linux malware on embedded devices.
• 99.8% of malware samples were built using only 12 toolchains, of which 11 are available on the Internet.
• 94.5% of malware samples were built with only the toolchain of Firmware Linux 0.9.6, which is described in the
  Mirai installation guide.
• The malware programming language was C in 99.7% of cases, C++ in 0.09%, and Go in 0.04%.
• The C library statically linked to the malware was uClibc in 96.5% of cases, musl in 1.3%, and GLIBC in 2.0%.

2. Related work

IDA F.L.I.R.T. [5] performs signature matching. The signatures consist of the first 32 bytes of the function prolog
and a CRC checksum of the remaining function body. However, the checksum is only computed until the first variant
byte appears, so unique matching of C library functions will often not be achieved.
Libc-database [6] identifies Linux distributions and GLIBC libraries by using a database of functions in the GLIBC
of various Linux distributions. Given some function names and starting addresses, libc-database can identify the
Linux distribution and GLIBC library. However, malware stripped of all symbols does not contain function names or
starting addresses, which makes identification difficult. In addition, since libc-database is built for Linux on
desktop computers or servers such as Ubuntu, not for embedded Linux, it cannot be used to identify libraries of Linux
malware running on embedded devices such as IoT devices.
KISS-lib [7] identifies functions by efficient pattern matching. It combines a large database, such as libc-database
with a fast search method, such as IDA F.L.I.R.T. KISS-lib achieved a success rate of greater than 99.9% for function
identification from fragment data leaked by exploits. However, only 17.2% of the library functions statically linked
to a program were identified, so KISS-lib is unsuitable for malware analysis.
FCatalog [8] identifies functions by comparing instruction sequences of pairs of functions. It can identify a function
even if the order of instructions in the pair differs or an instruction is replaced with an alternative. However, it may
falsely identify different functions as being the same if they have the same sequence of instructions but different
registers or different immediate data in some of the instructions. BinDiff [9] and related methods [10-13] identify a
function by comparing the control flow graphs of two functions. They are tolerant to instruction replacement and
sequence reordering in a node of the control flow graph, but they may falsely identify different functions as being the
same if they have the same control flow graph.
BinSequence [14] combines instruction sequence similarity and control flow graph similarity to identify functions.
BinSequence identified functions with the highest accuracy when compared to four other methods: BinDiff, FCatalog,
Diaphora, and PatchDiff2 [15]. Like the methods above, BinSequence may falsely identify different functions as being
the same if a pair of functions has a similar sequence of instructions or similar control flow graphs.

3. Dataset

We collected malware samples using cowrie [16], a known low-interaction honeypot for SSH and Telnet services,
intermittently between August 2017 and September 2019. Non-ELF malware files, such as Perl scripts, were excluded
from the collected samples. Our dataset consisted of 3,052 ELF executables, covering nine different architectures. The
percentage distribution of the architectures is shown in Table 1. The most frequent was Intel 80386, and 93.9% of all
the samples had a 32-bit architecture, probably because 32-bit architectures are used in most embedded devices.
Furthermore, using the file command, we found that 98.5% of the samples were statically linked, and 78.5% were
stripped of all symbols, e.g., function names, function addresses, etc. In other words, in 78.5% of the samples, the
library functions were cloaked in secrecy.
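
As an illustration of the kind of check the file command performs, the following is a minimal sketch, not the procedure used in this study. It assumes a little-endian ELF32 image, treats a file as statically linked when it has neither a PT_INTERP nor a PT_DYNAMIC program header, and treats it as stripped when it has no SHT_SYMTAB section. The function name inspect_elf32 is hypothetical, and the real file command applies additional heuristics.

```python
import struct

PT_DYNAMIC, PT_INTERP = 2, 3   # program header types from the ELF specification
SHT_SYMTAB = 2                 # section type of the static symbol table


def inspect_elf32(data: bytes) -> dict:
    """Report whether a little-endian ELF32 image looks static and stripped."""
    if data[:4] != b"\x7fELF" or data[4] != 1 or data[5] != 1:
        raise ValueError("not a little-endian ELF32 file")
    # ELF32 file header: e_ident, e_type, e_machine, e_version, e_entry,
    # e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum,
    # e_shentsize, e_shnum, e_shstrndx
    (_, _, _, _, _, e_phoff, e_shoff, _, _, e_phentsize, e_phnum,
     e_shentsize, e_shnum, _) = struct.unpack_from("<16sHHIIIIIHHHHHH", data, 0)
    # p_type is the first 32-bit field of each program header.
    p_types = [struct.unpack_from("<I", data, e_phoff + i * e_phentsize)[0]
               for i in range(e_phnum)]
    # sh_type is the second 32-bit field of each section header.
    s_types = [struct.unpack_from("<I", data, e_shoff + i * e_shentsize + 4)[0]
               for i in range(e_shnum)]
    return {
        "static": PT_INTERP not in p_types and PT_DYNAMIC not in p_types,
        "stripped": SHT_SYMTAB not in s_types,
    }
```

Reading a sample with open(path, "rb").read() and passing the bytes to inspect_elf32 would give one row of such a survey; samples for other architectures (big-endian or 64-bit) would need the corresponding header layouts.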

Table 1. Percentage of architectures in the collection of samples.

Architecture        Number of samples   Percentage
Intel 80386         2,272               74.44%
MIPS 32-bit         408                 13.37%
x86-64              187                 6.13%
ARM 32-bit          103                 3.37%
Renesas 32-bit      28                  0.92%
Power PC 32-bit     18                  0.59%
Motorola 32-bit     17                  0.56%
SPARC 32-bit        14                  0.46%
ARC 32-bit          5                   0.16%

All samples were scanned by VirusTotal, then fed into the AVClass tool [17] to classify them into malware families.
The percentage distribution of the malware families in the collection is shown in Table 2. The most frequent malware
family was Mirai, accounting for 81.65% of the total, and 99.2% of samples belonged to DDoS malware families:
Mirai, Gafgyt, Xor DDoS, Tsunami, DDosSTF, ChinaZ, and Setag. AVClass failed to classify 19 samples into a
family.
To bypass detection by pattern matching, 30.9% of the samples had been packed. Because packed malware is
compressed or encrypted, these samples had to be unpacked before their library functions could be identified. Of the
packed samples, 87.7% were UPX-packed, and we unpacked them with UPX. The remaining 12.3% of the packed samples
were packed with other packers, and we ran each of them in a sandbox and extracted their unpacked executable code
from the virtual memory just after unpacking.

Table 2. Percentage of malware families in the collection of samples

Malware family name     Number of samples   Percentage
Mirai                   2,492               81.65%
Gafgyt                  475                 15.56%
Xor DDoS                41                  1.34%
Tsunami                 17                  0.56%
DDosSTF                 2                   0.07%
ChinaZ                  1                   0.03%
Lightaidra              1                   0.03%
Setag                   1                   0.03%
Silex                   1                   0.03%
SSHgo                   1                   0.03%
XMRig                   1                   0.03%
Unclassified samples    19                  0.62%

4. Function identification and definition of identification accuracy

4.1. Identification of library functions and rule generation by YARA

Identification of library functions was accomplished by comparing the machine code of functions statically linked
to malware with that of static library functions in toolchains frequently used in embedded devices. The assumption
was that many samples are built with compilers frequently used for embedded devices. For pattern matching, we used
YARA, a pattern matching tool for malware researchers. A YARA rule for malware detection corresponds to a
signature of malware such as byte sequences specific to that malware, while a YARA rule for library identification
corresponds to machine code in a library function.
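
To make concrete what matching machine code with wildcards means, the following is a minimal sketch that emulates a YARA-style hexadecimal string with Python's re module. It is not YARA itself and not the tooling used in this study; the helper names hex_pattern_to_regex and find_offsets, and the sample bytes in the usage note, are hypothetical.

```python
import re
from typing import List


def hex_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a hex string such as '55 89 E5 ?? C3' into a bytes regex.

    Each '??' token matches any single byte; every other token must be a
    two-digit hexadecimal byte value and is matched literally.
    """
    parts = []
    for token in pattern.split():
        if token == "??":
            parts.append(b".")
        else:
            parts.append(re.escape(bytes([int(token, 16)])))
    # DOTALL lets '.' also match the byte 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)


def find_offsets(pattern: str, data: bytes) -> List[int]:
    """Return the file offsets at which the pattern matches."""
    return [m.start() for m in hex_pattern_to_regex(pattern).finditer(data)]
```

For example, find_offsets("55 89 E5 E8 ?? ?? ?? ?? C9 C3", data) reports where a function body of that shape occurs in data regardless of the four address bytes that follow the E8 call opcode.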
Static library functions linked to ELF executables are identical to the functions in static libraries such as libc.a
and libstdc++.a, except for the relocatable addresses they contain, which are overwritten by the linker according to
their relocation type in the relocation table. To generate YARA rules for function identification, we retrieved each
function in all the static libraries in turn and generated a YARA rule for each function. Each YARA rule consisted of
a hexadecimal string of the machine code of the function, up to a maximum of 200 bytes. Relocatable addresses were
replaced by YARA wildcards. Some library functions can be very short, e.g., only one byte excluding wildcards. To
prevent non-function code from being falsely identified as a function, YARA rules were not generated from functions
that were 3 bytes or shorter, not counting wildcards.
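
The rule generation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' generator: the function name make_yara_rule, the rule naming scheme, and the assumption that every relocation covers 4 bytes are all hypothetical. The 200-byte limit and the exclusion of functions with 3 or fewer non-wildcard bytes follow the text.

```python
import re
from typing import Iterable, Optional


def make_yara_rule(name: str, code: bytes, reloc_offsets: Iterable[int],
                   reloc_size: int = 4, max_len: int = 200,
                   min_fixed: int = 4) -> Optional[str]:
    """Build a YARA rule text for one library function, or None if too short.

    code          -- machine code of the function taken from the static library
    reloc_offsets -- offsets inside code where a relocatable address starts
    reloc_size    -- assumed width of each relocation in bytes (hypothetical)
    """
    body = code[:max_len]                      # at most 200 bytes per rule
    wildcard = set()
    for off in reloc_offsets:
        wildcard.update(range(off, off + reloc_size))
    tokens = ["??" if i in wildcard else f"{b:02X}" for i, b in enumerate(body)]
    fixed = sum(1 for t in tokens if t != "??")
    if fixed < min_fixed:                      # 3 bytes or fewer: no rule
        return None
    ident = "func_" + re.sub(r"\W", "_", name)  # YARA identifiers: [A-Za-z0-9_]
    return (
        f"rule {ident}\n"
        "{\n"
        "    strings:\n"
        f"        $code = {{ {' '.join(tokens)} }}\n"
        "    condition:\n"
        "        $code\n"
        "}\n"
    )
```

Calling make_yara_rule("rand", bytes.fromhex("e900000000"), [1]) returns None because only one byte remains once the jump target is wildcarded, which mirrors the exclusion of very short functions.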

4.2. Definition of identification accuracy

Applying the YARA rules, we defined the identification accuracy of library functions by using the classes of a
confusion matrix: true positive, true negative, false positive and false negative. To define these classes, we
first describe the order in which functions are linked into the ELF executable file by the GNU linker:

1. C runtime functions defined in crt1.o, crti.o and crtbeginT.o.
2. Program-defined functions.
3. Library functions defined in libc.a, libm.a, libstdc++.a, libgcc.a, libssl.a, etc.
4. C runtime functions defined in crtend.o and crtn.o.

The C runtime functions (items 1 and 4) and the library functions (item 3) were defined as the positive class, and
the program-defined functions (item 2) were defined as the negative class. Under these definitions, the identification
result of each function is classified into the following classes of the confusion matrix:

• True positive (TP): a library function is correctly identified.
• True negative (TN): a program-defined function is correctly classified as not being a library function.
• False positive (FP): the identified function is not a library function.
• False negative (FN): a library function is not identified.

Based on these classes, we defined the detection accuracy of the library functions:

• Accuracy of function identification:
  accuracy = (TP + TN) / (TP + TN + FP + FN)
• True positive rate (TPR), i.e., coverage of library functions correctly classified:
  TPR = TP / (TP + FN)
• True negative rate (TNR), i.e., coverage of program-defined functions correctly classified:
  TNR = TN / (TN + FP)
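
The three measures can be computed directly from the four counts. The following is a small sketch; the function name identification_metrics and the convention of returning 0.0 for an empty denominator are assumptions made for illustration.

```python
from typing import Dict


def identification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, TPR and TNR from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        # accuracy = (TP + TN) / (TP + TN + FP + FN)
        "accuracy": (tp + tn) / total if total else 0.0,
        # TPR = TP / (TP + FN): coverage of library functions
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        # TNR = TN / (TN + FP): coverage of program-defined functions
        "tnr": tn / (tn + fp) if (tn + fp) else 0.0,
    }
```

For instance, with hypothetical counts of 90 true positives, 8 true negatives, no false positives and 2 false negatives, the accuracy is 98/100 and the TPR is 90/92.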

4.3. Estimation of detection accuracy

We could not confirm whether the library functions were identified correctly in 78.5% of the samples because their
function names and addresses had been stripped. We therefore estimated TP, TN, FP and FN based on the order of the
functions linked by the GNU linker, as described in Section 4.2.
First, we used Capstone [18], a lightweight multi-architecture disassembly platform, to find the functions in the
malware. Then, we scanned the malware using YARA to find the library functions, which we sorted in virtual
address order. Based on the link order, the first function found by YARA just after the set of program-defined
functions is assumed to be the first library function, since program-defined functions are not detected by our YARA
rules, and the last function found by YARA just before the last set of C runtime functions is assumed to be the last
library function. That is, all functions from the first library function to the last library function are assumed
to be library functions. Similarly, the first function just after the first set of C runtime functions is assumed to
be the first program-defined function, and the last function just before the first library function is assumed to be
the last program-defined function. That is, all functions from the first program-defined function to the last
program-defined function are assumed to be program-defined functions. Under this assumption, if the true first library
function is missed, the functions between the true first library function and the falsely assumed first library
function are regarded as program-defined functions, resulting in an overestimate of identification accuracy. This
overestimate can be confirmed and corrected manually.
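
The estimation procedure can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation. It assumes that each function found by the disassembler carries a label in virtual-address order: "crt" if a C runtime rule matched, "lib" if a library rule matched, and None if no rule matched. The function name estimate_counts and the "unassigned" bucket, which holds unmatched functions lying between the last matched function and the trailing C runtime block, are hypothetical.

```python
from typing import Dict, List, Optional


def estimate_counts(labels: List[Optional[str]]) -> Dict[str, int]:
    """Estimate TP/TN/FP/FN from per-function match labels in address order."""
    n = len(labels)
    head = 0                                   # end of the leading C runtime block
    while head < n and labels[head] == "crt":
        head += 1
    tail = n                                   # start of the trailing C runtime block
    while tail > head and labels[tail - 1] == "crt":
        tail -= 1
    crt_count = head + (n - tail)              # C runtime functions count as positives
    matched = [i for i in range(head, tail) if labels[i] is not None]
    if not matched:
        # Nothing matched between the C runtime blocks: all assumed program-defined.
        return {"TP": crt_count, "TN": tail - head, "FP": 0, "FN": 0, "unassigned": 0}
    first_lib, last_lib = matched[0], matched[-1]
    return {
        "TP": crt_count + len(matched),
        # Functions before the first match are assumed to be program-defined.
        "TN": first_lib - head,
        # By construction no match can fall inside the assumed program region.
        "FP": 0,
        # Unmatched functions inside the assumed library region.
        "FN": (last_lib - first_lib + 1) - len(matched),
        "unassigned": tail - 1 - last_lib,
    }
```

Because the first matched function defines where the library region starts, this estimate can never report a false positive on its own, and a missed first library function is counted as a true negative. That is the overestimate the text says must be checked manually.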

4.4. Evaluation of estimated identification accuracy

We evaluated estimates of identification accuracy without function names and addresses for 93 unstripped samples
and five samples built from source code with a toolchain of Aboriginal Linux 1.2.0 i586. There were 30 unstripped
samples from the Mirai family, 30 from the Gafgyt family, 1 from the Silex family, and 32 from the Xor DDoS family. We confirmed that
the estimates of identification accuracy were the same as the true identification accuracy for all samples. In addition,
we visually confirmed that the first and last library functions were correctly identified in all samples, and hence our
estimates of identification accuracy were valid. The estimates of the identification accuracy are listed in Table 3.
We examined functions not identified by YARA. There were only three such functions, all of which had been
excluded from the YARA rules because they were three bytes or less. The details are as follows:

• __errno_location and __h_errno_location (2 bytes)
    b8 ?? ?? ?? ?? : mov eax, errno    // “??” indicates a relocatable address
    c3             : ret
• rand (1 byte)
    e9 ?? ?? ?? ?? : jmp random

Table 3. Estimated identification accuracy of library functions

Malware family name     Source        Number of samples   Estimated identification accuracy
BASHLITE                GitHub [19]   1                   99.6%
Lightaidra              GitHub [19]   1                   99.6%
Lizkebab                GitHub [19]   1                   99.6%
PNScan                  GitHub [19]   1                   100%
Mirai                   GitHub [20]   1                   100%
Mirai                   Cowrie        30                  98.9% ~ 99.1%
Gafgyt                  Cowrie        30                  99.0%
Silex                   Cowrie        1                   98.7%
Xor DDoS                Cowrie        32                  100%
Unclassified samples    Cowrie        10                  98.9% ~ 100%

5. Analysis of toolchains and library functions

5.1. Target samples

Before examining all of the samples, we limited our investigation to samples with the Intel 80386 architecture to
verify that our approach can identify library functions in Linux malware. Furthermore, eight samples with dynamically
linked library functions were excluded from our dataset, because their library functions could be easily identified by
looking at their “.dynsym” section. Therefore, our final dataset consisted of 2,256 samples with statically linked library
functions for the Intel 80386 architecture.

5.2. Analysis of toolchains

5.2.1. Toolchain selection and YARA rule generation


It is likely that many malware developers use well-known toolchains for embedded devices to build malware,
rather than spending a lot of time building a toolchain for the malware development environment. Assuming that
malware developers use well-known toolchains, we selected toolchains built with the Firmware Linux series, which is
described in the installation guide of Mirai, and its successor, Aboriginal Linux, as well as three build tools chosen
based on search results. The build tools and their release dates are shown in Table 4. YARA rules were generated from
the C library and C runtime included in each toolchain. Our YARA rule files are available on GitHub [25].

Table 4. Build tools for a toolchain

Build tool                          Release date               Reference
Firmware Linux 0.9.6 ~ 0.9.11       2009/04/02 ~ 2010/03/29    [4]
Aboriginal Linux 1.0.0 ~ 1.4.5      2010/09/04 ~ 2016/01/11    [21]
Buildroot 2018.02 ~ 2019.05         2018/04/10 ~ 2019/06/02    [22]
Yocto bitbake 1.40                  2018/11/15                 [23]
crosstool-NG 1.23.0                 2017/04/20                 [24]

5.2.2. Results of toolchain identification


Toolchain identification was based not on accuracy of function identification but on coverage of library functions,
because the accuracy of function identification cannot be calculated. The accuracy of function identification depends
on the numbers of true positives, true negatives, false positives and false negatives, but the numbers of true negatives
and false positives could not be counted because the names and addresses of program-defined functions, i.e., negative
functions, had been stripped in 95.3% of samples with the Intel 80386 architecture. A coverage of at least 90% of the
library functions means that the toolchain has been identified because at most 10% of library functions are three bytes
or less, and therefore excluded from the YARA rules.
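
The coverage criterion can be sketched as follows. The 90% threshold is taken from the text; the function name identify_toolchain, the input format, the toolchain counts in the usage note, and the choice to report the candidate with the highest coverage when several rule sets are tried are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def identify_toolchain(results: Dict[str, Tuple[int, int]],
                       threshold: float = 0.90) -> Optional[Tuple[str, float]]:
    """Pick the toolchain whose rules cover at least `threshold` of the library functions.

    results maps a toolchain name to (matched, total), where `total` is the
    number of library functions in the sample and `matched` is how many of
    them were identified by that toolchain's YARA rules.
    """
    best: Optional[Tuple[str, float]] = None
    for name, (matched, total) in results.items():
        if total == 0:
            continue                     # no library functions to cover
        coverage = matched / total
        if best is None or coverage > best[1]:
            best = (name, coverage)
    if best is not None and best[1] >= threshold:
        return best
    return None                          # no candidate reaches the threshold
```

With hypothetical counts such as {"Firmware Linux 0.9.6 i586": (94, 95), "Aboriginal Linux 1.4.4 i586": (12, 95)}, the first toolchain is returned with a coverage of 94/95, while a sample whose best coverage stays below 90% is left unidentified.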
The YARA rule files were unnecessary for identifying the toolchain of 46 samples. Of these, 45 samples had
information about the toolchain, including the Linux distribution name and its version, in their comment section, and
the other sample was built with the Go language. The toolchain identification results for all 2,256 samples are shown
in Table 5. Identifying library functions required an average of 0.51 seconds per sample. Pattern matching identified
the toolchains used to build 97.7% of the samples, indicating that pattern matching was effective for function
identification of Linux malware on embedded devices. Only 12 toolchains, 11 of which are available in binary form on
the Internet, had been used to build 99.8% of the samples, as we had predicted. The most used toolchain was built with
Firmware Linux 0.9.6. This may be because the toolchain of Firmware Linux 0.9.6 is described in the installation
guide of Mirai. Two of the five samples still under investigation were built with a toolchain combining
GCC 8.2.0 i586, binutils 2.31.1 and musl 1.1.19, but we could not identify the tool used to build the toolchain itself.

Table 5. Results of toolchain identification

Building tool or Linux distribution: toolchain                                   Number of samples   Percentage
Firmware Linux 0.9.6 i586: GCC 4.1.2, binutils 2.17, uClibc 0.9.30.1             2,131               94.46%
CentOS 5.5 i386: GCC 4.1.2-46 20050519, binutils 2.17.50.0.6, GLIBC 2.5-49       41                  1.82%
Aboriginal Linux 1.4.4 i586: GCC 4.2.1, binutils 2.17, musl 1.1.12               28                  1.24%
Buildroot 2018.08 i686: GCC 7.3.0, binutils 2.31.1, uClibc-ng 1.0.30             21                  0.93%
Firmware Linux 0.9.6 i686: GCC 4.1.2, binutils 2.17, uClibc 0.9.30.1             16                  0.71%
Aboriginal Linux 1.2.6 i586: GCC 4.2.1, binutils 2.17, uClibc 0.9.33.2           7                   0.31%
Ubuntu 16.04.4 i386: GCC 5.4.0-6 20160609, binutils 2.26.1, GLIBC 2.23           2                   0.09%
Debian 7 i386: GCC 4.4.7-2, binutils 2.22-8, GLIBC 2.13-1                        1                   0.04%
Fedora Core 4 i386: GCC 4.0.0-8, binutils 2.15.94, GLIBC 2.3.5                   1                   0.04%
Aboriginal Linux 1.2.1 i586: GCC 4.2.1, binutils 2.17, uClibc 0.9.33.2           1                   0.04%
Aboriginal Linux 1.2.6 i686: GCC 4.2.1, binutils 2.17, uClibc 0.9.33.2           1                   0.04%
Go                                                                               1                   0.04%
Under further investigation                                                      5                   0.22%

5.3. Programming languages and libraries

Table 6 lists the programming languages used by the samples in Table 5 and the libraries linked to them. The most
used library, occurring in 96.5% of the samples, was uClibc, a widely known C library for Linux-based embedded
devices, which is lighter and more portable than GLIBC. GLIBC occurred in 1.99% of samples, while 1.33% of
samples used musl. The latter is a C library for Linux-based systems, and is optimized for static linking to allow an
application to be deployed as a single portable binary without significant overhead. It has been employed by
Aboriginal Linux 1.4.4 and later. We also found samples built with the C++ and Go languages. In 2019, Palo Alto
Networks reported that Go as a malware development language is still very much in its infancy, but that malware
written in Go appears to be growing in popularity [26].

Table 6. Sample language and library percentage

Programming language   C library   Number of samples   Percentage
C                      uClibc      2,177               96.50%
C                      GLIBC       43                  1.91%
C                      musl        30                  1.33%
C++                    GLIBC       2                   0.09%
Go                     -           1                   0.04%

5.4. Toolchains and Malware Families

We investigated the relationship between the malware family and its toolchains based on the toolchain
identification results in Table 5. The results are shown in Table 7. Table 7 does not include some of the malware
families listed in Table 2 because they were not included in the Intel 80386 samples. Table 7 shows that 97.3% of the
Mirai samples, 95.3% of the Gafgyt samples and the single Silex sample were built with toolchains of Firmware Linux
0.9.6. Some of the Mirai and Gafgyt samples were built with a newer toolchain of Buildroot 2018.08, and these samples
were collected around June 2019. All of the Xor DDoS samples were built with CentOS 5.5 GCC.

Table 7. Relationship between the malware family and its toolchains

Build tool or compiler            Mirai    Gafgyt   Xor DDoS   Silex
Firmware Linux 0.9.6 i586         97.2%    65.1%    0%         100%
CentOS 5.5 GCC 4.1.2-46 i386      0%       0%       100%       0%
Buildroot 2018.08 i686            0.9%     2.3%     0%         0%
Firmware Linux 0.9.6 i686         0.1%     30.2%    0%         0%
Aboriginal Linux 1.4.4 i586       1.3%     0%       0%         0%
Aboriginal Linux 1.2.6 i586       0.3%     0%       0%         0%
Aboriginal Linux 1.2.1 i586       0.01%    0%       0%         0%
Aboriginal Linux 1.2.6 i686       0%       2.3%     0%         0%

5.5. Analysis of library functions

We examined the number of library functions statically linked to each sample for which the toolchain was identified.
Figure 1 shows the number of samples in each class of the number of library functions linked to a sample. A log-log
plot is used because these numbers were widely distributed. The samples linking 90 to 100 functions
were the most prominent, and most of them are likely variants of Mirai, since Mirai built from source code contains
around 95 library functions.

Figure 1. Number of statically linked library functions and number of samples.

6. Conclusions

We collected 3,052 malware samples using the cowrie honeypot. We found that 98.5% of the samples were
statically linked, and 78.5% were stripped of all symbols. We also found that the most frequent malware family was
Mirai, accounting for 81.65% of the total, and 99.2% of samples belonged to DDoS malware families. We identified
library functions statically linked to Linux malware by matching patterns in order to assist in the analysis of Linux
malware. The results showed that pattern matching identified toolchains used to build 97.7% of 2,256 samples with
the Intel 80386 architecture, indicating that pattern matching is effective for function identification of Linux malware
on embedded devices. Only 12 toolchains were used to build 99.8% of the samples, and 11 of the toolchains are
available on the Internet, as we had predicted. We found that the programming language used by the malware was C
in 99.7% of the samples, C++ in 0.09% and Go in 0.04%. We also found that the C library was uClibc in 96.5% of the
samples, musl in 1.3% and GLIBC in 2.0%.
Future work is to identify the library functions for samples with other architectures, namely MIPS, ARM, PowerPC,
etc. Other work is the tracing of library function calls, allowing dynamic analysis at the function level, such as parsing
the history of function calling and its arguments.

References

[1] Kaspersky (2019) “IoT under fire: Kaspersky detects more than 100 million attacks on smart devices in H1 2019,”
https://www.kaspersky.com/about/press-releases/2019_iot-under-fire-kaspersky-detects-more-than-100-million-attacks-on-smart-devices-in-h1-2019.
[2] E. Cozzi et al. (2018) “Understanding Linux malware,” IEEE Symposium on Security and Privacy, pp.161-175.
[3] Anna-senpai (2016) “World’s largest net: Mirai botnet, client, echo loader, CNC source code release,”
https://github.com/jgamblin/Mirai-Source-Code/blob/master/ForumPost.md.
[4] R. Landley (2002) Firmware Linux, https://landley.net/code/firmware/old.
[5] Hex-Rays (2015) “IDA F.L.I.R.T. technology: in-depth,” https://www.hex-rays.com/products/ida/tech/flirt/in_depth/.
[6] Karlsruhe Institute for Technology CTF Team (2018) libc-database, https://kitctf.de/tools/.
[7] Maximilian v, T (2018) “Library and function identification by optimized pattern matching on compressed databases: A close to perfect
identification of known code snippets,” Proceedings of the 2nd Reversing and Offensive-oriented Trends Symposium, pp.1-12.
[8] xorpd (2015) FCatalog, https://www.xorpd.net/pages/fcatalog.html.
[9] T. Dullien et al. (2005) “Graph-based comparison of executable objects,” Symposium sur la sécurité des technologies de l’information et des
communications, http://www.zynamics.com/downloads/bindiffsstic05-1.pdf.
[10] J. Koret (2015) Diaphora, http://diaphora.re.
[11] T. Dullien (2018) “The good 0(ld) days Finding old bits of code in binaries in the hope of finding 0day,” JD-HITBSecConf,
https://t.co/48Pln7FCzs?amp=1.
[12] CONIX (2017) Machoke, https://www.conix.fr/machoke-hashing.
[13] Q. Feng et al. (2016) “Scalable graph-based bug search for firmware images,” Proceedings of the 2016 ACM SIGSAC Conference on
Computer and Communications, pp.480-491.
[14] H. Huang et al. (2017) “BinSequence: fast, accurate and scalable binary CodeReuse detection,” Proceedings of the 2017 ACM on Asia
Conference on Computer and Communications, pp.155-166.
[15] N. Pouvesle (2010) PatchDiff2, https://code.google.com/archive/p/patchdiff2.
[16] M. Iisterhof et al. (2014) Cowrie, https://www.cowrie.org.
[17] M. Sebastián et al. (2016) “Avclass: A tool for massive malware labeling,” International Symposium on Research in Attacks, Intrusions, and
Defenses, pp.230-253.
[18] Capstone (2013) capstone, https://www.capstone-engine.org.
[19] F. Ding (2017) iot-malware, https://github.com/ifding/iot-malware.
[20] J. Gamblin (2016) “Mirai-Source-Code,” https://github.com/jgamblin/Mirai-Source-Code.
[21] R. Landley (2006) Aboriginal Linux, https://landley.net/aborigina
[22] P. Korsgaard et al. (2005) Buildroot, https://buildroot.org.
[23] Yocto Project (2010) Yocto, https://www.yoctoproject.org.
[24] R. Day et al. (2007) crosstool-NG, https://crosstool-ng.github.io.
[25] S. Akabane et al. (2020) stelftools, https://github.com/shuakabane/stelftools.
[26] J. Grunzweig (2019) “The gopher in the room: Analysis of GoLang malware in the wild,” Palo Alto Networks Unit 42,
https://unit42.elegance.work/the-gopher-in-the-room-analysis-of-golang-malware-in-the-wild.
