Available online at www.sciencedirect.com
ScienceDirect
www.elsevier.com/locate/procedia
Procedia Computer Science 176 (2020) 3436–3445
Abstract
Many Linux malware samples have been found to have statically linked library functions. Much of this malware is stripped of function names and addresses, hindering function-level analysis. To support function-level analysis, we identified library functions statically linked to 2,256 malware samples with the Intel 80386 architecture by matching patterns. The pattern matching identified more than 90% of the library functions for 97.7% of the samples. Thus, pattern matching can be effective for library identification. Only 12 toolchains had been used to build 99.8% of samples, and 11 of the toolchains are available on the Internet. The C library used by the malware was uClibc in 96.5% of the samples, musl in 1.3% and GLIBC in 2.0%.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the KES International.
Keywords: Library function identification, library identification, toolchain identification, Linux malware analysis, pattern matching
1. Introduction
Cyberattacks from malware-infected IoT devices have been observed around the world since 2016, with Kaspersky reporting 100 million attacks detected in the first half of 2019 alone [1]. Since many IoT devices run on Linux for embedded devices, the amount of Linux malware is increasing rapidly, resulting in the demand for analysis of Linux malware and more sophisticated analysis techniques.
Malware analysis can be divided into surface, dynamic and static analyses. Surface analysis examines the similarity to previously analyzed malware by checking character strings and function names contained in the malware file to
obtain information crucial for dynamic and static analyses. Dynamic analysis examines the behavior of malware by
executing it in a sandbox based on the information obtained from the surface analysis. Static analysis finds unknown
functions that were not observed in the dynamic analysis, including functions that were hidden by an anti-sandbox.
A large number of Linux malware samples with statically linked library functions have been found [2]. Many of them are stripped of all symbols such as function names and addresses, hindering function-level analysis. For example, surface analysis cannot identify malware based on function names because the names of functions statically linked to the malware cannot be obtained; dynamic analysis cannot trace library calls made by a program with statically linked functions; and static analysis requires a long time to find unknown functions using a disassembler, for lack of information about the library functions. Even the most recent comprehensive analysis of various Linux malware did not analyze statically linked library functions [2].
Various methods of identifying functions in malware have been proposed. Function identification by pattern
matching is difficult because compiler-generated machine code differs for each toolchain, which is a set of
programming tools such as a compiler, C library, assembler, etc. Moreover, function identification based on the
similarity of machine code and/or its control flow graph gives false results if two different functions have the same
sequence of instructions but different registers or different immediate data. In fact, many C library functions result in
false positives for this reason.
Many Linux malware samples use cross-compilers and libraries for embedded devices [2]. In addition, since much Linux malware is not equipped with anti-analysis features such as packing, sandbox detection and anti-execution [2], it is likely to use a well-known toolchain rather than a toolchain customized for anti-analysis. Indeed, our pilot study has confirmed that many malware samples have been built with the toolchain of Firmware Linux 0.9.6, which is described in the Mirai installation guide [3].
Because of the high diversity of C libraries for desktop computers and servers, pattern matching for identification
of these library functions requires generation of various patterns from a large number of libraries. This requires a lot
of time and effort. In contrast, there are few C libraries for embedded devices, and fewer versions. Furthermore,
malware developers prefer toolchains built with Firmware Linux series [4]. Therefore, pattern matching should be
effective for identifying library functions.
As a first step towards the analysis of Linux malware with statically linked library functions, we aim to identify
library functions by matching patterns generated from the library in toolchains built with Firmware Linux series and
other building tools highly ranked in the search results. To confirm that the library functions could be identified using
pattern matching, we generated patterns for library functions and attempted to identify the library functions and
toolchains used by 2,256 malware samples with the Intel 80386 architecture collected by a honeypot in our laboratory.
We investigated libraries and programming languages of the malware, and toolchains frequently used by malware
families.
Our findings are:
• Pattern matching identified the toolchains used to build 97.7% of samples, and hence pattern matching can be effective for identification of library functions statically linked to Linux malware on embedded devices.
• 99.8% of malware samples were built using only 12 toolchains, of which 11 are available on the Internet.
• 94.5% of malware samples were built with only the toolchain of Firmware Linux 0.9.6, which is described in the Mirai installation guide.
• The malware programming language was C in 99.7% of cases, C++ in 0.09%, and Go in 0.04%.
• The C library statically linked to the malware was uClibc in 96.5% of cases, musl in 1.3%, and GLIBC in 2.0%.
2. Related work
IDA F.L.I.R.T. [5] performs signature matching. The signatures consist of the first 32 bytes of the function prolog
and a CRC checksum of the remaining function body. However, the checksum is only computed until the first variant
byte appears, so unique matching of C library functions will often not be achieved.
Libc-database [6] identifies Linux distributions and GLIBC libraries by using the database of functions in GLIBC
of various Linux distributions. Given some function names and starting addresses, the libc-database can identify their
Linux distribution and GLIBC library. However, malware stripped of all symbols does not contain function names or starting addresses, which makes identification difficult. In addition, since the libc-database is built for Linux on
3438 Shu Akabane et al. / Procedia Computer Science 176 (2020) 3436–3445
desktop computers or servers such as Ubuntu, not for embedded Linux, it cannot be used to identify libraries of Linux
malware running on embedded devices such as IoT devices.
KISS-lib [7] identifies functions by efficient pattern matching. It combines a large database, such as libc-database, with a fast search method, such as IDA F.L.I.R.T. KISS-lib achieved a success rate of greater than 99.9% for function
identification from fragment data leaked by exploits. However, only 17.2% of the library functions statically linked
to a program were identified, so KISS-lib is unsuitable for malware analysis.
FCatalog [8] identifies functions by comparing instruction sequences of pairs of functions. It can identify a function
even if the order of instructions in the pair differs or an instruction is replaced with an alternative. However, it may falsely identify different functions as being the same if they have the same sequence of instructions but different registers or different immediate data in some of the instructions. BinDiff [9] and related methods [10-13] identify a
function by comparing control flow graphs of two functions. They are tolerant to instruction replacement and sequence
reordering in a node of the control flow graph, but they may falsely identify different functions as being the same if
they have the same control flow graph.
BinSequence [14] combines instruction sequence similarity and control flow graph similarity to identify functions.
BinSequence identified functions with the highest accuracy when compared to four other methods: BinDiff, FCatalog,
Diaphora, and PatchDiff2 [15]. Like the methods above, BinSequence may falsely identify different functions as being the same if a pair of functions have a similar sequence of instructions or similar control flow graphs.
3. Dataset
We collected malware samples using cowrie [16], a well-known low-interaction honeypot for SSH and Telnet services, intermittently between August 2017 and September 2019. Non-ELF malware files, such as Perl scripts, were excluded from the collected samples. Our dataset consisted of 3,052 ELF executables, covering nine different architectures. The percentage distribution of the architectures is shown in Table 1. The most frequent was Intel 80386, and 93.9% of all the samples had a 32-bit architecture, probably because 32-bit architecture is used in most embedded devices.
Furthermore, using the file command, we found that 98.5% of the samples were statically linked, and 78.5% were
stripped of all symbols, e.g., function names, function addresses, etc. In other words, in 78.5% of the samples, the
library functions were cloaked in secrecy.
All samples were scanned by VirusTotal, then fed into the AVClass tool [17] to classify them into malware families.
The percentage distribution of the malware families in the collection is shown in Table 2. The most frequent malware
family was Mirai, accounting for 81.65% of the total, and 99.2% of samples belonged to DDoS malware families: Mirai, Gafgyt, Xor DDoS, Tsunami, DDosSTF, ChinaZ, and Setag. AVClass failed to classify 19 samples into a family.
To evade detection by pattern matching, 30.9% of the samples had been packed. Because a packed sample is compressed or encrypted, it must be unpacked before its library functions can be identified. Of the packed samples, 87.7% were
UPX-packed, and we unpacked them with UPX. The remaining 12.3% of the samples were packed with some other
packers, and we ran each of them in a sandbox and extracted their unpacked executable code from the virtual memory
just after unpacking.
Identification of library functions was accomplished by comparing the machine code of functions statically linked
to malware with that of static library functions in toolchains frequently used in embedded devices. The assumption
was that many samples are built with compilers frequently used for embedded devices. For pattern matching, we used
YARA, a pattern matching tool for malware researchers. A YARA rule for malware detection corresponds to a
signature of malware such as byte sequences specific to that malware, while a YARA rule for library identification
corresponds to machine code in a library function.
Static library functions linked to ELF executables are identical to functions in static libraries such as libc.a and
libstdc++.a, except for some relocatable addresses they contain, since they are overwritten by a linker according to
their relocation type on a relocation table. To generate YARA rules for function identification, we first retrieved each
function in all the static libraries in turn and generated a YARA rule for each function. The YARA rule consisted of
a hexadecimal string of machine code in each function to a maximum of 200 bytes. Relocatable addresses were
replaced by a YARA wildcard. Some library functions can be very short, e.g., only one byte. To prevent such short code fragments from being falsely identified as functions, YARA rules were not generated for functions of three bytes or fewer, not counting wildcards.
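The rule-generation step described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `make_yara_rule`, the rule naming scheme, and the 4-byte width assumed for a relocated field on i386 are all assumptions, and the short-function cutoff follows our reading that wildcard bytes are not counted towards the three-byte limit.

```python
MAX_PATTERN_BYTES = 200  # maximum pattern length stated in the text
MIN_FIXED_BYTES = 4      # functions with 3 or fewer non-wildcard bytes are skipped


def make_yara_rule(name, code, reloc_offsets, reloc_size=4):
    """Return YARA rule text for one library function, or None if it is too short.

    name          -- rule identifier (hypothetical naming scheme)
    code          -- machine code of the function as bytes
    reloc_offsets -- offsets in `code` where a relocatable address starts
    reloc_size    -- assumed width of a relocated field (4 bytes on i386)
    """
    code = code[:MAX_PATTERN_BYTES]

    # Collect every byte position that the linker may overwrite.
    wildcard = set()
    for start in reloc_offsets:
        wildcard.update(range(start, min(start + reloc_size, len(code))))

    # Emit "??" for relocated bytes and two hex digits for fixed bytes.
    tokens = ["??" if i in wildcard else f"{b:02X}" for i, b in enumerate(code)]

    fixed = sum(1 for t in tokens if t != "??")
    if fixed < MIN_FIXED_BYTES:
        return None

    hex_string = " ".join(tokens)
    return (
        f"rule {name}\n"
        "{\n"
        "    strings:\n"
        f"        $code = {{ {hex_string} }}\n"
        "    condition:\n"
        "        $code\n"
        "}\n"
    )


# Example: push ebp; mov ebp, esp; call <relocated>; leave; ret
example = make_yara_rule(
    "libc_example_func",
    bytes.fromhex("5589E5E800000000C9C3"),
    reloc_offsets=[4],
)
```

In this example the four bytes of the call target are replaced by wildcards, so the same rule matches the function regardless of where the linker placed the callee.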
Applying the YARA rules, we defined the identification accuracy of library functions by using the classes of a confusion matrix: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). To define these classes, we showed the order in which all functions are linked to the ELF executable file by the GNU linker.
Based on these classes in the confusion matrix, we defined the detection accuracy of the library functions:
We could not confirm whether the library functions were identified correctly in 78.5% of the samples because their
function names and addresses had been stripped. We estimated TP, TN, FP and FN based on the order of the functions
linked by the GNU linker as described in Section 4.2.
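The accuracy formula is not reproduced in this text. As an illustration only, the sketch below assumes the conventional definition of accuracy over a confusion matrix, (TP + TN) / (TP + TN + FP + FN); the function name and the example counts are hypothetical and are not taken from the paper's results.

```python
def identification_accuracy(tp, tn, fp, fn):
    """Conventional accuracy over confusion-matrix counts (assumed definition).

    tp -- library functions matched by a rule
    tn -- program-defined functions not matched by any rule
    fp -- program-defined functions wrongly matched by a rule
    fn -- library functions not matched by any rule
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no functions to evaluate")
    return (tp + tn) / total


# Hypothetical counts for a single sample, for illustration only.
example_accuracy = identification_accuracy(tp=95, tn=40, fp=0, fn=3)
```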
First, we used Capstone [18], a lightweight multi-architecture disassembly platform, to find the functions in the
malware. Then, we scanned the malware using YARA to find the library functions, which we then sorted in virtual
address order. Based on the order of the functions linked by the GNU linker, the first found function just after the first
set of program-defined functions is assumed to be the first library function since program-defined functions are not
detected by our YARA rules, and the last found function just before the last set of C runtime functions is assumed to
be the last library function. That is, all functions from the first library function to the last library function are assumed
to be library functions. The first function just after the first set of C runtime functions is assumed to be the first
program-defined function, and the last function just before the first library function is assumed to be the last program-
defined function. That is, all functions from the first program-defined function to the last program-defined function
are assumed to be program-defined functions. Under this assumption, if the true first library function is missed, the functions between the true first library function and the falsely assumed first library function are regarded as program-defined functions, resulting in an overestimate of identification accuracy. This overestimate can be manually confirmed and corrected.
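The ordering-based estimate above can be sketched as follows. This is a simplified illustration under stated assumptions: the leading and trailing C runtime functions are taken to have been removed from the input already, the function and key names are hypothetical, and the disassembly and rule scanning steps are replaced by plain lists and sets of addresses.

```python
def split_by_link_order(function_addrs, matched_addrs):
    """Partition functions into assumed program-defined and library regions.

    function_addrs -- function start addresses found by the disassembler, with
                      the C runtime functions at both ends already removed
                      (assumption of this sketch)
    matched_addrs  -- set of start addresses matched by a library-function rule
    """
    ordered = sorted(function_addrs)
    hits = [addr for addr in ordered if addr in matched_addrs]
    if not hits:
        return {"program_defined": ordered, "library": [],
                "missed": [], "after_library": []}

    # The first and last matched functions bound the assumed library region.
    first_lib, last_lib = hits[0], hits[-1]
    library = [a for a in ordered if first_lib <= a <= last_lib]
    return {
        # Everything before the first match is assumed to be program-defined.
        "program_defined": [a for a in ordered if a < first_lib],
        "library": library,
        # Unmatched functions inside the library region count as misses.
        "missed": [a for a in library if a not in matched_addrs],
        # Anything past the last match is left unclassified here.
        "after_library": [a for a in ordered if a > last_lib],
    }


regions = split_by_link_order(
    [0x1000, 0x1010, 0x1020, 0x1030, 0x1040],
    {0x1020, 0x1040},
)
```

If the function at 0x1010 were in fact the first library function but no rule matched it, this sketch would still place it in the program-defined region, which is the overestimate the text describes.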
We evaluated estimates of identification accuracy without function names and addresses for 93 unstripped samples
and five samples built from source code with a toolchain of Aboriginal Linux 1.2.0 i586. There were 30 unstripped samples from the Mirai family, 30 from Gafgyt, 1 from Silex, and 32 from Xor DDoS. We confirmed that
the estimates of identification accuracy were the same as the true identification accuracy for all samples. In addition,
we visually confirmed that the first and last library functions were correctly identified in all samples, and hence our
estimates of identification accuracy were valid. The estimates of the identification accuracy are listed in Table 3.
We examined functions not identified by YARA. There were only three such functions, all of which had been
excluded from the YARA rules because they were three bytes or less. The details are as follows:
Before examining all of the samples, we limited our investigation to samples with the Intel 80386 architecture to
verify that our approach can identify library functions in Linux malware. Furthermore, eight samples with dynamically
linked library functions were excluded from our dataset, because their library functions could be easily identified by
looking at their “.dynsym” section. Therefore, our final dataset consisted of 2,256 samples with statically linked library
functions for the Intel 80386 architecture.
Table 5. Toolchain identification results.
Toolchain                                                                   Samples  Percent
Firmware Linux 0.9.6 i586: GCC 4.1.2, binutils 2.17, uClibc 0.9.30.1          2,131   94.46%
CentOS 5.5 i386: GCC 4.1.2-46 20050519, binutils 2.17.50.0.6, GLIBC 2.5-49       41    1.82%
Aboriginal Linux 1.4.4 i586: GCC 4.2.1, binutils 2.17, musl 1.1.12               28    1.24%
Buildroot 2018.08 i686: GCC 7.3.0, binutils 2.31.1, uClibc-ng 1.0.30             21    0.93%
Firmware Linux 0.9.6 i686: GCC 4.1.2, binutils 2.17, uClibc 0.9.30.1             16    0.71%
Aboriginal Linux 1.2.6 i586: GCC 4.2.1, binutils 2.17, uClibc 0.9.33.2            7    0.31%
Ubuntu 16.04.4 i386: GCC 5.4.0-6 20160609, binutils 2.26.1, GLIBC 2.23            2    0.09%
Debian 7 i386: GCC 4.4.7-2, binutils 2.22-8, GLIBC 2.13-1                         1    0.04%
Fedora Core 4 i386: GCC 4.0.0-8, binutils 2.15.94, GLIBC 2.3.5                    1    0.04%
Aboriginal Linux 1.2.1 i586: GCC 4.2.1, binutils 2.17, uClibc 0.9.33.2            1    0.04%
Aboriginal Linux 1.2.6 i686: GCC 4.2.1, binutils 2.17, uClibc 0.9.33.2            1    0.04%
Go                                                                                1    0.04%
Under further investigation                                                       5    0.22%
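The aggregate figures quoted in the findings can be rechecked from the sample counts in the rows above. The sketch below is only an arithmetic check: the grouping of uClibc-ng with uClibc and the treatment of the Go row as having no C library are assumptions of this example, and the names used are hypothetical.

```python
TOTAL_SAMPLES = 2256

# (toolchain, C library family, sample count), copied from the rows above.
ROWS = [
    ("Firmware Linux 0.9.6 i586", "uClibc", 2131),
    ("CentOS 5.5 i386", "GLIBC", 41),
    ("Aboriginal Linux 1.4.4 i586", "musl", 28),
    ("Buildroot 2018.08 i686", "uClibc", 21),  # uClibc-ng grouped with uClibc (assumption)
    ("Firmware Linux 0.9.6 i686", "uClibc", 16),
    ("Aboriginal Linux 1.2.6 i586", "uClibc", 7),
    ("Ubuntu 16.04.4 i386", "GLIBC", 2),
    ("Debian 7 i386", "GLIBC", 1),
    ("Fedora Core 4 i386", "GLIBC", 1),
    ("Aboriginal Linux 1.2.1 i586", "uClibc", 1),
    ("Aboriginal Linux 1.2.6 i686", "uClibc", 1),
    ("Go", None, 1),  # treated as having no C library (assumption)
]
UNDER_INVESTIGATION = 5


def percent(count, total=TOTAL_SAMPLES):
    """Share of `count` in `total`, in percent."""
    return 100.0 * count / total


def library_share(family):
    """Percentage of all samples whose toolchain uses the given C library family."""
    return percent(sum(count for _, lib, count in ROWS if lib == family))


identified = sum(count for _, _, count in ROWS)
identified_share = percent(identified)
uclibc_share = library_share("uClibc")
glibc_share = library_share("GLIBC")
```

Under these groupings, the 12 identified rows cover about 99.8% of the samples, uClibc accounts for about 96.5%, and GLIBC for about 1.99%.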
Table 6 lists the programming languages used by the samples in Table 5 and the libraries linked to them. The most used library, occurring in 96.5% of the samples, was uClibc, a widely known C library for Linux-based embedded devices, which is lighter and more portable than GLIBC. GLIBC occurred in 1.99% of samples, while 1.33% of samples used musl. The latter is a C library for Linux-based systems, and is optimized for static linking to allow an application to be deployed as a single portable binary without significant overhead. It has been employed by Aboriginal Linux 1.4.4 or later. We also found samples built with the C++ and Go languages. In 2019, Palo Alto Networks reported that Go as a malware development language is still very much in its infancy, but malware written in Go appears to be growing in popularity [26].
We investigated the relationship between malware families and their toolchains based on the toolchain identification results in Table 5. The results are shown in Table 7. Table 7 does not include some of the malware families listed in Table 2 because they were not among the Intel 80386 samples. Table 7 shows that 97.3% of the Mirai samples, 95.3% of the Gafgyt samples and the single Silex sample were built with toolchains of Firmware Linux 0.9.6. Some of the Mirai and Gafgyt samples were built with a newer toolchain, Buildroot 2018.08, and these samples were collected around June 2019. All of the Xor DDoS samples were built with the CentOS 5.5 GCC toolchain.
We examined the number of library functions statically linked to each sample whose toolchain was identified. Because these numbers were widely distributed, Figure 1 shows a log-log plot of the number of library functions linked to a sample, grouped into classes, against the number of samples belonging to each class. The samples linking 90 to 100 functions were the most prominent, and most of them are likely variants of Mirai, since Mirai built from source code contains around 95 library functions.
6. Conclusions
We collected 3,052 malware samples using the cowrie honeypot. We found that 98.5% of the samples were
statically linked, and 78.5% were stripped of all symbols. We also found that the most frequent malware family was
Mirai, accounting for 81.65% of the total, and 99.2% of samples belonged to DDoS malware families. We identified
library functions statically linked to Linux malware by matching patterns in order to assist in the analysis of Linux
malware. The results showed that pattern matching identified toolchains used to build 97.7% of 2,256 samples with
the Intel 80386 architecture, indicating that pattern matching is effective for function identification of Linux malware
on embedded devices. Only 12 toolchains were used to build 99.8% of the samples, and 11 of the toolchains are
available on the Internet, as we had predicted. We found that the programming language used by the malware was C
in 99.7% of the samples, C++ in 0.09% and Go in 0.04%. We also found that the C library was uClibc in 96.5% of the
samples, musl in 1.3% and GLIBC in 2.0%.
Future work includes identifying the library functions for samples with other architectures, such as MIPS, ARM and PowerPC. Another direction is the tracing of library function calls, which would allow dynamic analysis at the function level, such as parsing the history of function calls and their arguments.
References
[1] Kaspersky (2019) “IoT under fire: Kaspersky detects more than 100 million attacks on smart devices in H1 2019,” https://www.kaspersky.com/about/press-releases/2019_iot-under-fire-kaspersky-detects-more-than-100-million-attacks-on-smart-devices-in-h1-2019.
[2] E. Cozzi, et al. (2018) “Understanding Linux malware,” IEEE Symposium on Security and Privacy, pp.161-175.
[3] Anna-senpai (2016) “World’s largest net: Mirai botnet, client, echo loader, CNC source code release,” https://github.com/jgamblin/Mirai-Source-Code/blob/master/ForumPost.md.
[4] R. Landley (2002) Firmware Linux, https://landley.net/code/firmware/old.
[5] Hex-Rays (2015) “IDA F.L.I.R.T. technology: in-depth,” https://www.hex-rays.com/products/ida/tech/flirt/in_depth/.
[6] Karlsruhe Institute for Technology CTF Team (2018) libc-database, https://kitctf.de/tools/.
[7] Maximilian v, T (2018) “Library and function identification by optimized pattern matching on compressed databases: A close to perfect
identification of known code snippets,” Proceedings of the 2nd Reversing and Offensive-oriented Trends Symposium, pp.1-12.