Information Systems Security

An Improved Model for Detecting DGA botnets Using Random Forest Algorithm

Journal: Information Security Journal: A Global Perspective
Manuscript ID: UISS-2020-0216.R1
Manuscript Type: Original Article
Keywords (selected): Information Security and Risk Management, Application Security, Telecommunications and Network Security
Keywords (author supplied): DGA botnet detection, fast-flux botnet detection, botnet detection model, machine learning-based botnet detection
URL: http://mc.manuscriptcentral.com/uiss Email: uiss-peerreview@journals.tandf.co.uk

Recently, the detection of botnets, and especially DGA botnets, has attracted the interest of many researchers worldwide because of the botnets’ wide spread, high sophistication and serious consequences for many organizations and users. Several approaches based on statistics and machine learning techniques have been proposed to detect DGA botnets. The key idea of these approaches is to construct detection models that distinguish legitimate domain names from botnet-generated domain names. Although the initial results are promising, the false alarm rates of these approaches are still high. This paper extends a machine learning-based detection model proposed in previous research by adding new domain classification features in order to reduce the false alarm rates and to increase the detection rate. Extensive experiments on a large dataset of domain names used by various DGA botnets confirm that the improved detection model outperforms the original model and some other previous DGA botnet detection models. The proposed model’s false alarm rates are at most 3.02%, and its overall detection accuracy and F1-score are both 97.03%.

Keywords: DGA botnet detection, fast-flux botnet detection, botnet detection model, machine learning-based botnet detection

1 Introduction

Over the last decade, botnets have been considered one of the major security threats to Internet-based information systems, individual connected devices and Internet users (Spamhaus Malware Labs, 2020; Eremin, 2019; Smith, 2019). This is because botnets have been associated with many types of Internet-based attacks and misuse, such as large-scale DDoS attacks, email spamming, malware distribution, virtual click generation and theft of sensitive information. For example, Telegram suffered a large-scale DDoS attack in 2019 that it claims originated from China and was related to the protests in Hong Kong (Smith, 2019). In another notable DDoS attack in 2019, Finland’s Parliamentary Election result services were targeted. These services are used by the Finnish government to communicate the election outcome to the general population (Smith, 2019). According to Symantec, botnets produced about 95% of spam emails on the Internet in 2010 (“Symantec: Botnets now produce 95% of spam,” 2010). Other dangerous types of botnet-assisted attacks include web injection, URL spoofing, DNS spoofing and sensitive data collection. The main targets of botnet-assisted attacks are usually financial and governmental organizations.

Throughout their development, botnets have constantly evolved on the Internet in terms of scale and the sophistication of their control techniques (Alieyan et al., 2017; Li et al., 2017). Generally, a botnet is a network of Internet-connected devices that have been infected with a special type of malware, called a bot (Alieyan et al., 2017; Li et al., 2017). Bots are usually created by hacking groups, called botmasters. A bot running on an Internet-connected device allows the botmaster to control the device remotely. The bot-infected device can be a computer, a smartphone, or an IoT device. Bots differ from other types of malware in that they are highly autonomous and capable of using communication channels to receive commands and code updates from their control system. They can also periodically send their working status to their control system. On the other side, the botmaster first sets up and configures the botnet control system, or Command & Control (C&C) server, and then sends commands and code updates to bots through the C&C server (Alieyan et al., 2017; Li et al., 2017).

To connect to the C&C server to get commands and code updates, bots in the botnet send DNS queries containing the C&C server’s fully qualified domain name to the local DNS system to obtain the server’s IP address. The bots’ DNS queries look similar to DNS queries sent by legitimate applications. However, a C&C server with a static name and IP address is easy to scan, detect and block. To avoid this, the botmaster constantly changes the server’s name and IP address and updates them on the nominated DNS server using special methods, such as Fast Flux (FF) or Domain Generation Algorithms (DGA). Bots are also programmed to automatically generate C&C server names using the same methods. Therefore, bots can still find the C&C server’s IP address by generating the server’s names, putting them into queries and sending these queries to the DNS system. Since server names generated by bots are generally different from legitimate domain names, we can monitor and capture DNS queries and then extract the domain names in these queries for analysis to find signatures of bots and botnet activities (Alieyan et al., 2017; Li et al., 2017).

This paper improves the botnet detection model proposed in (Hoang & Nguyen, 2018) by adding new domain classification features in order to increase the detection rate and to reduce the false alarm rates. The rest of the paper is organized as follows: Section 2 presents related works and Section 3 describes the improved DGA botnet detection model. Section 4 presents our experiments and results, and Section 5 concludes the paper.

2 Related Works

2.1 Introduction to DGA Botnets

As mentioned in Section 1, many botnets use DGA techniques to automatically generate and register different domain names for their C&C servers in order to prevent these servers from being controlled and blacklisted (Alieyan et al., 2017; Li et al., 2017; Hoang & Nguyen, 2018). The main reason for using DGA techniques is to make the control and revocation of registered domain names more complex. These botnets are called DGA-based botnets or just DGA botnets. DGA techniques apply operators to variables whose values change constantly, such as the year, month and day, to generate pseudo-random domain names. For example, one type of DGA technique is implemented by a function that consists of 16 rounds. Each round generates one character of the domain name as follows (Alieyan et al., 2017; Li et al., 2017):

    year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
    month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
    day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
    domain += chr(((year ^ month ^ day) % 25) + 97)

where:

• ‘^’ is the bitwise exclusive OR (XOR)
• ‘%’ is the modulus
• ‘>>’ is the right shift
• ‘<<’ is the left shift
• ‘&’ is the bitwise AND.

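The 16-round procedure can be written out as a short function. This is a minimal Python sketch, not the code of any particular botnet: the function name `generate_domain` and the `rounds` parameter are hypothetical, the constants are copied from the listing in the text, and Python's unbounded integers are assumed to be an acceptable stand-in for whatever integer width an actual implementation uses.

```python
def generate_domain(year: int, month: int, day: int, rounds: int = 16) -> str:
    """Generate one pseudo-random domain label from a date seed (illustrative)."""
    domain = ""
    for _ in range(rounds):
        # In Python, '^' is bitwise XOR, '>>' and '<<' are shifts, '&' is bitwise AND.
        year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
        month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
        day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
        # Map the mixed state to one of 25 lowercase letters ('a' to 'y').
        domain += chr(((year ^ month ^ day) % 25) + 97)
    return domain


if __name__ == "__main__":
    # The same date always yields the same label, which is what lets the
    # botmaster and the bots agree on a name without communicating.
    print(generate_domain(2020, 6, 15))
```

Because the output depends only on the date, both sides of the botnet can compute the same name independently, which is the property the detection approaches in this paper try to exploit.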
Fig. 1. Example of a botnet using DGA technique to generate, register and query C&C server names

Fig. 1 describes a DGA technique in the botnet operation. On one side, the botmaster generates pseudo-random domain names using an algorithm, registers these domain names with the nominated DNS system and then assigns them to the C&C server. On the other side, a bot in the botnet generates a domain name for the botnet C&C server using the same algorithm and then sends a query to the local DNS system to find the IP address corresponding to the generated domain name. If the C&C server’s IP address is found, the bot creates a connection to the server to get commands and code updates. If the DNS query fails, or no valid IP address is found, the bot generates another domain name and repeats the procedure to find the IP address of the botnet C&C server.

2.2 Review of DGA Botnet Detection Proposals

As mentioned in Section 2.1, bots in a botnet use the local DNS servers to find the IP address of the botnet’s C&C server in their daily activities. Therefore, monitoring and analyzing DNS traffic or traces can help to detect the activities of bots and botnets. In this direction, there have been several proposals for botnet detection, such as Villamari-Salomo & Brustoloni (2008), Perdisci et al. (2009), Jiang et al. (2010), Stalmans & Irwin (2011), Antonakakis et al. (2011), Bilge et al. (2011), Yadav et al. (2012), Kheir et al. (2014), Woodbridge et al. (2016), Truong & Cheng (2016), Hoang & Nguyen (2018), Qiao et al. (2019), Zhao et al. (2019) and Hostiadi et al. (2020). In this section, we review some closely related proposals for DGA botnet detection, including Truong & Cheng (2016), Hoang & Nguyen (2018), Qiao et al. (2019), Zhao et al. (2019) and Hostiadi et al. (2020).

Truong & Cheng (2016) propose a method to detect domain-flux botnets using DNS traffic features. They use DNS domain features, including the domain length and the expected value, to distinguish between legitimate domain names and pseudo-random domain names (PDN) generated by some botnets. The expected value of a domain name is calculated based on the character distribution of the 100,000 most popular legitimate domain names ranked by Alexa (PN Pedia, n.d.). The experimental dataset consists of the 100,000 most popular legitimate domain names ranked by Alexa (PN Pedia, n.d.) and about 20,000 PDN domain names generated by the Conficker and Zeus botnets. Several supervised machine learning algorithms, including naive Bayes, kNN, SVM, decision tree and random forest, have been used to construct and validate the proposed botnet detection model. Experimental results show that the decision tree is the algorithm that gives the highest overall detection accuracy, 92.30%, with a false positive rate of 4.80%. Although the proposed model’s overall detection accuracy is relatively high, its false positive and false negative rates are also high, totaling about 7.70% in the best case.

Similarly, Hoang & Nguyen (2018) propose a DGA botnet detection model based on the classification of legitimate and botnet-generated domain names using supervised machine learning techniques. They propose to use 18 domain features, including 16 n-gram features and 2 vowel distribution features, to construct and validate the proposed model. Among the 16 n-gram features, 8 features are calculated based on each domain’s 2-gram substrings and the other 8 features are calculated based on the domain’s 3-gram substrings. The experimental dataset consists of 30,000 top legitimate domain names ranked by Alexa (PN Pedia, n.d.) and 30,000 malicious domain names used by DGA botnets (Netlab 360, n.d.). Traditional supervised machine learning algorithms, such as naive Bayes, kNN, decision tree and random forest, have been used to build and validate the proposed model. Various experiments have been conducted using different testing scenarios. The experimental results confirm that machine learning techniques can be effectively used to detect botnets based on the classification of legitimate and algorithm-generated domain names used by botnets. The experimental results also confirm that the random forest algorithm produces the highest overall detection rate, over 90%. However, the proposed model has two major issues: (1) the experimental dataset for each testing scenario is small compared to other approaches and (2) the false positive rate is relatively high at 9.30%. A small experimental dataset reduces the reliability of the results, while a high false positive rate restricts the applicability of the proposed model in practice.

Recently, Qiao et al. (2019) propose a method for the classification of DGA domain names based on Long Short-Term Memory (LSTM) with an attention mechanism. LSTM is a type of supervised deep learning method, and this is a relatively new approach in the security field. In the proposed method, each domain name passes through the pre-processing steps of DGA string extraction, padding and embedding. It is then converted into a 54×128 matrix for training and testing. The experimental dataset consists of the top one million legitimate domain names ranked by Alexa (PN Pedia, n.d.) and 1,675,404 malicious domain names generated by various DGA botnets (Netlab 360, n.d.). Experimental results show that the proposed method performs better than current state-of-the-art methods, with an average F1-score of 94.58%. By using the LSTM learning method, the proposed model does not need a separate feature extraction process. However, the paper does not provide any information about the model’s complexity or its computational resource requirements. In addition, although the false alarm rates are not provided in the paper, they are relatively high at about 5%, which can be deduced from the precision and recall, both of which are about 95%.

On the other hand, Zhao et al. (2019) propose a statistics-based method to detect malicious domain names using the n-gram technique. Each domain name in the training set of legitimate domains is first divided into sequences of substrings using the 3, 4, 5, 6 and 7-gram technique. Then, the statistics and weight values of the substrings of all training domains are calculated to form the ‘profile’. To decide whether an input domain name is legitimate or malicious, the domain name is also first divided into sequences of substrings using the 3, 4, 5, 6 and 7-gram technique. Then, the statistics of the domain name’s substrings are calculated and used to compute the ‘reputation value’ of the domain name based on the ‘profile’. A domain reputation threshold is generated for each category of malicious domain names using the ‘profile’. If the domain name’s reputation value is greater than the threshold, it is legitimate. Otherwise, it is malicious. Experimental results show that the proposed approach achieves a detection accuracy of 94.04%. However, the proposed approach’s detection performance depends heavily on the selection of the domain reputation threshold, which is currently generated and selected manually. Furthermore, its false positive and false negative rates are fairly high at 6.14% and 7.42%, respectively.

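The n-gram decomposition that statistical methods of this kind start from can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the helper names `ngrams` and `multi_ngrams` are hypothetical, overlapping (sliding-window) n-grams are assumed, and the ‘profile’ and ‘reputation value’ computations of Zhao et al. (2019) are not reproduced here.

```python
from typing import Dict, List, Sequence


def ngrams(name: str, n: int) -> List[str]:
    """Return all overlapping substrings of length n (sliding window)."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def multi_ngrams(name: str, sizes: Sequence[int] = (3, 4, 5, 6, 7)) -> Dict[int, List[str]]:
    """Split a domain label into n-grams for each requested size."""
    return {n: ngrams(name, n) for n in sizes}


if __name__ == "__main__":
    for size, grams in multi_ngrams("example").items():
        print(size, grams)
```

A label shorter than n simply yields an empty list for that size, so short domain names contribute no statistics for the longer n-grams.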
Using a different approach to detect botnet activities, Hostiadi et al. (2020) propose the B-Corr model for bot group activity detection based on the analysis of network flow traffic. Instead of detecting a single bot’s activities, the B-Corr model focuses on detecting the activities of bot groups. Bot group activity detection can help network administrators to isolate an activity or access of bot group attacks, to determine the relations among bots and to measure their correlation. The B-Corr model consists of three phases: (i) feature extraction from bot activity flows, (ii) intersection measurement among bots, and (iii) similarity value production. The B-Corr model groups similar bots with a similar target to identify the activities of the bot group. To give a more comprehensive view, B-Corr visualizes the similarity values among bots in the form of a similar bot graph. Experiments conducted on various scenarios using real botnet datasets show a detection accuracy of 89.16% for bot group activity IP addresses. The advantage of the proposed approach is that it is able to detect the activities of bot groups with high accuracy. However, the reliability of the detection accuracy is questionable because the numbers of bot groups and group activities found in the experimental datasets are small.

Table 1 compares the previous proposals in the following aspects: the technique used to construct the detection model, the accuracy (ACC) and F1-score, and the advantages and disadvantages.

Table 1. Comparison of previous botnet detection proposals

| Approaches | Used technique | ACC (%) | F1 (%) | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Truong & Cheng (2016)-J48 | J48 decision tree | 92.30 | | Simple and fast | High false alarm rates (about 7.70%) |
| Hoang & Nguyen (2018) | Various learning methods | 90.90 | 90.90 | Relatively simple and fast | Small dataset; high false positive rate (about 9.30%) |
| Qiao et al. (2019) | LSTM deep learning | | 94.58 | High accuracy | Requires extensive computing resources; false alarm rates (about 5%) |
| Zhao et al. (2019) | n-gram statistics | 94.04 | | High accuracy | Difficult to select detection threshold; high false alarm rates (FNR = 7.42%) |
| Hostiadi et al. (2020) | Statistics | 89.16 | | Can detect activities of bot group | High false alarm rates (about 10%) |

In this paper, we extend the idea proposed by Hoang & Nguyen (2018) of using machine learning techniques for botnet detection. Machine learning techniques have been widely used in many areas, including computer science and security (Alieyan et al., 2017; Li et al., 2017; Sangani & Zarger, 2017; Chatrati et al., 2020; Gaurav et al., 2020; Taheri et al., 2020-1; Taheri et al., 2020-2). Traditional learning techniques, such as decision tree and random forest, have an important advantage over deep learning techniques in that they are much faster and require fewer computational resources (Sangani & Zarger, 2017). Specifically, we improve the botnet detection model proposed in Hoang & Nguyen (2018) by adding 7 new domain classification features in order to improve the detection rate and to reduce the false positive and false negative rates. The random forest machine learning algorithm is used to build the new model because it is relatively fast and has been shown, through cross-validation in the security field, to give higher accuracy than other traditional supervised algorithms, such as naive Bayes, kNN, SVM and decision tree (Alieyan et al., 2017; Li et al., 2017; Hoang & Nguyen, 2018).

3 The Improved DGA Botnet Detection Model

3.1 The Improved Detection Model

Fig. 2 presents the improved model for detecting DGA botnets. The new model consists of two stages: (a) the training stage and (b) the detection stage. In the training stage, the detection model is constructed from the training data. In the detection stage, the constructed model is used to classify each test domain name as either a legitimate or a botnet domain name.

The training stage, as shown in Fig. 2(a), has two steps:

(1) Feature extraction: The training set of legitimate and DGA botnet domain names is put into the feature extraction process, in which 24 classification features of each domain name are extracted. Each domain name is converted to a vector of 24 features and a class label. The final result of the feature extraction step is a 2-dimensional training data matrix of M domains and N features;
(2) Training: The training data matrix is used to build the ‘Classifier’, or detection model, using the random forest machine learning algorithm. In this step, the constructed model is also validated using the 10-fold cross-validation method, in which 80% of the data set is used for training and 20% of the data set is used for testing.

Fig. 2. The improved detection model: (a) the training stage and (b) the detection stage

The detection stage, as shown in Fig. 2(b), also has two steps:

(1) Feature extraction: The test domain names are put into the feature extraction process using the same procedure as in the training stage. Each test domain name is converted to a vector of 24 features;
(2) Classification: Each test domain’s vector is classified using the ‘Classifier’ built in the training stage. The result of this step is the predicted label of the test domain name, either legitimate or botnet.

3.2 Extraction of Classification Features

As mentioned in Section 3.1, 24 classification features are extracted for each domain name, as described in Table 2. These features include 17 features (from f1 to f17) proposed in (Hoang & Nguyen, 2018), 1 feature (f23) proposed in (Yu et al., 2017), 1 feature (f24) proposed in (Truong & Cheng, 2016) and 5 new features (f18 to f22).

Table 2. Domain classification features used in the proposed model

| No. | Feature name | Meaning | Calculation formula |
|---|---|---|---|
| f1 | count(d) | Number of 2-grams of domain name d that are also found in DS(2-gram). DS(2-gram) is the list of N most frequent 2-grams. | |
| f2 | m(d) | 2-gram frequency distribution of domain name d. f(i) is the total number of occurrences of 2-gram i in DS(2-gram) and index(i) is the rank of 2-gram i in TS(2-gram). TS(2-gram) is total possible number of 2-grams. | (1) |
| f3 | s(d) | 2-gram weight of domain name d. vt(i) is the rank of 2-gram i in DS(2-gram). | (2) |
| f4 | ma(d) | Average of 2-gram frequency distribution of domain name d. len(d) is the number of 2-grams in domain d. | ma(d) = m(d) / len(d) (3) |
| f5 | sa(d) | Average of 2-gram weight of domain name d. | sa(d) = s(d) / len(d) (4) |
| f6 | tan(d) | Average number of popular 2-grams of domain name d. | tan(d) = count(d) / len(d) (5) |
| f7 | taf(d) | Average frequency of popular 2-grams of the domain name d. | (6) |
| f8 | ent(d) | 2-gram entropy of domain name d. L = N is most frequent 2-grams. | (7) |
| f9 | count(d) | Number of 3-grams of domain name d that are also found in DS(3-gram). DS(3-gram) is the list of N most frequent 3-grams. | |
| f10 | m(d) | 3-gram frequency distribution of domain name d. f(i) is the total number of occurrences of 3-gram i in DS(3-gram) and index(i) is the rank of 3-gram i in TS(3-gram). TS(3-gram) is total possible number of 3-grams. | (8) |
| f11 | s(d) | 3-gram weight of domain name d. vt(i) is the rank of 3-gram i in DS(3-gram). | (9) |
| f12 | ma(d) | Average of 3-gram frequency distribution of domain name d. len(d) is the number of 3-grams in domain d. | ma(d) = m(d) / len(d) (10) |
| f13 | sa(d) | Average of 3-gram weight of domain name d. | sa(d) = s(d) / len(d) (11) |
| f14 | tan(d) | Average number of popular 3-grams of domain name d. | tan(d) = count(d) / len(d) (12) |
| f15 | taf(d) | Average frequency of popular 3-grams of the domain name d. | (13) |
| f16 | ent(d) | 3-gram entropy of domain name d. L = M is most frequent 3-grams. | (14) |
| f17 | tanv(d) | Vowel distribution of domain name d. countnv(d) is the number of vowels in domain d. len_char(d) is the number of characters of domain d. | tanv(d) = countnv(d) / len_char(d) (15) |
| f18 | tanco(d) | Consonant distribution of domain name d. countco(d) is the number of consonants in domain d. | tanco(d) = countco(d) / len_char(d) (16) |
| f19 | tandi(d) | Digit distribution of domain name d. countdi(d) is the number of digits in domain d. | tandi(d) = countdi(d) / len_char(d) (17) |
| f20 | tansc(d) | Special character distribution of domain name d. countsc(d) is the number of special characters in domain d. | tansc(d) = countsc(d) / len_char(d) (18) |
| f21 | tanhe(d) | Hexadecimal character distribution of domain name d. counthe(d) is the number of hexadecimal characters in domain d. | tanhe(d) = counthe(d) / len_char(d) (19) |
| f22 | is_digit | First character is digit or not. | 1 if first character is digit, 0 otherwise |
| f23 | ent_char(d) | Character entropy of domain d. D(x) is probability distribution of character x in domain d. | (20) |
| f24 | EOD(d) | Expected value of domain d. Domain d consists of k unique characters as {x1, x2,..., xk}. n(xi) is the occurrence frequency of character xi and p(xi) is probability distribution of character xi. p(xi) is calculated using 100,000 top domains listed by Alexa. | (21) |

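The ratio-style features in Table 2 whose formulas are given explicitly (f1, f6 and f17 to f22) are simple enough to sketch in code. The following Python sketch is illustrative only and rests on several assumptions that Table 2 does not state: the domain label is lowercased and its top-level domain is already removed, vowels are `a, e, i, o, u`, hexadecimal characters are `0-9` and `a-f`, special characters are anything that is neither an ASCII letter nor a digit, and count(d) counts every occurrence of a popular n-gram. The function names and the tiny `popular` set are hypothetical.

```python
import string

VOWELS = set("aeiou")                 # assumption: ASCII vowels only
HEX_CHARS = set("0123456789abcdef")   # assumption: lowercase hex alphabet


def char_features(domain: str) -> dict:
    """Compute the character-distribution features f17 to f22 for one label."""
    d = domain.lower()
    n = len(d)                        # len_char(d)
    if n == 0:
        raise ValueError("empty domain label")
    vowels = sum(c in VOWELS for c in d)
    letters = sum(c in string.ascii_lowercase for c in d)
    digits = sum(c in string.digits for c in d)
    return {
        "tanv": vowels / n,                            # f17
        "tanco": (letters - vowels) / n,               # f18
        "tandi": digits / n,                           # f19
        "tansc": (n - letters - digits) / n,           # f20
        "tanhe": sum(c in HEX_CHARS for c in d) / n,   # f21
        "is_digit": 1 if d[0] in string.digits else 0, # f22
    }


def popular_ngram_features(domain: str, popular: set, n: int = 2):
    """Return (count(d), tan(d)) for n-grams found in a popular n-gram set."""
    grams = [domain[i:i + n] for i in range(len(domain) - n + 1)]
    count = sum(g in popular for g in grams)
    return count, (count / len(grams) if grams else 0.0)


if __name__ == "__main__":
    print(char_features("a1b2-c3"))
    print(popular_ngram_features("google", {"go", "oo", "le"}))
```

In a full pipeline, the `popular` set would be replaced by the DS(2-gram) or DS(3-gram) list built from the legitimate training domains.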
3.3 Classification Measurements

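We use 6 measurements, including TPR, FPR, FNR, PPV, F1 and ACC, to measure the proposed model’s performance as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \times 100\% \tag{22}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \times 100\% \tag{23}$$

$$\mathrm{FNR} = \frac{FN}{FN + TP} \times 100\% \tag{24}$$

$$\mathrm{PPV} = \frac{TP}{TP + FP} \times 100\% \tag{25}$$

$$\mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN} \times 100\% \tag{26}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + FN + TN} \times 100\% \tag{27}$$

where TP, FP, FN and TN are elements of the confusion matrix given in Table 3.
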
Table 3. TP, FP, FN and TN in the confusion matrix

| | Actual class: Botnet | Actual class: Legitimate |
|---|---|---|
| Predicted class: Botnet | TP (True Positives) | FP (False Positives) |
| Predicted class: Legitimate | FN (False Negatives) | TN (True Negatives) |

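The six measurements can be computed directly from the four confusion-matrix counts. The sketch below assumes the standard confusion-matrix definitions of TPR, FPR, FNR, PPV, F1 and ACC and returns fractions rather than percentages; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return TPR, FPR, FNR, PPV, F1 and ACC as fractions in [0, 1]."""
    return {
        "TPR": tp / (tp + fn),                    # true positive rate (recall)
        "FPR": fp / (fp + tn),                    # false positive rate
        "FNR": fn / (fn + tp),                    # false negative rate
        "PPV": tp / (tp + fp),                    # positive predictive value (precision)
        "F1": 2 * tp / (2 * tp + fp + fn),        # harmonic mean of PPV and TPR
        "ACC": (tp + tn) / (tp + fp + fn + tn),   # overall accuracy
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, value in classification_metrics(tp=90, fp=5, fn=10, tn=95).items():
        print(f"{name}: {100 * value:.2f}%")
```

Note that TPR and FNR always sum to 1, so reporting both is redundant but makes the false alarm behaviour explicit.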
In addition, we use the Detection Rate (DR) to measure the effectiveness of the proposed detection model for classifying domain names of various botnets. The DR for each botnet type is calculated as follows:

$$\mathrm{DR} = \frac{\text{number of correctly detected domains}}{\text{total number of domains}} \times 100\% \tag{28}$$

4 Experiments and Results

4.1 Experimental Dataset

The experimental dataset is a combination of a subset of legitimate domain names and a subset of DGA botnet domain names, as follows:

• Top 100,000 legitimate domain names listed by Alexa (PN Pedia, n.d.). We manually downloaded and validated the list of domain names to remove duplicate domain names;
• 153,200 C&C server domain names of various DGA botnets listed in (Netlab 360, n.d.). These domain names were generated and used by common DGA botnets, such as banjori, emotet, gameover, murofet and necurs. From this subset, we use 100,000 domain names for training and 53,200 domain names for testing.

The 100,000 legitimate domains and 100,000 botnet domains are used for training, to construct and validate the ‘Classifier’, or detection model. The 53,200 botnet domains that are not in the training set are used for testing only.

4.2 Experimental Results

As mentioned in Section 4.1, the training set of 200,000 domain names is used to construct and validate the detection model using the random forest machine learning algorithm. We randomly take 80% of the dataset for training and the remaining 20% for testing. The 10-fold cross-validation method is used to compute the average results. Table 4 shows the proposed model’s detection performance compared to previous proposals. Table 5 shows the detection accuracy of the proposed model on the test dataset of 39 DGA botnets that generate and use a large number of algorithm-generated domain names. Based on the results in Table 5, we divide the DGA botnets into 3 groups by detection accuracy: the first group, with a detection accuracy of over 90%, is shown in Table 6; the second group, with a detection accuracy from 70% to below 90%, is shown in Table 7; and the last group, with very low detection accuracy, is shown in Table 8. The macro DR is the overall detection rate calculated as the average of all botnets’ DRs, while the micro DR is the ratio of the total number of correctly detected domains to the total number of testing domains.

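The difference between the micro and macro DR is easiest to see on a small group of botnets. The sketch below is a minimal Python illustration; the function name `detection_rates` is hypothetical, and the counts are those reported for the four botnets in the low-accuracy group (Table 8).

```python
def detection_rates(counts: dict):
    """counts maps botnet name -> (total domains, correctly detected).

    Returns (per-botnet DR, micro DR, macro DR), all as fractions.
    """
    per_botnet = {name: hit / total for name, (total, hit) in counts.items()}
    # Macro DR: unweighted mean of the per-botnet detection rates.
    macro = sum(per_botnet.values()) / len(per_botnet)
    # Micro DR: pooled ratio, so botnets with more test domains weigh more.
    total_hits = sum(hit for _, hit in counts.values())
    total_domains = sum(total for total, _ in counts.values())
    micro = total_hits / total_domains
    return per_botnet, micro, macro


# Counts as reported in Table 8: (total domains, correctly detected).
LOW_ACCURACY_GROUP = {
    "banjori": (4000, 0),
    "matsnu": (881, 10),
    "bigviktor": (999, 30),
    "suppobox": (2205, 21),
}

if __name__ == "__main__":
    _, micro, macro = detection_rates(LOW_ACCURACY_GROUP)
    print(f"micro DR = {100 * micro:.2f}%, macro DR = {100 * macro:.2f}%")
    # prints: micro DR = 0.75%, macro DR = 1.27%
```

The micro DR is lower here because banjori, the botnet with the most test domains, has no correctly detected domains and dominates the pooled ratio.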
Table 4. The proposed model’s detection performance versus other proposals

Approaches PPV TPR FPR FNR ACC F1
Truong & Cheng (2016)-J48 94.70 4.80 92.30
Hoang & Nguyen (2018)-RF 90.70 91.00 9.30 90.90 90.90
Qiao et al. (2019) 95.05 95.14 94.58
Zhao et al. (2019) 6.14 7.42 94.04
Our model-37 tree RF 97.08 96.98 2.92 3.02 97.03 97.03

Table 5. The proposed model’s detection accuracy on all DGA botnets

| No. | Botnet names | Total domains | Correctly detected | DR (%) |
|---|---|---|---|---|
| 1 | banjori | 4000 | 0 | 0 |
| 2 | emotet | 4000 | 3994 | 99.85 |
| 3 | gameover | 4000 | 4000 | 100 |
| 4 | murofet | 4000 | 3994 | 99.85 |
| 5 | necurs | 4000 | 3947 | 98.67 |
| 6 | pykspa_v1 | 4000 | 3621 | 90.53 |
| 7 | ramnit | 4000 | 3888 | 97.20 |
| 8 | ranbyus | 4000 | 3993 | 99.82 |
| 9 | rovnix | 4000 | 4000 | 100 |
| 10 | shiotob | 4000 | 3892 | 99.55 |
| 11 | simda | 4000 | 2944 | 73.60 |
| 12 | symmi | 1200 | 1162 | 96.83 |
| 13 | tinba | 4000 | 3951 | 98.77 |
| 14 | virut | 4000 | 3293 | 82.32 |
| 15 | mydoom | 50 | 46 | 92.00 |
| 16 | tinynuke | 32 | 31 | 96.88 |
| 17 | proslikefan | 100 | 84 | 84.00 |
| 18 | vidro | 100 | 98 | 98.00 |
| 19 | gspy | 100 | 91 | 91.00 |
| 20 | tempedreve | 195 | 172 | 88.21 |
| 21 | pykspa_v2_real | 199 | 178 | 89.45 |
| 22 | pykspa_v2_fake | 799 | 732 | 91.61 |
| 23 | padcrypt | 168 | 166 | 98.81 |
| 24 | fobber_v1 | 298 | 298 | 100 |
| 25 | fobber_v2 | 299 | 288 | 96.32 |
| 26 | conficker | 495 | 387 | 78.18 |
| 27 | nymaim | 480 | 415 | 86.46 |
| 28 | enviserv | 500 | 382 | 76.40 |
| 29 | vawtrak | 827 | 692 | 83.68 |
| 30 | dircrypt | 762 | 742 | 97.38 |
| 31 | matsnu | 881 | 10 | 1.14 |
| 32 | cryptolocker | 1000 | 990 | 99.00 |
| 33 | locky | 1158 | 1098 | 94.81 |
| 34 | chinad | 1000 | 1000 | 100 |
| 35 | bigviktor | 999 | 30 | 3.00 |
| 36 | shifu | 2546 | 2262 | 88.85 |
| 37 | qadars | 2000 | 1970 | 98.50 |
| 38 | dyre | 1000 | 980 | 98.00 |
| 39 | suppobox | 2205 | 21 | 0.95 |
| | Micro Overall | 71393 | 59842 | 83.82 |
| | Macro Overall | | | 83.83 |

Table 6. The proposed model’s over 90% detection accuracy on 25 DGA botnets

| No. | Botnet names | Total domains | Correctly detected | DR (%) |
|---|---|---|---|---|
| 1 | emotet | 4000 | 3994 | 99.85 |
| 2 | gameover | 4000 | 4000 | 100 |
| 3 | murofet | 4000 | 3994 | 99.85 |
| 4 | necurs | 4000 | 3947 | 98.67 |
| 5 | pykspa_v1 | 4000 | 3621 | 90.53 |
| 6 | ramnit | 4000 | 3888 | 97.20 |
| 7 | ranbyus | 4000 | 3993 | 99.82 |
| 8 | rovnix | 4000 | 4000 | 100 |
| 9 | shiotob | 4000 | 3892 | 99.55 |
| 10 | symmi | 1200 | 1162 | 96.83 |
| 11 | tinba | 4000 | 3951 | 98.77 |
| 12 | mydoom | 50 | 46 | 92.00 |
| 13 | tinynuke | 32 | 31 | 96.88 |
| 14 | vidro | 100 | 98 | 98.00 |
| 15 | gspy | 100 | 91 | 91.00 |
| 16 | pykspa_v2_fake | 799 | 732 | 91.61 |
| 17 | padcrypt | 168 | 166 | 98.81 |
| 18 | fobber_v1 | 298 | 298 | 100 |
| 19 | fobber_v2 | 299 | 288 | 96.32 |
| 20 | dircrypt | 762 | 742 | 97.38 |
| 21 | cryptolocker | 1000 | 990 | 99.00 |
| 22 | locky | 1158 | 1098 | 94.81 |
| 23 | chinad | 1000 | 1000 | 100 |
| 24 | qadars | 2000 | 1970 | 98.50 |
| 25 | dyre | 1000 | 980 | 98.00 |
| | Micro DR | 49966 | 48972 | 98.01 |
| | Macro DR | | | 97.34 |

Table 7. The proposed model’s over 70% detection accuracy on 10 DGA botnets

| No. | Botnet names | Total domains | Correctly detected | DR (%) |
|---|---|---|---|---|
| 1 | simda | 4000 | 2944 | 73.60 |
| 2 | virut | 4000 | 3293 | 82.32 |
| 3 | proslikefan | 100 | 84 | 84.00 |
| 4 | tempedreve | 195 | 172 | 88.21 |
| 5 | pykspa_v2_real | 199 | 178 | 89.45 |
| 6 | conficker | 495 | 387 | 78.18 |
| 7 | nymaim | 480 | 415 | 86.46 |
| 8 | enviserv | 500 | 382 | 76.40 |
| 9 | vawtrak | 827 | 692 | 83.68 |
| 10 | shifu | 2546 | 2262 | 88.85 |
| | Micro DR | 13342 | 10809 | 81.01 |
| | Macro DR | | | 83.11 |

Table 8. The 4 DGA botnets on which the proposed model has a low detection rate

No.  Botnet names     Total domains  Correctly detected  DR (%)
1    banjori          4000           0                   0
2    matsnu           881            10                  1.14
3    bigviktor        999            30                  3.00
4    suppobox         2205           21                  0.95
     Micro DR         8085           61                  0.75
     Macro DR                                            1.27

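The micro and macro DR rows in Tables 6 to 8 can be reproduced from the per-botnet counts. The following is a minimal Python sketch, assuming that the micro DR pools all domains of a group (correctly detected domains divided by total domains) and that the macro DR is the unweighted mean of the per-botnet DRs. The names `micro_dr`, `macro_dr` and `LOW_DR_GROUP` are illustrative and do not come from the paper; the counts are those of Table 8.

```python
# Per-botnet (total domains, correctly detected) counts copied from Table 8.
LOW_DR_GROUP = {
    "banjori": (4000, 0),
    "matsnu": (881, 10),
    "bigviktor": (999, 30),
    "suppobox": (2205, 21),
}


def micro_dr(group):
    """Pooled detection rate (%): all detected domains over all domains."""
    total = sum(t for t, _ in group.values())
    detected = sum(d for _, d in group.values())
    return 100.0 * detected / total


def macro_dr(group):
    """Unweighted mean of the per-botnet detection rates (%)."""
    rates = [100.0 * d / t for t, d in group.values()]
    return sum(rates) / len(rates)


print(f"Micro DR: {micro_dr(LOW_DR_GROUP):.2f}%")
print(f"Macro DR: {macro_dr(LOW_DR_GROUP):.2f}%")
```

Under these assumptions the sketch gives 0.75% and 1.27%, matching the last two rows of Table 8. The pooled rate is pulled down by large families such as 'banjori' (4000 domains, none detected), whereas the macro rate weighs every family equally, which is why the two values differ.
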
4.3 Discussion

Based on the experimental results shown in Tables 4 to 8, we make the following observations:

• The proposed detection model outperforms previous proposals on all measures: it produces a much higher overall accuracy and F1-score than previous models, as well as a much lower false positive rate (FPR) and false negative rate (FNR). For example, the F1-scores of Hoang & Nguyen (2018), Qiao et al. (2019) and our model are 90.90%, 94.58% and 97.03%, respectively.
• Our model effectively detects most DGA botnets, as shown in Table 4. Out of 39 DGA botnets, the malicious domain names used by 25 botnets are detected with a detection rate (DR) of over 90%, as shown in Table 6. Specifically, the micro and macro DRs of this botnet group are 98.01% and 97.34%, respectively. The 10 DGA botnets in the second group, shown in Table 7, also have relatively high micro and macro DRs of 81.01% and 83.11%, respectively. The proposed model performs well on these DGA botnets because the domain names they generate are generally different from legitimate domain names.
• As shown in Table 8, the proposed model fails to detect domain names used by 4 botnets: ‘banjori’, ‘matsnu’, ‘bigviktor’ and ‘suppobox’. Specifically, the model cannot detect any domain names generated by the ‘banjori’ botnet, and it detects only a few of the domain names generated by the ‘matsnu’, ‘bigviktor’ and ‘suppobox’ botnets (between 0.95% and 3.00%). This is because these botnets generate domain names that are very similar to legitimate domain names. Nevertheless, the botnets in this group account for only a small portion of the domain names generated and used by all DGA botnets.
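
As a consistency check on the three groups discussed above, the group totals in the Micro DR rows of Tables 6, 7 and 8 can be recombined into the overall micro DR of Table 5 and into each group's share of the botnet test domains. This is a minimal Python sketch; the dictionary keys and variable names are illustrative labels, not terms from the paper.

```python
# (total domains, correctly detected) per group, copied from the
# Micro DR rows of Table 6, Table 7 and Table 8.
GROUPS = {
    "DR over 90% (Table 6)": (49966, 48972),
    "DR 70-90% (Table 7)": (13342, 10809),
    "low DR (Table 8)": (8085, 61),
}

total_domains = sum(t for t, _ in GROUPS.values())
total_detected = sum(d for _, d in GROUPS.values())

# Pooled detection rate (%) over all 39 botnets.
overall_micro_dr = 100.0 * total_detected / total_domains

# Share (%) of the botnet test domains contributed by each group.
shares = {name: 100.0 * t / total_domains for name, (t, _) in GROUPS.items()}

print(f"{total_detected} of {total_domains} domains detected "
      f"({overall_micro_dr:.2f}%)")
for name, share in shares.items():
    print(f"{name}: {share:.2f}% of test domains")
```

With these counts the pooled rate agrees with the 83.82% Micro Overall value in Table 5, and the four low-DR botnets contribute about 11% of the botnet test domains.
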

5 Conclusion

This paper proposes an improved model for detecting DGA botnets based on the random forest machine learning algorithm. The proposed model enhances the botnet detection model proposed by Hoang & Nguyen (2018) by adding 7 classification features in order to increase the detection rate and reduce the false alarm rate. Experimental results confirm that our model performs much better than the original model (Hoang & Nguyen, 2018) and other proposals based on traditional supervised machine learning (Truong & Cheng, 2016), deep learning (Qiao et al., 2019) and statistical methods (Zhao et al., 2019). Moreover, our model effectively detects most DGA botnets in the test dataset: 25 of the 39 botnets are detected with a detection rate of over 90%.

For future work, we will continue to improve our model so that it can detect DGA botnets such as ‘banjori’, ‘matsnu’, ‘bigviktor’ and ‘suppobox’, which generate domain names similar to legitimate domain names.

References

Alieyan, K., Almomani, A., Manasrah, A., Kadhum, M.M. (2017). A survey of botnet detection based on DNS. Nat. Comput. Appl. Forum, pp. 1541–1558.

Antonakakis, M., Perdisci, R., Lee, W., Vasiloglou, N., Dagon, D. (2011). Detecting malware domains at the upper DNS hierarchy. In SEC'11: Proceedings of the 20th USENIX conference on Security.

Bilge, L., Kirda, E., Kruegel, C., Balduzzi, M. (2011). EXPOSURE: finding malicious domains using passive DNS analysis. In NDSS.

Chatrati, S.P., Hossain, G., Goyal, A., Bhan, A., Bhattacharya, S., Gaurav, D., Tiwari, S.M. (2020). Smart home health monitoring system for predicting type 2 diabetes and hypertension. Journal of King Saud University - Computer and Information Sciences. Jan 2020. https://doi.org/10.1016/j.jksuci.2020.01.010.

DN Pedia. (n.d.). Top Alexa one million domains. https://dnpedia.com/tlds/topm.php.

Eremin, A. (2019, March 29). Bots and botnets in 2018: Statistics on botnet attacks on clients of organizations. AO Kaspersky Lab. https://securelist.com/bots-and-botnets-in-2018/90091/

Jiang, N., Cao, J., Jin, Y., Li, L., Zhang, Z.L. (2010). Identifying suspicious activities through DNS failure graph analysis. In 18th IEEE international conference on network protocols (ICNP), pp. 144–153.

Kheir, N., Tran, F., Caron, P., Deschamps, N. (2014). Mentor: positive DNS reputation to skim-off benign domains in botnet C&C blacklists. In ICT systems security and privacy protection. Springer, Berlin, Heidelberg, pp. 1–14.

Gaurav, D., Shandilya, S., Tiwari, S., Goyal, A. (2020). A Machine Learning Method for Recognizing Invasive Content in Memes. In Villazón-Terrazas, B., Ortiz-Rodríguez, F., Tiwari, S.M., Shandilya, S.K. (eds) Knowledge Graphs and Semantic Web. KGSWC 2020. Communications in Computer and Information Science, vol. 1232. Springer, Cham. https://doi.org/10.1007/978-3-030-65384-2_15.

Hoang, X.D., Nguyen, Q.C. (2018). Botnet Detection Based on Machine Learning Techniques Using DNS Query Data. J. Future Internet, 2018, 10, 43. https://doi.org/10.3390/fi10050043.

Hostiadi, D.P., Wibisono, W., Ahmad, T. (2020). B-Corr Model for Bot Group Activity Detection Based on Network Flows Traffic Analysis. KSII Transactions on Internet and Information Systems, 14, 10, pp. 4176–4197. https://doi.org/10.3837/tiis.2020.10.014.

Li, X., Wang, J., Zhang, X. (2017). Botnet Detection Technology Based on DNS. J. Future Internet, 2017, 9, 55.

Netlab 360. (n.d.). DGA Families. Available online: https://data.netlab.360.com/dga/ (accessed on 10 August 2020).

Perdisci, R., Corona, I., Dagon, D., Lee, W. (2009). Detecting malicious flux service networks through passive analysis of recursive DNS traces. In IEEE Annual computer security applications conference (ACSAC’09), pp. 311–320.

Qiao, Y., Zhang, B., Zhang, W., Sangaiah, A.K., Wu, H. (2019). DGA Domain Name Classification Method Based on Long Short-Term Memory with Attention Mechanism. Appl. Sci. 2019, 9, 4205. https://doi.org/10.3390/app9204205.

Sangani, N.K., Zarger, H. (2017). Machine Learning in Application Security. Book chapter in "Advances in Security in Computing and Communications", IntechOpen.

Smith, A. (2019, October 23). More Destructive Botnets and Attack Vectors Are on Their Way. Radware Blog. https://blog.radware.com/security/botnets/2019/10/scan-exploit-control/

Spamhaus Malware Labs. (2020, January 28). Spamhaus Botnet Threat Report 2019. https://www.spamhaus.org/news/article/793/spamhaus-botnet-threat-report-2019

Stalmans, E., Irwin, B. (2011). A framework for DNS based detection and mitigation of malware infections on a network. In IEEE Information security South Africa (ISSA), pp. 1–8.

Symantec: Botnets now produce 95% of spam [Editorial]. (2010, August 24). American City Business Journals. https://www.bizjournals.com/sanjose/stories/2010/08/23/daily29.html.

Taheri, R., Ghahramani, M., Javidan, R., Shojafar, M., Pooranian, Z., Conti, M. (2020-1). Similarity-based Android Malware Detection Using Hamming Distance of Static Binary Features. Future Generation Computer Systems. Vol. 105, April 2020, pp. 230–247. https://doi.org/10.1016/j.future.2019.11.034.

Taheri, R., Javidan, R., Shojafar, M., Pooranian, Z., Miri, A., Conti, M. (2020-2). On defending against label flipping attacks on malware detection systems. Neural Computing and Applications. Vol. 32, July 2020, pp. 14781–14800. https://doi.org/10.1007/s00521-020-04831-9.

Truong, D.T., Cheng, G. (2016). Detecting domain-flux botnet based on DNS traffic features in managed network. Security Comm. Networks 2016; 9: pp. 2338–2347; John Wiley & Sons.

Villamarín-Salomón, R., Brustoloni, J.C. (2008). Identifying botnets using anomaly detection techniques applied to DNS traffic. In 5th IEEE consumer communications and networking conference (CCNC 2008), pp. 476–481.

Yadav, S., Reddy, A.K.K., Reddy, A.L.N., Ranjan, S. (2012). Detecting Algorithmically Generated Domain-Flux Attacks With DNS Traffic Analysis. IEEE/ACM Trans. Netw. 2012, 20, 1663–1677. https://doi.org/10.1109/TNET.2012.2184552.

Woodbridge, J., Anderson, H.S., Ahuja, A., Grant, D. (2016). Predicting Domain Generation Algorithms with Long Short-Term Memory Networks. arXiv 2016, arXiv:1611.00791.

Yu, B., Gray, D., Pan, J., De Cock, M., Nascimento, A. (2017). Inline DGA detection with deep networks. In IEEE International Conference on Data Mining Workshops (ICDMW), pp. 683–692.

Zhao, H., Chang, Z., Bao, G., Zeng, X. (2019). Malicious Domain Names Detection Algorithm Based on N-Gram. Journal of Computer Networks and Communications 2019. Vol. 2019, Hindawi. https://doi.org/10.1155/2019/4612474.
