Abstract—Malware is one of the most damaging security threats facing the Internet today. Despite the burgeoning literature, accurate detection of malware remains an elusive and challenging endeavor due to the increasing usage of payload encryption and sophisticated obfuscation methods. Also, the large variety of malware classes coupled with their rapid proliferation and polymorphic capabilities and imperfections of real-world data (noise, missing values, etc) continue to hinder the use of more sophisticated detection algorithms. This paper presents a novel machine learning based framework to detect known and newly emerging malware at a high precision using layer 3 and layer 4 network traffic features. The framework leverages the accuracy of supervised classification in detecting known classes with the adaptability of unsupervised learning in detecting new classes. It also introduces a tree-based feature transformation to overcome issues due to imperfections of the data and to construct more informative features for the malware detection task. We demonstrate the effectiveness of the framework using real network data from a large Internet service provider.

I. INTRODUCTION

Malware is a malicious software program deployed by attackers to compromise a computer system (including its applications, data, and network infrastructure) by exploiting its security vulnerabilities. Such programs have potential to disrupt the normal operations of the system, while giving the attackers unauthorized access to the system resources and allowing them to gather information covertly from users without their consent. Malware attacks have escalated in recent years with a profit motive in mind. These attacks, which include botnets, spyware, keystroke loggers, and trojans, have been used to commit identity theft, spamming, and phishing scams. Due to their rapid proliferation and increasing sophistication, the need for more advanced techniques to effectively detect malware attacks has become increasingly critical.

Known techniques for malware detection can be broadly classified into two categories, namely, signature-based and anomaly-based detection techniques [1][2]. Signature based detection employs a set of pre-defined signatures for known malware to determine whether the software program under inspection is malicious. For example, Sung et al. [3] created a signature based detection system called SAVE (Static Analyzer of Vicious Executables) that compares API sequences extracted from a program to those stored in a signature database. Programs whose signature matches those found in the database will be flagged as malware. The signatures for malware are often created manually, though automated methods based on n-gram extraction [4][5] have also been developed. A major limitation of the signature based approach is its failure to detect zero-day attacks, which are emerging threats previously unknown to the malware detector system. Furthermore, it fails to detect threats with evolving capabilities such as metamorphic and polymorphic malwares. Metamorphic malwares automatically reprogram themselves every time they are distributed or propagated. So, they will be difficult to capture with signature-based approach as their signatures have to be continuously evolved [6]. Similarly, polymorphic malwares are also difficult to identify as they self-mutate and use encryption to avoid detection. Methods that rely on inspection of payload signatures can be easily defeated by encryption or obfuscation techniques.

A related approach to signature-based detection is supervised classification, which uses instances of known malware to build a classification model that distinguishes the known threats from other programs. Supervised classification approaches also share the same limitation as signature-based detection in that they both perform poorly on new and evolving malware. In addition, building an accurate classification model is a challenge due to the diversity of malware classes, their imbalanced distribution, as well as the data imperfection issues (noise, missing values, and correlated features) that continue to hinder the adoption of more sophisticated learning algorithms.

On the other hand, anomaly-based detection techniques use the knowledge of what is considered as good behavior to identify malicious programs [7][8][9]. A special class of anomaly based detection is specification-based detection, which makes use of certain rule set of what is considered acceptable behavior of programs. Programs violating such rule set are declared as malicious. Statistical and unsupervised learning techniques for anomaly detection have also been extensively investigated in recent years. The key advantage of anomaly-based detection is its ability to detect zero-day attacks [10]. However, it is susceptible to producing a high false alarm rate, which means a large number of good programs will be wrongly identified as malicious.

Our goal is to detect and isolate malicious flows from the network traffic and further classify them as a specific type of the known malwares, variations of the known malwares or as a completely new malware strain. Towards this end, we develop a novel machine-learning based malware detection and classification system using the layer-3 and layer-4 network
of the behavior for two malwares, Conficker and Tidserv. For Conficker, the payload size of client-to-server communication is either 244 or 169 bytes, whereas for Tidserv it varies widely. However, if we consider the server-to-client communication, Conficker flows always have payload size 2817 or 109 bytes, whereas the payload size for Tidserv is always 0 or 244 bytes. The regularity observed in the layer-3/layer-4 features among the malicious flows suggests the possibility of using such features to effectively detect the known malware classes. This begs the question whether such features can also be effectively used to detect variants of the known malwares and new malwares. This would require sophisticated classifiers that can generalize beyond the classes observed in their training data. Feature selection [11] would also be an important part since the raw features may not be sufficient to detect such malwares.

Our proposed framework leverages the accuracy of supervised classification on known classes with the adaptability of unsupervised learning for new malware detection. Specifically, a two-level classifier is employed to address the issue of imbalanced class distribution due to the disproportionate number of non-malicious and malicious network flows. A macro-level classifier is initially used to isolate the malicious flows from non-malicious ones. Next, among those flows classified as malicious, micro-level classifiers based on one-class support vector machine (SVM) are applied to identify the specific strain of malware. To extend their ability to detect malware classes beyond those in the training set, a probabilistic class-based profiling method is developed. Using this “unsupervised” learning approach, we demonstrate the effectiveness of our framework in determining whether a malicious flow corresponds to a variant of the known malware or a new strain of malware. Furthermore, a tree-based feature transformation is performed to alleviate the data imperfection

belong to a few of the most dominant classes. Most classifiers are therefore biased toward detecting the larger classes while completely missing the rare ones.
• New class discovery: Not all classes are present at the time the classifier is initially trained. This makes it hard for conventional classifiers to detect new classes.
• Missing values: The features used to characterize the network flows may contain missing values. Though numerous strategies for handling missing values are available, our experiments suggest that they are ineffective when applied to the data collected in this domain.
• Noise in the training data: The training examples are often labeled using a signature-based intrusion detection system (IDS). Since the signatures do not provide a complete coverage for all malwares, training instances labeled as good are unreliable as they may include malwares whose signatures have not been created yet.

Due to the aforementioned challenges, existing techniques are insufficient to effectively classify both the known and new malwares. This paper seeks to develop a viable solution to overcome these challenges.

III. PROPOSED FRAMEWORK

The proposed framework is envisioned as a key component of a larger end-to-end security system that monitors the network flow and decides whether it is malicious or good (more precisely, unknown, because they may correspond to malicious flows that do not have a detection signature yet). Figure 2 shows the proposed system architecture, which consists of the following six major components: (i) data capture module, (ii) an intrusion detection/prevention system (IDS/IPS), (iii) information storage, (iv) feature extraction and transformation, (v) supervised classifier, and (vi) a UI portal.
2013 Proceedings IEEE INFOCOM
[Figure: system architecture diagram. Blocks: Data Capture (STA), IDS/IPS, Information Storage, Feature Extraction, Feature Transformation, Supervised Classifier. Labels: packets, threat label, labeled flows, unlabeled flows, training features, classifier output.]
instances from class i as possible. The outer sphere is constructed in such a way that instances that do not belong to class i lie outside of it. The radial distance between the two hyperspheres is called the classifier’s margin, which defines the objective function to be optimized by the 1-class SVM learning algorithm. For example, the hypersphere for predicting class k is obtained by solving the following [12]:

$$
\begin{aligned}
\min_{R_k, a_k, d_k, \xi_i, \xi_l}\quad & R_k^2 - M d_k^2 + \frac{C}{N_k}\sum_{i: y_i = k}\xi_i + \frac{C}{N_{\bar{k}}}\sum_{l: y_l \neq k}\xi_l \\
\text{subject to}\quad & \|x_i - a_k\|^2 \le R_k^2 + \xi_i, \quad \forall i: y_i = k \\
& \|x_l - a_k\|^2 \ge R_k^2 + d_k^2 - \xi_l, \quad \forall l: y_l \neq k \\
& \xi_i \ge 0, \quad \forall i: y_i = k \\
& \xi_l \ge 0, \quad \forall l: y_l \neq k
\end{aligned}
\tag{2}
$$

where $N_k$ and $N_{\bar{k}}$ are the number of instances that belong to the $k$-th class and its complement, respectively. Here, the positive examples are enumerated by indices $i$ and the negative examples by indices $l$. $\xi_i$ and $\xi_l$ are the slack variables while $C \ge 0$ controls the penalty for misclassification errors. The variables $a_k$ and $R_k$ represent the center and radius of the hypersphere constructed for the $k$-th class while $\sqrt{R_k^2 + d_k^2}$ defines the radius of the outer hypersphere. It can be shown that the margin width is given by $\sqrt{R_k^2 + d_k^2} - R_k$. To maximize the margin, we need to maximize $d_k^2$ and minimize $R_k^2$ simultaneously, and the parameter $M \ge 0$ controls the trade-off between those two terms. The Wolfe dual form of the preceding optimization problem is given below:

$$
\begin{aligned}
\max_{\alpha}\quad & \sum_{i: y_i = k}\alpha_i K(x_i, x_i) - \sum_{l: y_l \neq k}\alpha_l K(x_l, x_l) - \sum_{i,j: y_i, y_j = k}\alpha_i \alpha_j K(x_i, x_j) \\
& - \sum_{l,m: y_l, y_m \neq k}\alpha_l \alpha_m K(x_l, x_m) + 2\sum_{i,l: y_i = k, y_l \neq k}\alpha_i \alpha_l K(x_i, x_l) \\
\text{subject to}\quad & \sum_{i: y_i = k}\alpha_i = 1 + M, \quad 0 \le \alpha_i \le \frac{C}{N_k} \\
& \sum_{l: y_l \neq k}\alpha_l = M, \quad 0 \le \alpha_l \le \frac{C}{N_{\bar{k}}}
\end{aligned}
\tag{3}
$$

where $K(x, x')$ is a kernel function that measures the similarity between the pair of instances, $x$ and $x'$. Widely used kernel functions include

RBF Kernel: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$

Polynomial Kernel: $K(x, x') = (1 + x^T x')^q$

The optimization problem given in (3) can be solved using quadratic programming. One problem with using standard kernel functions such as RBF and polynomial is that they assume the feature vectors $x$ do not contain any missing values. In addition, the features are uncorrelated and are equally important when determining the similarity between instances. To circumvent this limitation, our framework uses the tree-based features described in the previous section to learn a weighted kernel matrix for building the hyperspheres. The kernel matrix is induced from a ground truth kernel $G$, which is defined as follows:

$$
G(x_i, x_j) = \begin{cases} +1 & \text{if } y_i = y_j, \\ -1 & \text{otherwise} \end{cases}
$$

Intuitively, the goal here is to align the weighted linear kernel $K(x, x') = x^T W x'$ such that it is consistent with the class similarity matrix between training instances $G$. This is accomplished by minimizing the following objective function:

$$
L = -\sum_{ij} G_{ij} (X W X^T)_{ij} + \frac{\lambda}{2}\sum_{ij} W_{ij}^2 \tag{4}
$$

The first term is minimized when the matrices $G$ and $X W X^T$ are in agreement with each other. The second term $W_{ij}^2$ is a regularizer term added to keep the model parsimonious¹. We solve for $W$ by taking the partial derivative of the objective function with respect to $W$ and equating it to zero.

$$
\frac{\partial L}{\partial W_{pq}} = -\sum_{ij} G_{ij} (X_{ip} X^T_{qj}) + \lambda W_{pq} = 0 \tag{5}
$$

¹ We set λ = 1 for our experiments.

After rearranging the equation, we have $W = X^T G X$. Thus the similarity between any test instance $x^*$ to a training instance $x$ is given by the following weighted linear kernel

WL Kernel: $K(x, x^*) = \mathrm{sign}(x^T W x^*)$

The weighted linear kernel will be used to learn the corresponding hyperspheres for each 1-class SVM micro-classifier. The next section explains how the hyperspheres are collectively used to determine whether an incoming flow is a known malware, a variant of the known malware, or a completely new malware strain.

C. Probabilistic Class-based Profiling for New Malware Discovery

The preceding 1-class SVM formulation constructs hyperspheres based on the classes observed in the training set. Such an approach would fail to detect new variants of malware due to the absence of instances from the new malware class in the training set. This section describes our enhancement to the formulation to enable the discovery of new classes. Specifically, new malware will be detected in two ways.

First, consider a new malware whose flow-based characteristics are significantly different than those of the existing malware. In order to detect such malware, we first compute the distance of a test instance to the centers of each hypersphere. If the test instance lies outside the hyperspheres for all the classes, it is predicted to be a new class.

Second, consider a new malware that is a variant of existing ones, cleverly manipulated to avoid detection. Due to the
inequality constraints imposed when building the hyperspheres (see Equation (2)), we expect instances of the known malware to reside only within a single hypersphere. Thus, if an instance is located near the center of several hyperspheres, this is potentially a new variant of malware evolving from existing ones. Nevertheless, there are also cases where instances of known malware are enclosed within its own hypersphere as well as the hyperspheres for a few other classes (though, they are likely to be located far away from the center of the hyperspheres for other classes). By comparing the distance profile of the instances to all the hyperspheres, we conjecture that it is possible to distinguish a known malware from variants of the malware. However, comparing the distances to the centers of different hyperspheres is not meaningful as each hypersphere is defined in a completely different tree-transformed subspace.

To overcome this problem, we propose a technique that compares the distance distribution of a test instance to the centers of all hyperspheres. Let $(a_j, R_j)$ be the center and radius of the hypersphere constructed for class $j$ and $D_j$ denote the set of radial distances between training instances from class $j$ to the center of their corresponding hypersphere:

$$
D_j = \{ d_i = \|x_i - a_j\|_2 \mid y_i = j;\ d_i \le R_j \}.
$$

Algorithm 2 Detection of New and Known Malwares
    Input:  C = {(a_i, R_i)}, i = 1..k, where a_i is the center and R_i is the radius of the hypersphere for class i.
            Π = {(μ_i, σ_i)}, i = 1..k, set of profile distributions.
            x, input test instance.
    Output: y, predicted label for instance x.

    P = ∅
    for i = 1 to k do
        d[i] = ||x − C_i||_2
        if d[i] < R_i then
            P = P ∪ {d[i]}
    if P = ∅ then
        return (NewClass)
    else
        for i = 1 to k do
            pvalues[i] = SignificanceLevel(d[i], μ_i, σ_i)
        score = ComputeScore(pvalues)
        if score > τ then
            return (NewVariantClass)
        else
            y = argmax_i pvalues[i];
            return y
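An added example, not code from the paper: the control flow of Algorithm 2 in Python over constructed two-class data. The forms of SignificanceLevel (Gaussian upper tail) and ComputeScore (second-largest p-value) are not defined in this text and are assumptions here, as is the single shared feature space and the threshold value τ = 0.05.

```python
import math

# Constructed hyperspheres C = {class: (center a_i, radius R_i)}. Assumption:
# both classes share one 2-D feature space; the paper builds each sphere in
# its own tree-transformed subspace.
SPHERES = {"A": ((0.0, 0.0), 3.0), "B": ((4.0, 0.0), 3.0)}

# Constructed profiles Pi = {class: (mu_i, sigma_i)}: mean and standard
# deviation of the training radial distances D_i to that class's center.
PROFILES = {"A": (1.5, 0.5), "B": (1.5, 0.5)}


def radial_distance(x, center):
    """Euclidean distance ||x - a_i||_2 from the instance to a center."""
    return math.sqrt(sum((xi - ai) ** 2 for xi, ai in zip(x, center)))


def significance_level(d, mu, sigma):
    """Assumed form: Gaussian upper tail P(D >= d) under N(mu, sigma^2).

    A value near 1 means x is closer to the center than most training
    instances of the class; a value near 0 means it sits on the fringe
    of the class or outside it.
    """
    z = (d - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def compute_score(pvalues):
    """Assumed form: the second-largest p-value.

    It is high only when x fits the distance profile of at least two
    classes, i.e. it lies near the center of several hyperspheres.
    """
    ranked = sorted(pvalues.values(), reverse=True)
    return ranked[1] if len(ranked) > 1 else 0.0


def detect(x, spheres, profiles, tau):
    """Follow the steps of Algorithm 2 for one test instance x."""
    d = {c: radial_distance(x, a) for c, (a, _) in spheres.items()}
    inside = [c for c, (_, r) in spheres.items() if d[c] < r]  # the set P
    if not inside:
        return "NewClass"          # outside every hypersphere
    pvalues = {c: significance_level(d[c], *profiles[c]) for c in spheres}
    if compute_score(pvalues) > tau:
        return "NewVariantClass"   # plausible member of several classes
    return max(pvalues, key=pvalues.get)


def classify_samples(tau=0.05):
    """Run the detector over four constructed flows."""
    samples = {
        "core of A": (1.0, 0.0),
        "core of B": (4.5, 0.5),
        "overlap of A and B": (2.0, 0.0),
        "far from both": (10.0, 10.0),
    }
    return {name: detect(x, SPHERES, PROFILES, tau)
            for name, x in samples.items()}


if __name__ == "__main__":
    for name, label in classify_samples().items():
        print(f"{name:20s} -> {label}")
```

Raising τ trades missed variants for fewer false alarms on known classes; the paper sets τ to yield a 10% false alarm rate.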
TABLE II
NUMBER OF TRAINING AND TEST INSTANCES FOR EACH MALWARE CLASS.

Class ID (Name)                        # Train    # Test
0 (good)                               2103       210402
1 (MS SQL Server 2000 Attack)          151        151
2 (Trojan Asprox)                      0          3
3 (Fake App / AV Website)              0          1
4 (Gammima Request Activity)           0          2
5 (MSN Messenger Login Attack)         0          4
6 (Tor Activity)                       0          3
7 (Trojan Sasfis)                      0          6
8 (Trojan zbot Post Install)           0          18
9 (W32 Downadup Downloader)            68         67
10 (W32.Rontokbro)                     0          1
11 (W32.Sality Activity 2)             0          4
12 (W32.Sality Activity Download)      0          2
13 (W32.Sality Activity)               94         94
14 (Zbot Activity)                     0          4
15 (Hotbar Updates)                    0          1
16 (Jabber IM Client Connection)       29         29
17 (MSN Messenger Login Attempt)       28         28
18 (Malicious Site)                    0          11
19 (NetBIOS NBStat Query)              72         71
20 (P2P Ares Client Connection)        0          13
21 (P2P BitTorrent Activity 1)         750        750
22 (P2P BitTorrent Activity 2)         685        685
23 (P2P Downadup Activity)             98         98
25 (P2P Edonkey Ping Message)          0          22
26 (P2P Gnutella Connection)           56         55
27 (Skype Requesting Updates)          0          12
28 (Bot C&C Connection 2)              0          3
29 (Bot C&C Connection 3)              0          2
30 (Bot C&C Connection)                27         27
32 (HTTP Tidserv Download Req)         0          1
33 (Tidserv Activity 2)                0          1
34 (Tidserv Activity)                  0          35
35 (Trojan SpyEye Activity 1)          0          4
36 (Phoenix Toolkit File Download)     0          1
37 (Yahoo! IM Login Attack)            45         44
38 (Unknown 1)                         0          38

of continuous and categorical types), (2) correlated features, and (3) missing values due to inapplicable features (e.g., the features sz9 and sz10 are inapplicable to flows that contain less than 9 packets).

We use a reputed commercial IDS/IPS system to generate the class label for each flow by analyzing their corresponding payload. The IDS will label a flow if its payload matches a defined signature of a class. In our data, the IDS system has identified 38 different types of malicious flows listed in Table II. The flows that were unlabeled by the IDS system are assigned to the “good” (unknown) category. The majority of these 38 classes are known high-risk malwares, including Sality, Conficker (Downadup), Tidserv, Trojans, Fake AV Attacks, C & C Bots, etc. However, this table also includes some mid or low-risk suspicious activities, such as Failed IM login attempts, Hotbar Updates Activity, etc. and unwanted user activities, such as P2P BitTorrent Activity. We have included such classes in our data as often System Administrators of Enterprise networks identify such activities and try to block them. So, it will be really beneficial if we can identify such classes along with the more malicious malwares.

The whole dataset contains 216,899 flows, out of which only 4,394 of them (2%) were labeled as malicious and categorized into one of the 38 known classes. It should be noted that some of the “good” flows may actually be malicious as they were not detected by the current IDS system. Thus, our experimental study reports only the classification performance on the 38 classes since the performance on the “good” class may not be reliable. The data is partitioned into two disjoint sets, one for model building (training set) and the other for model evaluation (test set). To simulate new class discovery, some of the malware classes were withheld from the training set and appear only in the test set. More specifically, the training set contains 2,103 malicious flows from the 12 most prevalent malware classes with an equal number of flows belonging to the “good” class. The remainder of the network flows were then assigned to the test set. The test set includes 2,099 malicious flows belonging to the 12 malware classes in the training set as well as 192 flows belonging to 24 “new” malware classes that are not present in the training set. The class distribution is summarized in Table II.

B. Comparing Tree-Based Feature Transformation against Missing Value Imputation

Our framework employs a collection of one-class SVM classifiers to distinguish each type of malware from other types in the data set. Since one-class SVM relies on the similarity values (encoded in the kernel matrix) computed between every pair of instances in the data set, the performance of the classifier can be degraded by poor handling of any imperfections in the data (such as noise, missing values, or collinearity among the features). To address this problem, we perform a nonlinear transformation on the original feature space using the tree-based approach described in Section III-A. To validate the effectiveness of this strategy, we compared our tree-based transformation approach against standard data imputation methods used for handling missing values in the data.

Ideally, a reliable kernel matrix should reflect the class similarity between the data instances, i.e., instances that belong to the same class should be more similar to each other than to those belonging to other classes. One way to evaluate the effectiveness of the various approaches in handling the data imperfection issues is by providing their kernel matrices to the 1-nearest neighbor classifier and comparing their classification performance. A summary of the methods used to compute the kernel matrix for the network flow data is given below:
• Original: The original feature values generated by Narus Semantic Traffic Analyzer as shown in Table I are non-negative. Missing values can be annotated as -1 or other invalid values that lie outside the range of permissible values for that feature. In this approach, features that contain missing values are ignored when computing the distance or similarity between two data instances. For example, the radial basis function kernel between two instances $x$ and $x'$ is given as follows

$$
K(x, x') = \exp\left(-\frac{\sum_{x_i \neq -1,\, x'_i \neq -1} (x_i - x'_i)^2}{2\sigma^2 N_{xx'}}\right),
$$
TABLE III
CONFUSION MATRIX FOR MACRO-LEVEL CLASSIFICATION USING RANDOM FOREST.

(e.g., via cross validation). The RBF kernel function can

[Figure: plot with vertical axis "True Positive Rate"]
TABLE IV
COMPARISON BETWEEN F1 MEASURE FOR DETECTING KNOWN AND NEW CLASSES USING RANDOM FORESTS AND 1-CLASS SVM WITH TREE-BASED FEATURES.

              without Profile     with Profile
Class         RBF      WL         RBF      WL       Random Forest
New           0.38     0.46       0.42     0.50     0.00
Known Classes
1             1.00     1.00       1.00     1.00     0.99
9             1.00     1.00       0.99     1.00     1.00
13            0.86     0.85       0.80     0.87     0.71
16            0.98     0.98       0.91     0.98     0.96
17            0.93     0.93       0.70     0.93     0.89
19            0.99     0.88       0.93     0.88     0.98
21            0.93     0.90       0.89     0.90     0.80
22            0.98     0.97       0.96     0.97     0.90
23            0.97     0.90       0.98     0.90     0.98
26            0.98     0.96       0.93     0.96     0.97
30            0.68     0.54       0.62     0.26     0.51
37            0.96     0.99       0.48     0.99     0.92

original features (with mean value imputation), (2) RBF-tree and WL-tree correspond to 1-class SVM models constructed from the tree-based features (without profiling), and (3) RBF-tree-profile and WL-tree-profile correspond to 1-class SVM models with tree-based features and profiling (for new class discovery). Clearly, the tree-based features give higher area under the ROC curve (AUC) compared to the mean imputed raw features. The profiling technique also gives significantly higher AUC than those without profiling. These results validated the effectiveness of the proposed tree transformation and profiling techniques for 1-class SVM.

In addition, we also evaluated the performance of the classifier on the known malware. For this experiment, we set the threshold τ for new class detection in such a way that yields a 10% false alarm rate. We compared the performance of the 1-class SVM models against the random forest classifier (trained on all the known malware classes). Table IV shows the precision and recall values for each class along with their corresponding F1 measure. The results suggest that the supervised weighted linear kernel gives the highest F1 measure for new class detection without compromising the F1 values of the known classes. In addition, the profiling method increases the F1 measure for new class detection for both RBF and WL kernels.

V. CONCLUSION

This paper presents a malware detection approach based on features derived from the layer 3 and layer 4 network flow characteristics. Though the features are more resilient to payload encryption, they have other issues (missing values and correlated features) that hamper the applicability of more sophisticated learning algorithms. Furthermore, the supervised nature of the algorithm makes it difficult for detecting new malware. Our proposed approach addresses these challenges and identifies flows of existing and novel malwares with very high precision. First, a tree-based feature transformation approach is developed to handle the data imperfection issues. A weighted linear kernel based on tree-based features is also proposed to effectively detect the malware classes. Finally, we present a novel adaptation of 1-class SVM to identify new types of malware.

For future work, we plan to extend the formulation to an online learning setting. In addition, when the number of classes becomes extremely large, the prediction step is expensive due to the large number of hyperspheres that must be tested. To address this problem, we plan to develop a hierarchical multi-class learning approach where the hyperspheres are organized in a hierarchical structure, thereby reducing the number of hyperspheres to be tested when predicting the class label of a test instance.

REFERENCES

[1] M. Christodorescu, “Semantics-aware malware detection,” in IEEE Symposium on Security and Privacy, 2005, pp. 32–46.
[2] N. Idika and A. P. Mathur, “A survey of malware detection techniques,” Purdue University, Technical Report SERC-TR-286, 2007.
[3] A. Sung, J. Xu, P. Chavez, and S. Mukkamala, “Static analyzer of vicious executables,” in Annual Computer Security Appl Conf, 2004, pp. 326–334.
[4] J. Kephart and B. Arnold, “Automatic extraction of computer virus signatures,” in Proceedings of the 4th Virus Bulletin International Conference, 1994, pp. 178–184.
[5] T. Abou-Assaleh, N. Cercone, V. Keselj, and R. Sweidan, “Detection of new malicious code using n-grams signatures,” in Proc of the 2nd Annual Conf on Privacy, Security, and Trust, 2004, pp. 193–196.
[6] A. Lakhotia, A. Kapoor, and E. Kumar, “Are metamorphic viruses really invincible,” Virus Bulletin, vol. 12, p. 57, 2004.
[7] M. Bailey, J. Andersen, Z. Morleymao, and F. Jahanian, “Automated classification and analysis of internet malware,” in Proc of Recent Advances in Intrusion Detection, 2007.
[8] V. Chandola, E. Eilertson, L. Ertoz, G. Simon, and V. Kumar, Data mining and cyber security. Springer, 2006, pp. 83–103.
[9] G. Gu, R. Perdisci, J. Zhang, and W. Lee, “Botminer: Clustering analysis of network traffic for protocol and structure independent botnet detection,” in USENIX Security Symposium, 2008, pp. 139–154.
[10] N. Weaver, V. Paxson, S. Staniford, and R. Cunningham, “A taxonomy of computer worms,” in Proc of the 2003 ACM Workshop on Rapid Malcode, 2003, pp. 11–18.
[11] A. H. Sung and S. Mukkamala, “Identifying important features for intrusion detection using support vector machines and neural networks,” in Symposium on Applications and the Internet (SAINT), 2003.
[12] P.-Y. Hao, J.-H. Chiang, and Y.-H. Lin, “A new maximal margin spherical structured multi class support vector machine,” Applied Intelligence, vol. 30, pp. 98–111, 2009.
[13] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.