You are on page 1of 13

US008239948B1

(12) United States Patent (10) Patent N0.: US 8,239,948 B1


Grif?n et al. (45) Date of Patent: Aug. 7, 2012

(54) SELECTING MALWARE SIGNATURES TO 16, 2010] Retrieved from the Internet<URL:http://ieeexplore.ieee.
REDUCE FALSE-POSITIVE DETECTIONS org/ie15/9088/28847/01297626.pdt?isnumber:288 47ICNF
&amumber:1297626&arSt:+600&ared:+603
(75) Inventors: Kent E. Grif?n, Santa Monica, CA &arAuthor:Peter+Shaohua+Deng%3B+Jau
(US); Tzi-cker Chiueh, Los Angeles, HWang+Wang%3B+Wen-Gong+Shieh%3B+Chin
CA (US); Scott Schneider, Santa Pin+Yen%3B+Cheng-Tan+Tung>.
Monica, CA (US); Xin Hu, AnnArbor, Faulty NOD32 Virus Signature Triggers Alarm When Visiting Web,
MI (US) Jul. 2, 2007, The H Security, 5 pages, [Online] [Retrieved on Mar. 12,
2010] Retrieved from the Internet<URL :http://WWW.heise-online.co.
(73) Assignee: Symantec Corporation, Mountain View, uk/security/Faulty-NOD32-virus-signature-triggers-alarm-When
CA (US) visiting-Web--/neWs/92032>.
Kephart, J .O. et al., Automatic Extraction of Computer Virus Sig
(*) Notice: Subject to any disclaimer, the term of this natures, [Online] [Retrieved on Mar. 12, 2010] Retrieved from the
Internet<URL :http ://WWW. research. ibm. com/antiviru s/ SciPapers/
patent is extended or adjusted under 35 Kephart/VB94/vb94 .htrnl>.
U.S.C. 154(b) by 757 days. Kumar, S. et al., A Generic Virus Scanner in C++, Sep. 17, 1992, pp.
1-21, [Online] [Retrieved on Mar. 12, 2010] Retrieved from the
(21) Appl.No.: 12/340,420 Internet<URL :https ://WWW. cerias .purdue .edu/assets/pdf/bibtexi
archive/92 -0 1 .pdf.
(22) Filed: Dec. 19, 2008 Landesman, M., What is aVirus Signature? About.com, [Online]
[Retrieved on Mar. 12, 2010] Retrieved from the
(51) Int. Cl. Internet<URL:http://antivirus.about.com/od/Whatisavirus/a/virus
G06F 11/00 (2006.01) signature.htrn>.
(52) US. Cl. ........................... .. 726/24; 726/22; 707/200
Stallings, W., Cryptography and Network Security: Principles and
Practice, [Online] [Retrieved on Mar. 12, 2010] Retrieved from the
(58) Field of Classi?cation Search .................. .. 726/24, Internet<URL :http://books. goo gle.com/b00kS?1Cl:SEWCT725 lRgC
726/22; 707/200 &pg:PA61 1&lpg:PA61 1&dq:generic+code+virus+ signature
See application ?le for complete search history. &sourceqveb&ots:EWvSmv372Q
&sig:cnjxFOHZyAaPD9y0AW1gMiFo6WM&hl:en&sa:X
(56) References Cited &oi:bookiresult&resnum?i&ct:result>.

U.S. PATENT DOCUMENTS (Continued)


7,950,059 B2 * 5/2011 Aharon et al. ................ .. 726/24
Primary Examiner * Haresh N Patel
2006/0161986 A1* 7/2006 Singh et al. .. 726/24
2007/0038677 A1 * 2/2007 Reasor et al. 707/200 Assistant Examiner * Dao Ho
2007/0192863 A1* 8/2007 Kapoor et al. .. .. 726/23 (74) Attorney, Agent, or Firm * FenWick & West LLP
2009/0320133 A1* 12/2009 Viljoen et a1. .. .. 726/24
2009/0328221 A1* 12/2009 Blum?eld et al. .. .. 726/24
2010/0150006 A1* 6/2010 PourZandi et a1. .......... .. 370/252
(57) ABSTRACT
A set of candidate signatures for a malicious software (mal
OTHER PUBLICATIONS
Ware) is generated. The candidate signatures in the set are
Davis, D., How Can Cisco Automatic Signature Extraction Prevent scored based on features that indicate the signatures are more
Zero-Day Virus Attacks? Aug. 15, 2008, TechRepublic, 5 pages, unique and thus less likely to generically occur non-malicious
[Online] [Retrieved on Mar. 12, 2010] Retrieved from the programs. A malWare signature for the malWare entity is
Internet<URL:http://blogs.techrepublic.com.com/networking/ selected from among the candidate malWare signatures based
?p:628>. on the scores. The selected malWare signature is stored.
Deng, P.S. et al., Intelligent Automatic Malicious Code Signatures
Extraction, IEEE, 2003, pp. 600-603, [Online] [Retrieved on Mar. 20 Claims, 5 Drawing Sheets

Client Client
m m

Client
m

Security Server
m
US 8,239,948 B1
Page 2

OTHER PUBLICATIONS Retrieved from the Internet<URL:http://WWW.agusbl0g.c0m/


d / h t-' - - ' - ' tu - -th -t'l1- d-3.htm>.
What is aVirus Signature? Are They Still Used? Agustins Blog, WOT press W a ls a Vlrus Slgna re are ey S 1 use
Aug. 10, 2008, 9 pages, [Online] [Retrieved on Mar. 12, 2010] * Cited by examiner
US. Patent Aug. 7, 2012 Sheet 1 of5 US 8,239,948 B1

E26 3
E26 3

E26 3 65:3m0a a
US. Patent Aug. 7, 2012 Sheet 2 0f 5 US 8,239,948 B1

mg
L].
US. Patent Aug. 7, 2012 Sheet 3 0f 5 US 8,239,948 B1

310 320
GOODWARE MALWARE
DATABASE DATABASE

340
33 GOODWARE
FEATURES ANALYSIS
DATABASE MODULE

350 360
MALWARE MALWARE
SIGNATURE SIGNATURE
DATABASE MODULE

1 10
SECURITY
SERvER

FIG. 3
US. Patent Aug. 7, 2012 Sheet 4 0f 5 US 8,239,948 B1

410
420
CANDIDATE
SIGNATURE
SIGNATURE
SCORING
DETERMINATION
MODULE
MODULE

430
SIGNATURE
SELECTION
MODULE

360
MALWARE SIGNATURE
MODULE

FIG. 4
US. Patent Aug. 7, 2012 Sheet 5 of5 US 8,239,948 B1

GENERATE CANDIDATE
MALWARE SIG NATU RES
lK-S'IO

I
SCORE CANDIDATE Ir- 520
MALWARE SIGNATURES

I
SELECT MALWARE r- 530
SIGNATURE FROM I
CANDIDATE MALWARE
SIGNATURES

I
STORE SELECTED MALWARE 540
SIGNATURE I

FIG. 5
US 8,239,948 B1
1 2
SELECTING MALWARE SIGNATURES TO The code comprises a candidate signature determination
REDUCE FALSE-POSITIVE DETECTIONS module con?gured to generate a set of candidate signatures
for the malWare entity. The code further comprises a signature
BACKGROUND OF THE INVENTION scoring module con?gured to score each candidate signature
in the set. The score for a candidate signature indicating a
1. Field of the Invention likelihood of Whether features present in the candidate signa
This invention pertains in general to computer security and ture are found in a set of non-malicious softWare. The code
in particular to the development of signatures to accurately further comprise a signature selection module con?gured to
identify malicious softWare. select a malWare signature for the malWare entity from among
2. Description of the Related Art the candidate signatures in the set based on the scores and
There is a Wide variety of malicious software (malWare) store the selected malWare signature.
that can attack modern computers. MalWare threats include Embodiments of the computer-implemented system com
computer viruses, Worms, Trojan horse programs, spyWare, prise a computer processor and a computer-readable storage
adWare, crimeWare, and phishing Websites. Modern malWare medium storing computer program modules con?gured to
is often designed to provide ?nancial gain to the attacker. For execute on the computer processor. The computer program
example, malWare can stealthily capture important informa modules comprise a candidate signature determination mod
tion such as logins, passWords, bank account identi?ers, and ule con?gured to generate a set of candidate signatures for the
credit card numbers. Similarly, the malWare can provide hid malWare entity. The computer program modules further com
den interfaces that alloW the attacker to access and control the prise a signature scoring module con?gured to score each
compromised computer. 20 candidate signature in the set, the score for a candidate sig
One method used to detect malWare is to identify malWare nature indicating a likelihood of Whether features present in
signatures. MalWare signatures contain data describing char the candidate signature are found in a set of non-malicious
acteristics of knoWn malWare and are used to determine softWare. Additionally, the computer program modules com
Whether an entity such as a computer ?le or a softWare appli prise a signature selection module con?gured to select a mal
cation contains malWare. Typically, a set of malWare signa 25 Ware signature for the malWare entity from among the candi
tures is generated by a provider of security software and is date signatures in the set based on the scores and store the
deployed to security softWare on a users computer. This set selected malWare signature.
of malWare signatures is then used by the security softWare to The features and advantages described in this summary and
scan the users computer for malWare. the folloWing detailed description are not all-inclusive. Many
During malWare signature generation, malWare signatures 30 additional features and advantages Will be apparent to one of
are typically validated against entities that are knoWn to not ordinary skill in the art in vieW of the draWings, speci?cation,
contain malWare, herein referred to as goodWare, in order to and claims hereof.
ensure that the malWare signatures do not generate false posi
tive identi?cations of malWare. In other Words, the malWare BRIEF DESCRIPTION OF THE DRAWINGS
signatures are validated to ensure they do not falsely identify 35
goodWare as malWare. Typically, a malWare signature is ?rst FIG. 1 is a high-level block diagram of a computing envi
generated by a security analyst or a computer and then com ronment according to one embodiment.
pared to a dataset of goodWare in order to determine Whether FIG. 2 is a high-level block diagram illustrating a typical
the malWare signature generates false positive identi?cations computer for use as a security server or a client according to
of malWare. Due to the large siZe of the dataset of all knoWn 40 one embodiment.
goodWare and the rapidly expanding number of malWare FIG. 3 is a high-level block diagram illustrating a detailed
?les, generating malWare signatures for the malWare ?les and vieW of a security server according to one embodiment.
comparing these signatures to a dataset of goodWare to iden FIG. 4 is a high-level block diagram illustrating a detailed
tify malWare signatures that do not result in false positive vieW of a malWare signature module according to one
identi?cations in malWare has become increasingly dif?cult. 45 embodiment.
Accordingly, there is a need in the art for Ways to generate FIG. 5 is a ?owchart illustrating steps performed by the
malWare signatures that are unlikely to cause false positive security server to determine a malWare signature for mali
detections. cious software according to one embodiment.
The ?gures depict an embodiment of the present invention
BRIEF SUMMARY 50 for purposes of illustration only. One skilled in the art Will
readily recogniZe from the folloWing description that altema
The above and other needs are met by a computer-imple tive embodiments of the structures and methods illustrated
mented method, a computer program product and a computer herein may be employed Without departing from the prin
system for selecting a signature for a malWare entity. One ciples of the invention described herein.
embodiment of the computer-implemented method generates 55
a set of candidate signatures for the malWare entity. The DETAILED DESCRIPTION
computer-implemented method scores each candidate signa
ture in the set. The score for a candidate signature indicates a FIG. 1 is a high-level block diagram of a computing envi
likelihood of Whether features present in the candidate signa ronment 100 according to one embodiment. FIG. 1 illustrates
ture are found in a set of non-malicious softWare. The com 60 a security server 110 and three clients 150 connected by a
puter-implemented method selects a malWare signature for netWork 114. Only three clients 150 are shoWn in FIG. 1 in
the malWare entity from among the candidate signatures in order to simplify and clarify the description. Embodiments of
the set based on the scores. The computer-implemented the computing environment 100 can have thousands or mil
method then stores the selected malWare signature. lions of clients 150 connected to the netWork 114.
Embodiments of the computer program product have a 65 Generally, the security server 110 generates malWare sig
computer-readable storage medium storing computer-ex natures for knoWn malWare. A signature is any characteristic
ecutable code for selecting a signature for a malWare entity. such as a pattern, metadata or sequence associated With an
US 8,239,948 B1
3 4
entity (e. g., software applications or executable ?les) that can larly, the netWorking protocols used on the netWork 114 can
be used to accurately identify the entity as malWare. In the include the transmission control protocol/Internet protocol
embodiments discussed herein, a malWare signature for (TCP/IP), the hypertext transport protocol (HTTP), the
detecting a particular malWare entity contains a sequence of simple mail transfer protocol (SMTP), the ?le transfer proto
code derived from that entity. In order to generate the malWare col (FTP), etc. The data exchanged over the netWork 114 can
signature, the security server 110 analyZes the malWare to be represented using technologies and/or formats including
determine multiple candidate malWare signatures from the the hypertext markup language (HTML), the extensible
malWare itself. In one embodiment, the candidate malWare markup language (XML), etc. In addition, all or some of links
signatures are sequences of code that refer to an ordered set of can be encrypted using conventional encryption technologies
one or more data elements, such as computer processor such as the secure sockets layer (SSL), Secure HTTP and/or
instructions, occurring Within the malWare. virtual private netWorks (VPNs). In another embodiment, the
The security server 110 scores the candidate malWare sig entities can use custom and/or dedicated data communica
natures associated With the malWare in order to select one or tions technologies instead of, or in addition to, the ones
more signatures for deployment to the clients 150. In one described above.
embodiment, the candidate signatures are scored based on FIG. 2 is a high-level block diagram illustrating a typical
features present in the candidate signatures. The features computer 200 for use as a security server 110 or client 150.
represent characteristics (of, e.g., computer instructions) that Illustrated are a processor 202 coupled to a bus 204. Also
also appear in goodWare. The score assigned to a candidate coupled to the bus 204 are a memory 206, a storage device
malWare signature indicates the likelihood that the features 208, a keyboard 210, a graphics adapter 212, a pointing
present in the candidate malWare signature are also present in 20 device 214, and a netWork adapter 216. A display 218 is
goodWare, With a higher score indicating that the features are coupled to the graphics adapter 212.
less likely to be present. In other Words, the score assigned to The processor 202 may be any general-purpose processor
a candidate malWare signature represents an interesting such as an INTEL x86 compatible-CPU. The storage device
ness or uniqueness metric that represents a likelihood that 208 is, in one embodiment, a computer-readable storage
the same features are not present in the goodWare. By scoring 25 medium such as a hard disk drive but can also be any other
the candidate malWare signatures in this manner, it is possible device capable of storing data, such as a Writeable compact
to identify candidate malWare signatures that are less likely to disk (CD) or DVD, or a solid-state memory device. The
generate false positive detections in goodWare. In one memory 206 may be any computer-readable storage medium,
embodiment, the highest-scoring candidate signature for a such as, ?rmWare, read-only memory (ROM), non-volatile
piece of malWare is selected for use and deployment to clients 30 random access memory (NVRAM), and/or RAM, and holds
150 because it represents the malWare signature that is least computer executable program instructions and data used by
likely to generate a false-positive detection. the processor 202. The pointing device 214 may be a mouse,
The security server 110 interacts With the clients 150 via track ball, or other type of pointing device, and is used in
the netWork 114. The security server 110 deploys a set of combination With the keyboard 210 to input data into the
malWare signatures to the clients 150. The clients 150 use the 35 computer 200. The graphics adapter 212 displays images and
malWare signatures in conjunction With security softWare to other information on the display 218. The netWork adapter
identify malWare. In one embodiment, the clients 150 execute 216 couples the computer 200 to the netWork 114.
security softWare provided by the security server 110 to scan As is knoWn in the art, the computer 200 is adapted to
the clients 150 for entities such as softWare applications or execute computer program modules. As used herein, the term
?les Which correspond to (e.g., have the sequences found in) 40 module refers to computer program logic and/or data for
the malWare signatures. providing the speci?ed functionality. A module can be imple
In one embodiment, a client 150 is a computer used by one mented in hardWare, ?rmWare, and/or softWare. In one
or more users to perform activities including doWnloading, embodiment, the modules are stored on the storage device
installing, and/ or executing softWare applications. The client 208, loaded into the memory 206, and executed by the pro
150, for example, can be a personal computer executing a Web 45 cessor 202.
broWser such as MICROSOFT INTERNET EXPLORER that The types of computers 200 utiliZed by the entities of FIG.
alloWs the user to retrieve and display content from Web 1 can vary depending upon the embodiment and the process
servers and other computers on the netWork 114. In other ing poWer utiliZed by the entity. For example, a client 150 that
embodiments, the client 150 is a netWork-capable device is a mobile telephone typically has limited processing poWer,
other than a computer, such as a personal digital assistant 50 a small display 218, and might lack a pointing device 214. The
(PDA), a mobile telephone, a pager, a television set-top security server 110, in contrast, may comprise multiple blade
box, etc. For purposes of this description, the term client servers Working together to provide the functionality
also includes computers such as servers and gateWays that described herein.
encounter softWare applications or other entities that might FIG. 3 is a high-level block diagram illustrating a detailed
constitute malWare or other threats. For example, a client 150 55 vieW of the security server 110 according to one embodiment.
can be a netWork gateWay located betWeen an enterprise As shoWn in FIG. 3, the security server 110 includes multiple
netWork and the Internet. modules. Those of skill in the art Will recogniZe that other
The netWork 114 represents the communication pathWays embodiments of the security server 110 can have different
betWeen the security server 110 and clients 150. In one and/or other modules than the ones described here, and that
embodiment, the netWork 114 is the Internet. The netWork 60 the functionalities can be distributed among the modules in a
114 can also utiliZe dedicated or private communications different manner.
links that are not necessarily part of the Internet. In one The goodWare database 310 stores a set of knoWn good
embodiment, the netWork 114 uses standard communications Ware entities referred to as a goodWare dataset. The set of
technologies and/or protocols. Thus, the netWork 114 can goodWare entities can range from one goodWare entity to
include links using technologies such as Ethernet, 802.11, 65 millions of goodWare entities. A goodWare entity is an entity
integrated services digital netWork (ISDN), digital subscriber such as a ?le or softWare application that is knoWn not to be
line (DSL), asynchronous transfer mode (ATM), etc. Simi malWare. The goodWare dataset includes executable ?les of
US 8,239,948 B1
5 6
the goodWare entities that contain executable code formed of Embodiments of the goodWare analysis module 340 can
data and computer processor instructions. analyZe other features of the goodWare in addition to, or
The malWare database 320 stores a set of knoWn malWare instead of, the features described above.
entities referred to as a malWare dataset. A malWare entity is The features database 330 stores data describing features
an entity such as a ?le or software application that exhibits that are interesting in the sense that the features can be used
malicious behavior such as a computer virus or computer to determine Whether a feature found in malWare is likely to
Worm. The set of malWare entities can range from one mal also be present in goodWare. In one embodiment, the features
Ware entity to millions of malWare entities. Similar to the database 330 stores data derived from the goodWare by the
goodWare dataset, the malWare dataset includes executable goodWare analysis module 340. These data include data
?les of the malWare entities that contain executable code. describing the relative frequency of immediate operand val
ues, and data indicating Whether given address offsets are
In one embodiment, a goodWare analysis module 340 ana
considered large.
lyZes features of the executable code of the goodWare in the The features database 330 can also store data from sources
goodWare dataset. According to one embodiment, the good other than the goodWare analysis module 340. For example,
Ware analysis module 340 comprises a disassembler, such as the features database 330 can include a list of math and logic
the IDA PRO disassembler available from Hex-Rays SA of instructions that are interesting because such instructions
Liege, Belgium. The goodWare analysis module 340 uses the occur infrequently in the goodWare or because sequences
disassembler to disassemble the executable ?les in the good containing the instructions are likely to be uncommon. In one
Ware data set in order to translate machine code in the execut embodiment, certain idioms of logic and math instructions
able ?les into assembly language sequences. Disassembling 20 are excluded from the list. For example, although XOR
the goodWare in this manner exposes certain features that are (exclusive or) may often be a logical operation or computa
analyZed by the goodWare analysis module 340. tion of interest, the idiom xor regl, regl used in x86 com
In one embodiment, a feature analyZed by the goodWare puter architecture to set the register regl to Zero is a com
analysis module 340 is the commonality of immediate oper mon instruction and is of no interest. LikeWise, the features
and values. An immediate operand is an operand that is 25 database 330 can include a list of function calls that occur
directly encoded as part of a machine instruction. The good infrequently in goodWare. Other embodiments store data
Ware analysis module 340 analyZes immediate operands describing additional and/ or different features.
occurring in the disassembled goodWare and determines the The malWare signature database 350 stores a set of mal
frequency at Which given values of immediate operands Ware signatures used to detect malWare. As previously men
occur. Certain immediate operand values, such as all ones or
30 tioned, a signature is any characteristic such as a pattern,
metadata or sequence associated With an entity that can be
all Zeros, are likely to occur frequently, While other operand
used to accurately identify that the entity is malware. In the
values are likely to occur less frequently. Thus, a high fre
embodiments discussed herein, the malWare signatures con
quency of occurrence of an immediate operand value in the
tain sequences derived from knoWn malWare entities.
goodWare dataset suggests that the value is a common. In 35 The malWare signature module 360 generates malWare
contrast, a loW frequency of occurrence of a value in the signatures for the knoWn malWare entities in the malWare
goodWare dataset suggests that the value is unusual. The database 320. In one embodiment, the malWare signature
threshold for determining What frequency constitutes a high module 360 generates multiple candidate malWare signatures
or loW frequency of occurrence can be set by a security for a given malWare entity. The malWare signature module
analyst. 40 360 scores the candidate malWare signatures based on the
In one embodiment, the goodWare analysis module 340 signatures interestingness. Then, the malWare signature
determines Whether an immediate operand value is unusual or module 360 selects from among the candidate malWare sig
interesting based on the context in Which the value is used natures based on the scores to select one or more malWare
Within the goodWare. For example, if an immediate operand signatures that are used to detect the malWare entity. In one
value contains a relative or absolute address that is subject to 45 embodiment, the malWare signature module 360 also deploys
relocation or change depending on the location of Where the the selected malWare signatures for the malWare entities to the
executable is loaded in memory, the goodWare analysis mod clients 150. This deployment can occur, for example, When
ule 340 determines that the immediate operand value is less updates are made to the set of malWare signatures, When the
interesting or common since application programs frequently malWare signature module 360 generates and stores a neW
access data and code addresses Within their oWn address 50 malWare signature in the malWare signature database 350, or
space. When requested by a client 150.
Another feature analyZed by the goodWare analysis mod FIG. 4 is a high-level block diagram illustrating a detailed
ule 340 is the address offsets used in [base+offset] addressing vieW of the malWare signature module 360 according to one
by the goodWare stored in the goodWare database 310. In one embodiment. As shoWn in FIG. 4, the malWare signature
embodiment, the goodWare analysis module 340 determines 55 module 360 includes multiple modules. Those of skill in the
these addresses by disassembling goodWare. The goodWare art Will recogniZe that other embodiments of the malWare
analysis module 340 speci?cally examines the siZe of the signature module 360 can have different and/or other mod
offsets to determine the frequency of occurrence of offsets of ules than the ones described here, and that the functionalities
various siZes. By observing the address offsets used in [base+ can be distributed among the modules in a different manner.
offset] addressing by the goodWare, the goodWare analysis 60 A candidate signature determination (CSD) module 410
module 340 can determine the typical (e.g., average) siZe of generates candidate malWare signatures for the malWare enti
offsets and determine Which offset siZes are large. Offsets ties in the malWare database 320. In one embodiment, the
of larger than average siZe typically indicate that the good CSD module 410 uses a disassembler to disassemble an
Ware is indexing into a large data structure. Such large data executable ?le of a malWare entity in order to generate a
structures are often unique to individual goodWare entities. 65 sequence of assembly language instructions. The CSD mod
Thus, the presence of a large address offset is unusual and ule 410 generates candidate malWare signatures formed of
interesting. subsequences of the sequence. In one embodiment, the CSD
US 8,239,948 B1
7 8
module 410 processes the sequence of assembly language arguments on a stack for local function calls or for instruc
instructions using a sliding WindoW of ?xed length to gener tions used to analyZe return values from the local functions
ate (e.g., produce) a set of subsequences representing the calls.
candidate malWare signatures. According to one embodi The signature scoring module 420 further analyZes the
ment, the length of the sliding WindoW is large enough to ?t instructions Within the candidate malWare signature to deter
multiple assembly language instructions, such as 48 bytes. mine Whether the candidate malWare signature includes math
The candidate malWare signatures can be stored in the mal and logic instructions. In one embodiment, a list of such math
Ware database 320 or the malWare signature database 350, and logic instructions is stored in the features database 330.
depending upon the embodiment. For each math and logic instruction in the candidate malWare
The signature scoring module 420 scores the candidate signature, the signature scoring module 420 determines
malWare signatures determined by the CSD module 410. In Whether the instruction appears in the stored list. If the math
one embodiment, the signature scoring module 420 operates and logic instruction is in the list, the candidate malWare
on a set of candidate malWare signatures for a given malWare signature receives a point. Math and logic instructions are
entity in order to alloW for selection of a malWare signature interesting because they typically represent the portions of
based on the scores for the set of candidate malWare signa the code that are performing the Work of the malWare entity
tures. For clarity, this description describes the scoring pro (as opposed, e.g., to performing standard housekeeping func
cess With respect to a single candidate malWare signature, and tions). Therefore, a sequence of math and logic instructions is
it Will be understood that the process can be applied across a intrinsically unlikely to appear in the goodWare dataset and
set of candidate malWare signatures. can be considered interesting.
Generally, the signature scoring module 420 examines fea 20 Additionally, the signature scoring module 420 analyZes
tures of a candidate malWare signature and assigns points to the instructions Within the candidate malWare signature to
the signature based on the presence of certain features it. The determine Whether it contains any unusual address offsets in
signature scoring module 420 sums the points to produce a [base+offset] addressing. For each instance of such address
score for the candidate malWare signature. In one embodi ing performed by the instructions in the candidate malWare
ment, points are assigned for features that are interesting in 25 signature, the signature scoring module 420 accesses the data
the sense that the features are unlikely to occur in the good in the features database 330 to determine Whether the offset
Ware dataset. Thus, the score for the candidate malWare sig value is considered large. The candidate malWare signature
nature represents the signatures overall interestingness receives a point for each instance of the addressing that
and also indicates the likelihood that the features of the can includes a large offset.
didate malWare signature Will not be found in the goodWare 30 In one embodiment, the signature scoring module 420
dataset. In one embodiment, the signature scoring module applies different Weights to the various features of a candidate
420 assigns points based on features including Whether the malWare signature in order to increase or decrease the Weights
candidate malWare signature contains unusual immediate of certain features. In other Words, rather than a candidate
operands, Whether the instructions Within the candidate mal signature receiving only a point for each instance of an
Ware signature make local function calls, Whether the candi 35 unusual immediate operand, a local function call, a logic and
date malWare signature includes logic and math instructions, math instruction, or an unusual address offset, the signature
and Whether the candidate malWare signature includes any scoring module 420 can apply different points based on the
unusual address offsets in instructions performing [base+ feature that is present in the candidate signature. For example,
offset] addressing. the signature scoring module 420 can apply tWo points to the
The signature scoring module 420 analyZes the instruc 40 score for each local function call found Within the candidate
tions Within the candidate malWare signature to identify malWare signature, thereby Weighting local function calls
instructions using immediate operands. For each immediate more than the other features. Similar to the Weighting for
operand found, the signature scoring module 420 accesses the local function calls, the signature scoring module 420 can
data stored in the features database 330 describing the relative apply tWo points for each math and logic instruction Within
frequency of immediate operand values to determine Whether 45 the candidate malWare signature. In one embodiment, infre
the operand found in the instruction Within the candidate quent immediate operands and any unusual address offsets in
malWare signature occurs at a loW frequency. In one embodi [base+offset] addressing are Weighted less than local function
ment, the signature scoring module 420 assigns the candidate calls and math and logic functions. In this embodiment, the
malWare signature a point for each infrequently occurring signature scoring module 420 applies only a point to each
immediate operand value found Within the candidate signa 50 occurrence of an infrequent immediate operand and a point to
ture. each occurrence of an infrequent unusual address offset in the
The signature scoring module 420 also analyZes the candidate malWare signature.
instructions Within the candidate malWare signature to deter The signature selection module 430 selects malWare sig
mine Whether the instructions make local function calls. natures that are used to detect malWare entities stored in the
Local function calls tend to be calls to functions that Were 55 malWare database 320. In one embodiment, the signature
Written speci?cally for the malWare entity and implement selection module 430 selects one or more malWare signatures
core functionality of the malWare entity. Local function calls for deployment to clients 150 that are used to detect a given
are contrasted With system function calls, Which tend to be malWare entity. The selected malWare signature for a given
calls to library functions and other code not speci?c to the malWare represents a signature for the malWare that is least
malWare entity. Thus, a local function call is an indicator of 60 likely to generate a false-positive detection in goodWare.
interestingness because the called local functions are In one embodiment, the signature selection module 430
unlikely to be found in the goodWare dataset. In one embodi selects a malWare signature from among a set of candidate
ment, the signature scoring module 420 assigns the candidate signatures based on the scores associated With the set. In one
malWare signature a point for each local function call occur embodiment, the highest-scoring candidate malWare signa
ring therein. In one embodiment, the signature scoring mod 65 ture for a piece of malWare is selected for use and deployment
ule 420 may also assign a point for instructions associated to clients 150 because it represents the malWare signature that
With local function calls such as instructions used to marshal is least likely to generate a false-positive detection. Altema
US 8,239,948 B1
9 10
tively, the signature selection module 430 applies a score forming the set of candidate signatures from subsequences
threshold to determine one or more malWare signatures for a of the produced sequence.
given malWare. The threshold is used to remove candidate 3. The computer-implemented method of claim 1, Wherein
malWare signatures Which are too generic. Generic candidate the set of features for the computer program instructions
signatures receive feWer points due to the lack of interesting contained Within the candidate signature further comprises a
features. In one embodiment, the highest-scoring candidate local function call made by the computer program instruc
malWare signature from among the candidate malWare signa tions contained Within the candidate signature.
tures scoring above the threshold is selected as the malWare 4. The computer-implemented method of claim 1, Wherein
signature for the piece of malWare. The signature selection the set of features for the computer program instructions
module 430 stores the selected signature for the given mal further comprises a logic and mathematical instruction
Ware in the malWare signature database 350. appearing in the computer program instructions contained
FIG. 6 is a ?owchart illustrating steps performed by the Within the candidate signature.
security server 110 to generate a malWare signature for mal 5. The computer-implemented method of claim 1, Wherein
Ware. Other embodiments perform the illustrated steps in the set of features of the computer program instructions fur
different orders, and/or perform different or additional steps. ther comprises an address offset used in [base+offset]
Moreover, some of the steps can be performed by modules or addressing by the computer program instructions contained
modules other than the security server 110. Within the candidate signature and the method further com
In one embodiment, the security server 110 generates 510 prising:
a set of candidate malWare signatures for a given malWare. determining Whether the address offset exceeds a thresh
The security server 110 scores 520 the candidate malWare 20 old, the threshold determined responsive to address off
signatures in the set based on the presence of features Within sets used by computer program instructions in the set of
the candidate malWare signatures Which are unlikely to be non-malicious softWare.
found in goodWare. The security server 110 selects 530 a 6. The computer-implemented method of claim 1, Wherein
malWare signature from among the candidate malWare signa generating, for each candidate signature, the score comprises:
tures based on the scores associated With the candidate mal 25 Weighting different features in the set of features With
Ware signatures. Once the malWare signature is selected, the different Weights; and
security server 110 stores 540 the selected malWare signature Wherein the score for each candidate signature is based on
in the malWare signature database 350 from Where it can be the different Weighted features in the set of features.
deployed to clients 150. 7. The computer-implemented method of claim 1, Wherein
The above description is included to illustrate to a security 30 selecting a malWare signature comprises:
server 110 according to one embodiment. Other embodi comparing the scores for the candidate signatures in the set
ments the operation of certain embodiments and is not meant to a threshold; and
to limit the scope of the invention. The scope of the invention selecting a candidate signature With a score above the
is to be limited only by the folloWing claims. From the above threshold as the malWare signature.
discussion, many variations Will be apparent to one skilled in 35 8. The computer-implemented method of claim 1, Wherein
the relevant art that Would yet be encompassed by the spirit selecting a malWare signature comprises selecting a candi
and scope of the invention. date signature With a highest score as the malWare signature.
9. A computer program product comprising a non-transi
The invention claimed: tory computer-readable storage medium storing computer
1. A computer-implemented method for selecting a signa 40 executable code for selecting a signature for a malWare entity,
ture for a malWare entity, the method comprising: the code comprising:
using a computer to perform steps comprising: a candidate signature determination module con?gured to
generating a set of candidate signatures for the malWare generate a set of candidate signatures for the malWare
entity; entity;
identifying, for each candidate signature, a set of fea 45 a signature scoring module con?gured to:
tures for computer program instructions contained identify, for each candidate signature, a set of features
Within the candidate signature, the set of features for computer program instructions contained Within
comprising an immediate operand value used by the the candidate signature, the set of features comprising
computer program instructions; an immediate operand value used by the computer
determining, for each candidate signature, Whether the 50 program instructions;
features in the set of features are likely to appear in a determine, for each candidate signature, Whether the
set of non-malicious softWare, the determination features in the set of features are likely to appear in a
comprising determining a frequency at Which the set of non-malicious softWare, the determination
immediate operand value appears in the set of non comprising determining a frequency at Which the
malicious softWare; 55 immediate operand value appears in the set of non
generating, for each candidate signature, a score for the malicious softWare;
candidate signature responsive to Whether the fea generate, for each candidate signature, a score for the
tures of the computer program instructions are likely candidate signature responsive to Whether the fea
to appear in the set of non-malicious softWare; tures of the computer program instructions are likely
selecting a malWare signature for the malWare entity 60 to appear in the set of non-malicious softWare; and
from among the candidate signatures in the set based a signature selection module con?gured to select a mal
on the scores; and Ware signature for the malWare entity from among the
storing the selected malWare signature. candidate signatures in the set based on the scores and
2. The computer-implemented method of claim 1, Wherein store the selected malWare signature.
generating a set of candidate signatures comprises: 10. The computer program product of claim 9, Wherein the
producing a sequence of computer program instructions candidate signature determination module is further con?g
that represents the malWare entity; and ured to:
US 8,239,948 B1
11 12
produce a sequence of computer program instructions that comprising determining a frequency at Which the
represents the malWare entity; and immediate operand value appears in the set of non
form the set of candidate signatures from subsequences of malicious softWare;
the produced sequence. generate, for each candidate signature, a score for the
11. The computer program product of claim 9, Wherein the candidate signature responsive to Whether the fea
signature scoring module is further con?gured to: tures of the computer program instructions are
Weight different features in the set of features With different likely to appear in the set of non-malicious soft
Weights; and Ware; and
Wherein the score for each candidate signature is based on a signature selection module con?gured to select a mal
the different Weighted features in the set of features. Ware signature for the malWare entity from among the
12. The computer program product of claim 9, Wherein the candidate signatures in the set based on the scores and
signature selection module is further con?gured to: store the selected malWare signature.
compare the scores for the candidate signatures in the set to 16. The computer-implemented system of claim 15,
a threshold; and Wherein the candidate signature determination module is fur
select a candidate signature With a score above the thresh ther con?gured to:
old as the malWare signature. produce a sequence of computer program instructions that
13. The computer program product of claim 9, Wherein the represents the malWare entity; and
set of features for the computer program instructions con form the set of candidate signatures from subsequences of
tained Within the candidate signature further comprises a the produced sequence.
local function call made by the computer program instruc 20 17. The computer-implemented system of claim 15,
tions contained Within the candidate signature. Wherein the signature scoring module is further con?gured to:
14. The computer program product of claim 9, Wherein the Weight different features in the set of features With different
set of features for the computer program instructions further Weights; and
comprises a logic and mathematical instruction appearing in Wherein the score for each candidate signature is based on
the computer program instructions contained Within the can 25 the different Weighted features in the set of features.
didate signature. 18. The computer-implemented system of claim 15,
15. A computer system for selecting a signature for a mal Wherein the signature selection module is further con?gured
Ware entity, the system comprising: to:
a computer processor; and compare the scores for the candidate signatures in the set to
a computer-readable storage medium storing computer 30 a threshold; and
program modules con?gured to execute on the computer select a candidate signature With a score above the thresh
processor, the computer program modules comprising: old as the malWare signature.
a candidate signature determination module con?gured 19. The computer system of claim 15, Wherein the set of
to generate a set of candidate signatures for the mal features for the computer program instructions contained
Ware entity; 35 Within the candidate signature further comprises a local func
a signature scoring module con?gured to: tion call made by the computer program instructions con
identify, for each candidate signature, a set of features tained Within the candidate signature.
for computer program instructions contained 20. The computer system of claim 15, Wherein the set of
Within the candidate signature, the set of features features for the computer program instructions further com
comprising an immediate operand value used by 40 prises a logic and mathematical instruction appearing in the
the computer program instructions; computer program instructions contained Within the candi
determine, for each candidate signature, Whether the date signature.
features in the set of features are likely to appear in
a set of non-malicious softWare, the determination

You might also like