
Automatic Virus Analysis in the Digital Immune System

IBM T. J. Watson Research Center
30 Saw Mill River Rd.
Hawthorne, NY 10532
U.S.A.
email: swimmer@acm.org

Abstract
Automatic virus analysis is an important component of the IBM/Sym-
antec Digital Immune System[?]. It attempts to determine whether a
given object is a computer virus and creates antigen if it is. It must do so
without human intervention in order to respond to a virus threat faster
than the virus can spread.
Using this system, we are able to respond to new viruses automatically.
In this paper, I discuss our implementation and why it works so well for
this type of security threat. I end by discussing the barriers that need to
be overcome in order to deal with other security threats in a similar manner.

1 Malware
The Digital Immune System is designed specifically to deal with the threat of
computer viruses. However, viruses are merely a subset of all malicious software
(Malware), which in turn is a subset of all software. Malware accounts for a
large proportion of computer security incidents, if not the vast majority, but
it affects Microsoft Windows systems 1 almost exclusively.
Malware is loosely defined as software with malicious intent. If we take the
entire set of software (see figure 1), we assume the vast majority of available
software is not programmed with any malicious intent. Some of this software
contains noticeable bugs that may or may not cause some damage. However,
we normally assume the intent is benign, and do not consider such anomalies
malicious.
A small portion of the available software was programmed with malicious
intent. This is what we call Malware. However, this term is imprecisely defined.
In order to detect Malware, we need to define a measurable property, with which
we can detect it.
Of all Malware, viruses are the most precisely defined, having a property,
which we will call the virus property, that is defined to be malicious. I define
worms as a type of virus, as they share this virus property. That leaves Trojan
horses. These are problematic in that there is no single property we can use to
identify a Trojan horse.

1 This is due nearly entirely to the pervasiveness of this family of operating systems, and
not to its design.

Figure 1: Malware (diagram labels: Good, Bad, Bugs, Trojans, Viruses)

Before returning to Trojan horses, I will explore viruses in more detail. Cohen’s
PhD thesis [?] and book [?] remain the most rigorous theoretical treatment
of viruses to this day. In his thesis, Cohen constructs a mathematical
definition of the virus property for a Turing machine, which is very general and
comprehensive. However, for practical purposes we use a definition like this one:

Def 1 (Computer Virus) A Computer Virus is a routine or program that
can infect other programs or routines by modifying them or their environment so that
running the program or routine will result in the execution of a possibly modified
copy of the virus.

There are other similar definitions, some more precise and some less precise than
the one I gave. In discussing viruses, I will also refer to the virus property for
convenience as the property of self-replication.
N.B.: These definitions assume that you can define “program” and “routine”
for a given operating system. For a system where such entities do not exist, we
would need a different definition. The definition also implies that the virus may
modify itself while copying, and the modified version must likewise exhibit the
virus property.
The definition is usually extended to include the creation of new executable
objects, i.e., without requiring an existing host executable to modify. In this
case, it is vital that these new objects are likely to be executed. This will depend
heavily on the operating system and typical user behavior.
This definition has false positives built in. A “copy” program may copy
itself, and therefore exhibits the virus property. By convention, we do not call
such programs viruses, although, by definition, we should. We must identify
such exceptions to the rule and prevent these benign programs from being flagged
as viruses.
Based on a more general and formal definition, Dr. Cohen proved, amongst
other things, that the virus property is undecidable. This has serious conse-
quences as we will discuss later.
A virus may also contain further properties or payloads apart from its
self-propagation, such as destruction or espionage, which we consider properties
separate from the virus property. Although these are of concern to the affected
user, they are not important in detecting the virus itself.

Figure 2: The subjectivity of Trojan horses (diagram labels: Software, Observer 1, Observer 2, Trojan Horse sets)

Viruses have taken many forms in the last 14 years, and we will continue to
see many variations in the future. Currently, the most common types are the
so-called macro viruses, the Win32 viruses and script viruses [Edi00]. Recently
we have seen increasing numbers of Visual Basic Script (VBScript) viruses, and we
expect to see more in the future.

The AV industry has managed to define the virus property as malicious in the
public consciousness. This is important for us, but it may not be obvious why.
Self-propagation need not be malicious and in nature is usually encouraged.
There have been attempts to create “good” self-replicating agents (viruses). However,
the AV industry has done a good job of pointing out that there is nothing
that can be done in computing with self-replicating agents that cannot be done
in a more secure and controlled manner with non-self-replicating agents. This
allows us to use “self replicating” as a synonym of “malicious” in the context of
software.
This means that we now have a measurable property that viruses must ex-
hibit: self-replication. This property is well-defined and also always malicious,
so identifying this property will not cause false positives. Thus, we have a start-
ing point for automating the analysis of viruses. Such a system needs no human
to make the judgment call: “is the intent of this software malicious?”. The
system itself can identify a malicious object by observing the self-replication.
The downside is that there will be false negatives: viruses that are not
detected. There will always be viruses for which the “tools” for measuring the
virus property have not been invented. For this reason, humans will be called
upon to teach the analysis system how to detect these new viruses. More on
this later.
Figure 3: The principle of a scanner (diagram: a byte stream containing the marker VIRUSCODE is fed to a virus scanner; component labels: String matching, Parsers, Emulator, ...)

We do not have the same luxury with other types of malware. It is hard to pin
one particular property on “Trojan horses”. In general, “intent” is hard even for
a human to identify and is impossible to measure, but malicious intent is what
makes code a Trojan horse.
It may be possible to define some subclasses of Trojan horses. But even if we
try to define something as specific as “password stealing Trojan” we may run
into trouble. Does this include a program that reads the password file and sends
one password to another machine? This could also be a poorly designed password
authentication system. So, even in cases where we can define a property,
we cannot automatically associate this property with “maliciousness” without
human help.
The antivirus industry deals with viruses piece by piece: for every virus
found, a specific antigen is found and deployed. This means that the industry
is perpetually playing catch up. This is far from satisfactory, but works for two
reasons. Firstly, the industry can usually keep false positives very low by
only detecting known viruses. Secondly, we can remove the offending virus from
the system, but only if the virus has been analyzed beforehand.2
The classical virus scanner used to be a string search tool and a set of strings.
A version of the UNIX program “grep” that takes binary string patterns would
be sufficient. More recently, virus scanners have become much more complex.
The early string-based virus scanners were susceptible to false positives. One
way of dealing with this is to ensure that the string was unlikely to cause false
positives. This was done by making sure it was long enough and chosen carefully.
A set of checksums over certain parts of the virus is used to augment the string.
This allows so-called “exact identification”. Virus removal instructions are also
embedded in what we now call a virus signature.
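
The idea of a search string augmented by a checksum can be sketched in a few lines of Python. This is an illustration only: the Signature layout, the scan function, the sample names and the use of CRC32 over a fixed-length region starting at the match are assumptions made for the sketch, not the signature format of any actual scanner.

```python
import zlib
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Signature:
    name: str        # hypothetical virus name
    pattern: bytes   # search string taken from the virus body
    region_len: int  # number of bytes, starting at the match, that are checksummed
    region_crc: int  # expected CRC32 of that region


def scan(data: bytes, signatures: Iterable[Signature]) -> Optional[str]:
    """Return the name of the first exactly identified virus, or None."""
    for sig in signatures:
        pos = data.find(sig.pattern)
        while pos != -1:
            region = data[pos:pos + sig.region_len]
            # The string match alone is only a hint; the checksum confirms it.
            if len(region) == sig.region_len and zlib.crc32(region) == sig.region_crc:
                return sig.name
            pos = data.find(sig.pattern, pos + 1)
    return None


# Illustrative data: one "virus body" and a signature derived from it.
virus = b"VIRUSCODE-payload-bytes"
sig = Signature("Example.A", b"VIRUSCODE", len(virus), zlib.crc32(virus))
infected = b"RUBARB" * 3 + virus + b"RUBARB"
lookalike = b"RUBARB" + b"VIRUSCODE-but-different!!" + b"RUBARB"
```

Here scan(infected, [sig]) reports Example.A, while the lookalike, which contains the search string but not the checksummed region, is not reported.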
The intolerance for false positives cannot be overemphasized. Other intru-
sion detection tools have an astoundingly high false positive rate, but also fairly
high false negative rates. In contrast, the antivirus industry is a fairly mature
part of the security industry and has learned that false positives often cost
customers as much as false negatives. Dealing with fixed objects and not sentient
beings has the advantage that the product can be thoroughly tested by third
parties and the vendor.

2 Generic or heuristic disinfection is being incorporated into virus scanners, but it
is too early to tell if this technology works reliably.

Figure 4: Overview of the Digital Immune System (diagram labels: Analysis center, Central node, Company’s perimeter, Company node, Client’s PC)

An antivirus has to be updated very frequently. Very soon, we should be
updating our antivirus software on a daily, perhaps even hourly, basis to main-
tain basic protection. This means even more skilled manpower analyzing viruses
around the clock is needed. Even then, the unknown virus that is spreading on
our machine will not be found using that version, so we need more than that:
we need a Digital Immune System.

2 The Digital Immune System
The Digital Immune System was built in response to the fundamental problem
in the antivirus industry: the problem of updating the antivirus in a timely
manner [?]. As mentioned in the previous section, the industry can only ever
play catch up with the flood of viruses. However, early studies in computer
virus epidemiology showed that if you could immunize the world population of
PCs faster than viruses could spread, the problem would never reach epidemic
proportions [?].
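
The threshold effect can be illustrated with a simple susceptible-infected-susceptible (SIS) model, in which the infected fraction i of machines changes as di/dt = beta*i*(1-i) - delta*i. This model, the parameter names beta (infection rate) and delta (cure rate), and the function simulate_sis are assumptions made for this sketch; they are not taken from the studies cited above.

```python
def simulate_sis(beta: float, delta: float, i0: float = 0.01,
                 dt: float = 0.1, steps: int = 1000) -> float:
    """Euler-integrate di/dt = beta*i*(1-i) - delta*i and return the final infected fraction."""
    i = i0
    for _ in range(steps):
        i += dt * (beta * i * (1.0 - i) - delta * i)
    return i
```

In this model, when machines are cured faster than they are infected (delta > beta), the infected fraction decays towards zero; when beta > delta, it settles near 1 - delta/beta.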
The bottlenecks involved in this are threefold: it takes time for the vendor
to receive the virus sample (and this was often an unreliable process), it takes
time and human resources to analyze the virus, and it takes time for the antigen
to be distributed to the customer base. The whole cycle can take months, in
which time the virus could potentially have saturated the landscape.
In broad terms, a Digital Immune System must:

1. Discover the virus on the client’s machine
2. Capture a sample and send it to the analysis center
3. Analyze the virus automatically
4. Deliver the antigen to the client
5. Disseminate the antigen to neighbors and then to the world
To get the suspected virus into the analysis center, one needs to first identify
a sample, package it up for transport and send it via the network. We can
tolerate a certain degree of false-positives in identifying the unknown virus,
so we can use heuristics for finding these suspect samples. In IBM Antivirus,
we used a patented technique that had a remarkably low false-negative and
false-positive rate, which we used for DOS viruses. Symantec’s NAV will use
string-based heuristics on the current breed of viruses.
Packaging the sample involves adding some additional information to the ob-
ject and encrypting the package, so as to make it tamper resistant and invisible
to antivirus software scanning at gateways through which the sample must pass.
This step may also involve removing sensitive information from the object.
The virus is transported via a hierarchy of servers, each only knowing about
its children and its parent. Each node in the tree can do its own scanning so
that a sample making its way up to the analysis center will be rescanned in
case antigen has already been developed by a previous submission of the same
virus and the lower clients and servers haven’t been immunized yet. Each node
also maintains a checksum database of samples it has already seen and prevents
identical samples from reaching the next node up.
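
The upward path can be sketched as follows. The Node class, its method names and the use of SHA-256 as the checksum are assumptions made for illustration; the text above only states that each node rescans samples and keeps a checksum database.

```python
import hashlib
from typing import Callable, List, Optional, Set


class Node:
    """One server in the hierarchy; it knows only its parent."""

    def __init__(self, name: str, parent: Optional["Node"] = None,
                 scanner: Optional[Callable[[bytes], Optional[str]]] = None) -> None:
        self.name = name
        self.parent = parent
        # scanner returns the name of a known virus, or None if the sample is unknown
        self.scanner = scanner or (lambda sample: None)
        self.seen: Set[str] = set()       # checksums of samples already forwarded
        self.received: List[bytes] = []   # only used at the top of the tree

    def submit(self, sample: bytes) -> str:
        known = self.scanner(sample)
        if known is not None:
            # Antigen already exists at this level; no need to go further up.
            return f"handled at {self.name}: {known}"
        digest = hashlib.sha256(sample).hexdigest()
        if digest in self.seen:
            return f"duplicate stopped at {self.name}"
        self.seen.add(digest)
        if self.parent is None:
            self.received.append(sample)
            return f"queued for analysis at {self.name}"
        return self.parent.submit(sample)


center = Node("analysis-center")
company = Node("company", parent=center)
pc1 = Node("pc1", parent=company)
pc2 = Node("pc2", parent=company)
```

If pc1 and pc2 submit the same unknown sample, the first submission reaches the analysis center and the second is stopped at the company node.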
Since I’ll be describing the analysis in detail below, I will move on to the
down path. While the antigen is propagating down towards the leaves of the
tree, each node rescans everything that it is holding in its queue to determine
if a particular antigen handles a particular virus submission. In the case of a
widespread outbreak, it is hoped that the virus can be dealt with quicker in
this manner, and the analysis center will not be burdened with dealing with the
same virus more than once.
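
A minimal sketch of this downward pass, under the assumption that an antigen can be represented as a predicate over samples; QueueNode and receive_antigen are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Optional


class QueueNode:
    """A server holding submissions that are still waiting for an antigen."""

    def __init__(self, name: str, children: Optional[List["QueueNode"]] = None) -> None:
        self.name = name
        self.children = list(children or [])
        self.queue: List[bytes] = []

    def receive_antigen(self, detects: Callable[[bytes], bool]) -> Dict[str, List[bytes]]:
        """Rescan the local queue with the new antigen, then pass it down the tree."""
        resolved = {self.name: [s for s in self.queue if detects(s)]}
        self.queue = [s for s in self.queue if not detects(s)]
        for child in self.children:
            resolved.update(child.receive_antigen(detects))
        return resolved


leaf = QueueNode("company")
root = QueueNode("central", [leaf])
root.queue = [b"aVIRUSb", b"other"]
leaf.queue = [b"VIRUS"]
resolved = root.receive_antigen(lambda s: b"VIRUS" in s)
```

After the call, every queued submission that the antigen handles has been removed from its queue, and only the unrelated sample remains pending.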
This description overlooks many of the details. For a more detailed dis-
cussion of the communications infrastructure, please see the papers we have
published in the past [?][?].

3 The analysis center
The analysis center is arguably not the most important part of the Digital
Immune System. It is the infrastructure that immunizes the population of
PCs against an emerging virus threat. The analysis could be left to humans, if
necessary. The advantage of automation, as always, is speed and response time.
I am describing it here as an example of how a system can be designed that
automatically responds to a threat, in the hope that it inspires others dealing
with different security threats.
The analysis center is designed to receive a suspected virus sample and emit
an antigen, if it was a virus. As with many interesting problems, determining
the virus property is undecidable. The analysis center can only answer with a
“yes, I’ve found a virus”, or a “maybe, I’m not sure”, but never a “no, that was
not a virus”. This is a direct result of the undecidability of the virus property.
The maybe case cannot be decided by a machine. However, a human analyst
can use different tools to dissect the sample using methods that are not easily
automated. The advantage of automation is that far fewer samples need
be looked at by humans.
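
The two possible answers can be made explicit in code. In this sketch, Verdict, triage and the human_queue list are hypothetical names, and the replicate callable stands in for the whole replication machinery.

```python
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    VIRUS = "yes, I've found a virus"
    MAYBE = "maybe, I'm not sure"
    # Deliberately no NOT_A_VIRUS member: a failure to replicate proves nothing.


def triage(sample: bytes,
           replicate: Callable[[bytes], List[bytes]],
           human_queue: List[bytes]) -> Verdict:
    """Answer VIRUS if replicants were produced; otherwise defer to a human analyst."""
    if replicate(sample):
        return Verdict.VIRUS
    human_queue.append(sample)
    return Verdict.MAYBE
```

Only the samples that end up in human_queue need an analyst's attention.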
The analysis center must perform the following on all samples:

1. Determine whether the sample replicates and generate enough samples for
analysis.
2. Generate an antigen for that virus.
3. Test that antigen for false positives and false negatives.

Figure 5: The analysis center (diagram labels: Communications, Dataflow, Classifier, Replication controller, Replicator, Analysis, Serialization, Test)

We work under the assumption that such a system will work well nearly all the
time, and fail in the worst possible manner when it fails. We therefore put a lot
of thought and effort into the tests.
The system is designed with reliability in mind to maintain a high degree of
availability. The system is composed of a number of modules that perform well-
defined tasks and are called from a central queuing and scheduling entity that
we call the Dataflow system (see figure 5). Each module runs to completion and
then terminates. The Dataflow system monitors these modules and terminates
them if they appear to have crashed or are hung. The only part of the system
that never terminates is the Dataflow system itself.
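
The supervision of a single module can be sketched with Python's standard subprocess module. The function name run_module, the status strings and the convention that a non-zero exit code means a crash are assumptions for this illustration, not a description of the actual Dataflow system.

```python
import subprocess
import sys
from typing import List


def run_module(argv: List[str], timeout_s: float) -> str:
    """Run one module to completion; kill it if it exceeds its time budget."""
    try:
        done = subprocess.run(argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return "hung"
    return "ok" if done.returncode == 0 else "crashed"


# Example: a trivial "module" that exits cleanly within its time budget.
status = run_module([sys.executable, "-c", "pass"], timeout_s=30)
```

A scheduler built on this would requeue or flag the work item whenever the status is not "ok", while itself continuing to run.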

3.1 The modules
The system comprises many machines. Each module can run on a subset of
these machines (if the operating system matches its requirements), but only one
process is allowed to run on any given machine. This is better for reliability. If
one module crashes the machine, it doesn’t influence the other machines in the
system. It is also better for scalability as we can increase the throughput of the
system merely by adding more machines. However, this assumes that there is
no serialization point in the system. As originally conceived, there was indeed
no serialization. However, it became clear that even in IBM AntiVirus, there
must be at least a small bottleneck when adding signatures to the database.
However, virus birth rate is still low enough so that this bottleneck is not a
problem.
The most important modules in the system are: a preprocessing step: Clas-
sifier; the all-important replication: Replication Controller and Replication; the
Analysis step; and the postprocessing steps: Serialization and Testing. Figure 5
shows schematically how the modules relate to one another and how work flows
through the system.

Classifier
The roles of the modules are to prepare, to replicate, to analyze and to test. We
call the preparation phase the classifier as it determines what type of sample it
is and sets up directories and data structures for the next phases.

Replication controller
The replication phase comprises two modules. The replication controller sets
up one or more concurrent replications and then passes control to the replicators,
via the Dataflow system. When they are finished, they pass control back to the
controller, which determines whether there are enough replicants. It is the
controller that sets the general strategy for replication, but the replicators do
the actual work according to the script they have been given. This means that the
controller must do a part of the actual virus analysis.
There may be many rounds of replication involved, until enough (or any)
replicants are produced. Some viruses are difficult to replicate and require
quite obscure techniques. The controller picks these techniques as they become
necessary.
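
The controller's loop can be sketched as follows. The function control_replication, the technique names and the threshold of three replicants are assumptions made for illustration; the replicate callable stands in for a replicator run.

```python
from typing import Callable, List, Sequence, Tuple


def control_replication(sample: bytes,
                        techniques: Sequence[str],
                        replicate: Callable[[bytes, str], List[bytes]],
                        needed: int = 3) -> Tuple[List[bytes], List[str]]:
    """Try techniques in order until at least `needed` replicants have been collected."""
    replicants: List[bytes] = []
    tried: List[str] = []
    for technique in techniques:
        tried.append(technique)
        replicants.extend(replicate(sample, technique))
        if len(replicants) >= needed:
            break
    return replicants, tried


def fake_replicator(sample: bytes, technique: str) -> List[bytes]:
    """Stand-in replicator: only some techniques yield replicants for this sample."""
    outcomes = {"run": [], "run+reboot": [b"r1"], "other-image": [b"r2", b"r3"]}
    return outcomes.get(technique, [])


replicants, tried = control_replication(
    b"sample", ["run", "run+reboot", "other-image", "never-reached"], fake_replicator)
```

In this run the controller stops after the third technique, because three replicants are then available.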

Replication
The purpose of the replication is to replicate the virus onto files we have original
copies of. We need these files to generate antigen and to test it; using them
is also standard practice amongst human virus researchers. If replication is
successful, we know that the object is really a virus and we have the samples to
prove it.
The actual replication is done in an isolated environment. We currently
use emulators or virtual machines to run an entire virtual PC, although we
have developed pure hardware solutions as well for experimental purposes. Our
experiments have shown that using a real replicator machine instead of a
virtual or emulated one sometimes achieves better results and replicates
faster,3 but it is far more difficult to integrate into the automation of the
analysis center.
The virus is inserted into a prepared image with the target operating system
and applications preinstalled. When the emulator is started, a program that we
call the replication controller is run; it executes commands on the operating
system in an attempt to replicate the suspected virus. The program that does
this is script driven and has the ability to detect virus activity on its own. For
example, in some cases, the virus only installs itself to be run the next time the
machine is rebooted. In this case, the script reboots the machine.

3 Although the setup time is typically much longer.

Figure 6: The virus replicator (diagram labels: 1. Installer, with Sample, Goats, Applications; 2. Emulator/Virtual Machine, with Disk image, Operating System, Apps, File Mon, Reg Mon, RC, Comm. server; 3. Extractor, with Modified files)

On some replicators, there is also an auditing component installed on the
target system. This currently logs all filesystem and registry activity (see also
[?]). We can evaluate this data for known patterns of infection and other prop-
erties the virus exhibits, but currently we only use this mechanism to ensure
that the system is still alive and the replication controller is still active.
There are many factors that can prevent replication. For instance, the virus
may be very sensitive to the operating system variant or language. To coun-
teract this, we use a wide selection of disk images we can choose from, and the
controller makes an informed guess as to what image to try. We can also modify
the applications preinstalled on the disks, for example to change their language
version. However, it is inefficient to try all images and the controller may not
have enough information to guess correctly, so these difficult viruses will not
always be replicated automatically. It’s a consolation that these viruses will
probably not spread very far in the wild either.

Analysis
The analysis takes the replicants that were generated and attempts to generate
antigen. This phase is proprietary to the target antivirus, but involves finding
good detection signatures. In the simplest case, the virus samples are all se-
quenced, and a string is selected to have a statistically low false positive and
false negative rate. Then identification and removal information is generated
based on the replicants and their originals [?][?]. There is usually a quick test
of the generated signature at this point before the new signature is sent to be
integrated into the master set of signatures.
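
The simplest case can be sketched as follows. Both functions are hypothetical. The sketch assumes a virus that inserts its body in one piece (appending or prepending to the file), and it replaces the statistical selection described above with a cruder rule: pick the first fixed-length string that occurs in every replicant and in none of a set of clean files.

```python
from typing import List, Optional, Sequence


def subtract_original(original: bytes, infected: bytes) -> bytes:
    """Strip the bytes shared with the original file, leaving the inserted virus body."""
    limit = min(len(original), len(infected))
    prefix = 0
    while prefix < limit and original[prefix] == infected[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and original[-1 - suffix] == infected[-1 - suffix]:
        suffix += 1
    return infected[prefix:len(infected) - suffix]


def select_signature(bodies: Sequence[bytes], clean_corpus: Sequence[bytes],
                     length: int = 8) -> Optional[bytes]:
    """Return a string found in every virus body and in no clean file, or None."""
    first = bodies[0]
    for start in range(len(first) - length + 1):
        candidate = first[start:start + length]
        in_all = all(candidate in body for body in bodies)
        in_clean = any(candidate in clean for clean in clean_corpus)
        if in_all and not in_clean:
            return candidate
    return None


# Illustrative data: two original files and their infected counterparts.
originals = [b"GOAT-ONE-0000", b"GOAT-TWO-1111"]
virus = b"\x90\x90VIRUSCODE\xcd\x21"
infected = [originals[0] + virus, virus + originals[1]]
bodies = [subtract_original(o, i) for o, i in zip(originals, infected)]
signature = select_signature(bodies, [b"clean \x90\x90VIRUSCO program"])
```

The first two candidate strings also occur in the clean file and are rejected, so the selected signature starts two bytes into the virus body.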

Figure 7: The analysis (diagram labels: Raw samples, Preanalysis & sorting, Subtraction of originals, Pure samples, Pure viruses, Sequencing, Signature)

Serialization
The signature is added to the current set of signatures at a serialization point
in the system.

Testing
The new combined signature set is tested against the samples that were gener-
ated as well as a standard set of clean files.
Then, at last, the new signature file is released to the communications system
for transportation to the customers.
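
The release decision can be sketched as a check for both kinds of error. The function check_signature_set is hypothetical, and signatures are reduced to plain search strings for the purpose of the illustration.

```python
from typing import Dict, List, Sequence, Union


def check_signature_set(signatures: Sequence[bytes],
                        replicants: Sequence[bytes],
                        clean_files: Sequence[bytes]) -> Dict[str, Union[bool, List[bytes]]]:
    """Check a signature set against known replicants and a corpus of clean files."""
    def detected(data: bytes) -> bool:
        return any(sig in data for sig in signatures)

    false_negatives = [r for r in replicants if not detected(r)]
    false_positives = [c for c in clean_files if detected(c)]
    return {
        "release": not false_negatives and not false_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
    }


report = check_signature_set([b"VIRUSCOD"],
                             [b"goat1VIRUSCODE", b"VIRUSCODEgoat2"],
                             [b"hello world"])
```

A signature set is released only if it misses none of the replicants and flags none of the clean files.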

There is another dimension to these modules: the different types of viruses that
the system handles. In the classifier phase, it is determined what type of sample
we were sent. Currently we handle PC-DOS, Macro, Win32 and can partially
handle VB-Script viruses. In future there will be other types of viruses for which
new modules will have to be written and, of course, the modules will have to be
improved as the viruses evolve.

3.2 Status
A version of the Digital Immune System clients, communication system and
analysis center has been successfully tested in a pilot program for some select
Symantec customers. In regression tests, it handles most of the macro viruses
and about half of the known Win32 and DOS viruses, although this number
is always increasing. It should also be noted that we expect to do better in
production use as the test set we use includes viruses that either do not run or
barely run. Input from the wild should usually only include those viruses that
can actually replicate; otherwise they would never have spread in the first place.
Currently, the system is being expanded for production use, so that it will be
capable of handling the expected load a widescale deployment would generate.
This should be deployed this year.
However, even if the communications part of the Digital Immune System is
finished, the analysis center is in constant flux, as new types and subtypes of
viruses are created. The analysis center can handle any previously known type
of virus, but must be taught how to handle new types of viruses. Fortunately,
the introduction of new types of viruses occurs far less frequently than new
viruses of existing types. Most viruses are members of large families, as they
are merely modified versions of other viruses.

4 Other applications of this methodology
Since the methodology of the analysis center is so successful in verifying and
analyzing computer viruses, could this technology be applied to other security
threats? We have not actively tried this yet, but have been thinking of how this
could be done for a while.
It was comparatively easy to apply such technology to viruses as this class of
Malware has an objectively measurable property, replication, which is accepted
as being malicious. In order to detect other forms of Malware, we will need to
find just such a property for each class of Malware we are interested in.
This is fraught with difficulties. For instance, take a trojanized login pro-
gram. The standard login program has been (maliciously) modified to accept a
particular login name and password combination as root. This is hardcoded into
the program instead of added to the password file. This Trojan gives the hacker
a backdoor into the system which isn’t obvious to the administrator looking at
the password file.
If we know that a hacker did this modification, then we can safely call this a
Trojan horse. But what if, for example, the hardware maintenance subcontrac-
tor for the company made this modification, so that they could monitor their
hardware for defects remotely? Is this still a Trojan horse? Even if this was an
ill-advised design decision, we probably wouldn’t call it a Trojan.
Hypothetically however, we could call all such backdoors malicious (and
annoy the subcontractor no end). In an automatic analysis center, we would
have to design tests for such a property. Optimistically, we can find feasible
stimuli and appropriate sensors that can verify our suspicions. So ignoring all
the technical difficulty, we can find some properties that can be objectively
tested against. As we have defined this property malicious, we can automate
the detection of this type of Trojan horse.
Could we do this for all Malware? Probably not. We could define a large
set of comprehensive properties and allow exceptions that we then classify as
benign, but I believe there will always be a grey zone of software where people
will disagree about its intent.
Could one apply this technology to other types of intrusions? The advantage
of dealing with objects is that we are given an inanimate non-sentient entity
with fixed properties that we can test for as long as we need to. When dealing
with human intrusions, we have none of these advantages. The entity we are
studying cannot be easily confined for observation (and may not cooperate if we
could). Nor does it have a finite set of properties. At any time, it may change
its behavior so that obtaining any meaningful signature is unlikely.
There are conceptually similar ideas that have been proposed and sometimes
put into practice. Studying hacker behavior is sometimes done by routing
the hacker’s traffic into a so-called honeypot – an isolated network which superficially
resembles the real network the administrator is trying to protect. By doing this,
the administrator can learn of potential weaknesses in the defenses, or previously
unknown exploits. Unfortunately, it seems unlikely that the setting up of the
honeypot and the observation of the hacker can be automated in any meaningful
way.
There may be a type of hacker attack from which meaningful signatures
may be gleaned. Some hackers use scripts to execute the actual attack – we call
these hackers “script kiddies”. As these are systematic by nature, by looking
for unusual but repeated patterns in audit data we can develop signatures for
these attacks that can be implemented in an intrusion detection system. This
would be no trivial task and the technique of collecting data is not as controlled
as in the case of viruses. The data may therefore be noisier.
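
One way to look for unusual but repeated patterns is to count short event sequences that recur across sessions but never occur in a baseline of normal activity. The function repeated_patterns, the event names and the n-gram approach itself are assumptions made for this sketch; no such system is described in this paper.

```python
from collections import Counter
from typing import Dict, Sequence, Set, Tuple


def repeated_patterns(sessions: Sequence[Sequence[str]],
                      baseline: Sequence[Sequence[str]],
                      n: int = 3,
                      min_sessions: int = 2) -> Dict[Tuple[str, ...], int]:
    """Return event n-grams absent from the baseline that occur in at least min_sessions sessions."""
    def ngrams(events: Sequence[str]) -> Set[Tuple[str, ...]]:
        return {tuple(events[i:i + n]) for i in range(len(events) - n + 1)}

    usual: Set[Tuple[str, ...]] = set()
    for events in baseline:
        usual |= ngrams(events)

    counts: Counter = Counter()
    for events in sessions:
        counts.update(ngrams(events) - usual)  # count each pattern once per session
    return {gram: c for gram, c in counts.items() if c >= min_sessions}


baseline = [["login", "ls", "logout"]]
sessions = [
    ["login", "probe", "exploit", "mkuser", "logout"],
    ["login", "ls", "logout"],
    ["probe", "exploit", "mkuser"],
]
candidates = repeated_patterns(sessions, baseline)
```

With this toy audit data, the only candidate signature is the sequence probe, exploit, mkuser, which appears in two sessions and never in the baseline.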

5 Conclusion
Computer security is inherently hard; this much is known for sure. As new
opportunities for misuse show up every day, we can only hope to patch up the
problems as fast as they get abused. With the Digital Immune System analysis
system, we have gone very far in automating the process of dealing with threats
from new viruses, an important part of computer security. We’ve been helped
by the fact that viruses are well-defined entities and that we can equate the
virus property unequivocally with maliciousness.
The success of this approach raises the question of whether we can find similar
techniques to deal with other types of Malware or even other forms of intrusion.
A first step seems to be finding a precise definition and then developing tests
for the properties of the entities. I believe this is possible with Malware, but
whether this approach will result in a useful system remains to be seen. The
approach applied to other forms of intrusion will only have limited success as
we are dealing directly with human beings.

References
[Edi00] Editors. Prevalence table. Virus Bulletin, page 3, September 2000.