

Static Code Analysis to Detect Software Security Vulnerabilities -
Does Experience Matter?

Dejan Baca 1,2, Kai Petersen 1,2, Bengt Carlsson 2, Lars Lundberg 2

1 Ericsson AB, Box 518, SE-37123 Karlskrona, Sweden
dejan.baca@ericsson.com, kai.petersen@ericsson.com

2 School of Engineering, Blekinge Institute of Technology
{dejan.baca | kai.petersen | bengt.carlsson | lars.lundberg}@bth.se

Abstract

Code reviews with static analysis tools are today recommended by several security development processes. Developers are expected to use the tools' output to detect the security threats they themselves have introduced in the source code. This approach assumes that all developers can correctly identify a warning from a static analysis tool (SAT) as a security threat that needs to be corrected. We have conducted an industry experiment with a state-of-the-art static analysis tool and real vulnerabilities. We have found that average developers do not correctly identify the security warnings and that only developers with specific experiences are better than chance in detecting the security vulnerabilities. Specific SAT experience more than doubled the number of correct answers, and a combination of security experience and SAT experience almost tripled the number of correct security answers.

1. Introduction

In recent years software companies have concerned themselves with security related threats connected to their products' source code. This is related to customers demanding high security in software products and to software products being used in networks where software vulnerabilities can be exploited (e.g., for intrusion) [1]. With these new threats, software companies have to improve the security of their products by preventing code vulnerabilities. In consequence, software companies now have to develop products with secure code and verify that legacy code is secure. This is a task that can slow ongoing development and is very expensive if the product has a large legacy code base.

To achieve this task, one has to identify the defects that cause source code vulnerabilities. Two main semi-automated alternatives are possible, namely static and dynamic detection of faults. The dynamic detection of faults is done late in the process, as it requires the software system to be complete and executable. If the system fails, the location of the cause of the failure then has to be identified. Static analysis does not require that the system is executable. Instead, it can be run on incomplete parts of the system (e.g., just a small part of the overall code base) and highlights the cause of a possible failure. This leads to two main benefits: 1) faults can be detected early in development and are thus much less expensive to fix compared to late detection [3], and 2) the location of a failure does not have to be identified.

From a security perspective, the analysis itself is not able to determine which of the identified problems are security vulnerabilities. Instead, this judgment has to be made by developers who use the output of the static code analysis tools. In order to select the right people for the task, one has to know which experience is necessary to succeed in judging the output of the static code analysis tool. That is, if certain types of experience (e.g., knowledge in security, knowledge in a programming language, and experience with the static code analysis tool) matter, then they have to be considered when assigning people to the task. Several studies and articles have shown that SATs are indeed useful tools with a lot of potential [9][10][11], but all previous studies have been done by experts, assuming that developers using the tool will obtain the same results. That is, no empirical evidence is provided on the impact of experience on the results of static code analysis usage.
To address this research gap, the aim of this study is to determine the impact of experience on the correct identification and classification of faults identified by a static code analysis tool (SAT). The study has been conducted as an industry experiment with 34 developers. Furthermore, each developer's perceived confidence in his or her answers has been collected to control whether the answers can be considered non-random.

We therefore want to answer whether SATs are useful for average developers as a vulnerability detector, whether developers with certain experience get better results, and whether developers can identify when they need aid in interpreting the SAT.

In Section 1 the introduction and research problem are presented. Section 2 explains the background and related work. In Section 3 our research method is explained, followed by the results in Section 4. In Section 5 we discuss the impact and possible reasons for our results, followed by our conclusions in Section 6.

2. Background and Related Work

The first static analysis tools (SAT) were simple program checkers [6] that were no more sophisticated than a simple grep or find command. These simple SATs were soon followed by several commercial tools that, for a variety of coding languages, tried to achieve an automated source code review. These tools capture the most common faults in a particular coding language and help the developers to create more stable code. The tools represent a varying degree of sophistication, and some utilize complicated techniques such as abstract interpretation to achieve a more complete automation [7]. Coding faults also account for a large share of software vulnerabilities, and SATs have the capability to detect these faults. SATs are therefore good security touch points that can be introduced early in the development process [8]. Experiences from industry show worse than expected results [12] and, even with a SAT, faults that should have been detected have slipped through the static analysis process.

A SAT is used early in the development process to prevent fault slip-through [14], i.e. catching the failures before testing is cheaper than finding them during testing and then having to correct and retest the faults. The SAT is mostly automated; thus, it is fast and inexpensive. Still, there is a human factor when examining whether a warning should be corrected; it is needed because SATs produce false positives [15]. A warning is considered false if its statement is considered wrong or if the developer does not believe it needs correcting. The tool Coverity Prevent has a 20% false positive rate [12]. The same study also suggests that about 5% of the tool generated warnings have been security faults that need correction.

3. Research Method

3.1 Variables

Two different types of variables are usually considered when conducting experiments, namely independent variables and dependent variables. The independent variables (or treatments) are what the researcher controls, and the dependent variables are the measured outcomes. In this case, the variable experience is controlled. As outcome variables we consider the ratio of different fault types classified correctly by the developers (see Figure 1).

Figure 1. Independent and dependent variables.

The dependent variables are divided into three groups depending on the warning type:
• False positive – an incorrectly reported warning.
• True positive – a correctly identified fault.
• Security – a correct warning that can propagate into one of the four security vulnerabilities that are explained in Section 3.3.
Security and true positives are divided because the warnings can be interpreted and corrected in different ways depending on how they are reclassified.

The independent variables examine how experienced the developers are with the coding language, software development, the product, software security, and the SAT. The developers' answers are rated on a four-grade scale, from guessing to very confident. The experience data are then used for further analysis of different groups compared to their ability to classify a warning.

3.2 Subjects

The software developers were randomly selected from the list of developers at Ericsson AB, a major telecom company, from different development sites (Sweden and India). The developers vary in experience and have used or will use the SAT as part of their daily work. The study was voluntary and all the developers had the same briefing prior to the study. In total 34 developers answered the questionnaire. From the total group, 15 had experience in software security and 14 had used a SAT before the study. The developers were divided into three groups depending on their general software engineering experience. To determine the developers' experience we examined how long they had worked in software development and how long they had experience of development in the product's programming language. The low experience group had 10 developers with less than 2 years of development experience. The medium experience group had 14 developers, while the high experience group had the remaining 10 developers. The high experience group consisted of developers with more than 6 years of development experience. Specific knowledge groups consisting of security and SAT experience were also examined. Specific knowledge was not measured in time; instead, developers answered a simple yes or no question.

3.3 Instrumentation

The instruments used in the experiment are written guidelines, forms used by the subjects to record their results, and the tools and systems used.

Written guidelines: For the experiment, each subject received one page explaining their task, including explanations of the classification types (false positives, true positives, and security faults). Security faults are further explained in the instructions following the definitions of vulnerabilities [16]. This definition includes the coding faults that cause a denial of service attack, unauthorized disclosure, unauthorized destruction of data, and unauthorized modification of data.

Forms: The evaluation form was created from a random sample of warnings to create eight false positives, eight true positives and six security warnings. All developers then classified the same randomly generated sample. At the time of the investigation the product had an approximate ratio of 25 true positives and 8 false positives for every vulnerability, i.e. true positives are underrepresented and security warnings are highly overrepresented in the questionnaire due to the need for sufficient data points. The original classification was done by developers working on the product. The random sample was also examined further to ensure that the original classification was correct. In some cases the vulnerabilities were also practically verified to be accurate. The accuracy had been confirmed as follows: the warnings were taken from a mature product still under development. The source code, in use for over a year, had some known flaws, both detected and undetected by the SAT. The results from the developers were compared to a correct template, i.e. a previous study that methodically examined and/or practically confirmed those warnings when doubt arose. Also, some warnings were confirmed by trouble reports from testers and end users. The sample contained 13 different checkers that each looked for a specific type of fault. These checkers can be grouped based on the characteristics of the fault they detect. The following groups of warnings provided in the form are of security relevance (a small illustrative sketch follows the list):
• Memory faults (e.g., buffer overflows) lead to memory corruptions which can be exploited (e.g., for code injection) [16].
• Null-pointer exceptions allow user-induced segmentation faults [16].
• Initialization checkers determine if allocated resources are used before they are properly initialized, causing information leaks or segmentation faults [16].
• Race conditions allow locking or linking of system resources (e.g., processor or physical resources) [16].
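To make these four groups concrete, the following minimal C++ fragments sketch the kind of code such checkers typically flag; each fragment deliberately contains the fault of its group on the commented line. They are illustrative assumptions only and are not taken from the studied product or from the Coverity checkers.

#include <cstring>

void memory_fault(const char *user_input)
{
    char buf[32];
    strcpy(buf, user_input);       // memory fault: no bounds check, a long input overflows buf
}

int null_pointer(const int *value)
{
    return *value;                 // null-pointer: dereferenced without checking for NULL
}

int uninitialized_read()
{
    int flags;                     // initialization: flags is read before it is assigned
    return flags & 0x1;
}

int counter = 0;                   // shared between several threads
void race_condition()
{
    counter++;                     // race condition: unsynchronized update of a shared resource
}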
Tools and systems used: In this study we chose the SAT Coverity Prevent, version 3.1. Coverity Prevent is one of the state-of-the-art SATs and has been used in other studies to detect customer reported bugs and vulnerabilities [13]. This tool was chosen after an internal study showing that it combined a low false positive rate with a high rate of reported faults. Because the tool is automated and has a low false positive rate (10-20%), industry often uses the SAT as a drop-in tool without any training. A more security focused SAT, Fortify, was not used in this study due to its higher false positive rate, and other "free-ware" tools were too simplistic for an industrial setting.

3.4 Operation

The study was conducted in two distinct steps. In step one the developers received an introduction to the study, where each developer received exactly the same information. This included a description of their task (e.g., an explanation of the classification) and they were informed that they should do the task individually and should not talk to their peers about it. Furthermore, the forms were explained and the written guidelines were handed out to the subjects. In the second step the developers conducted the experiment on their own. One hour was recommended for the task by the experimenters. After the individual experiment, the subjects sent back their filled-in forms for analysis.

3.5 Threats to Validity

One threat to the generalizability of the results of the study is the usage of one specific SAT. In order to mitigate this risk we chose a state-of-the-art tool with large industrial acceptance. The largest difference between tools is not the type of defects detected, but the number of false positives compared to false negatives. Because developers examine the output of the tool, the user friendliness can affect the results. But all previously examined SATs, open source tools as well as Klocwork and Fortify, presented their results in the same way. All SATs showed the results as warning text, either in a report or directly in the source code. Because all top tools present the results in the same way, we can assume that the tool's interface would not affect generalization. Another threat is whether the results are applicable for the industry or not. As this experiment entirely includes industrial developers, the results can be generalized to industry developers in general. Overall, the external validity of the study can thus be considered high.

To mitigate the risk of factors other than experience influencing the outcome of the study, the study was voluntary and stressed developers could ignore it. All developers got the same initial information and used the same interface to examine and report their findings. Developers did the study individually and had the same time frame. Thus, the risk of other factors affecting the dependent variables is reduced.

3.6 Static analysis tool output examples

SATs often provide output directly in the source code so that developers can read the warnings and at the same time see their own code, and thereafter judge whether it needs to be corrected. In the examples, rows starting with a number followed by a ':' are actual source code, while rows without a starting number are warnings from the tool. Example 1 is a simple off-by-one buffer overflow that does not cause any security problems but should be fixed by the developers.

Example 1. A simple off by one warning.

Event overrun-local: Overrun of static array "buff" of size 32 at position 32 with index variable "32"
30: buff[32] = '\0';
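For context, a hypothetical fragment of the kind that produces such a warning, together with the usual one-line correction, could look as follows; the function and parameter names are assumptions and not the product code.

#include <cstring>

void copy_name(const char *input)
{
    char buff[32];                        // same size as in the warning above
    strncpy(buff, input, sizeof(buff));   // may use all 32 positions without writing a terminator
    // buff[32] = '\0';                   // the flagged statement: index 32 is one past the end
    buff[sizeof(buff) - 1] = '\0';        // corrected statement: terminate at the last valid index, 31
}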
But the tool can produce more complex warnings where the fault only occurs if the program follows a specific flow. These warnings are harder to understand and the developers have to read and understand every warning message. Example 2 has a more complex flow: if three conditions are met, the program returns with an error but does not de-allocate the memory it has used. This is made even worse because the operation is initiated from a user message and the three conditions are controlled by the user. In this case an outside user could use this vulnerability to use up all of the system's memory and create a denial of service attack.

Example 2. A more complex denial of service attack, where three conditions have to be met before the attack is possible.

Event alloc_fn: Called allocation function "operator new (unsigned int)"
7859: daData = new CcnDdrSet(daTagNo);

At conditional (1): "StringTokenizer::hasMoreTokens() != 0" taking true path
7863: if (st.hasMoreTokens())
7864: {

At conditional (2): "DicosString::compare(const char *) const != 0" taking true path
7867: if (daAmountString.compare("") != 0)
7868: {

At conditional (3): "daAmount > 1000000" taking true path
7872: if (daAmount > MAX_DA_AMOUNT)
7873: {

Event leaked_storage: Returned without freeing storage "daData"
7875: return false;
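A warning of this kind is typically resolved by releasing the allocation on the error path (or by giving the object a scoped owner). The sketch below is a generic, simplified illustration of that fix; the class name is taken from the warning, but the surrounding function, the boolean parameters standing in for the three user-controlled conditions, and the final delete are assumptions rather than the product's actual code.

struct CcnDdrSet { explicit CcnDdrSet(int) {} };   // stand-in for the product class

bool handle_message(int daTagNo, bool cond1, bool cond2, bool cond3)
{
    CcnDdrSet *daData = new CcnDdrSet(daTagNo);
    if (cond1 && cond2 && cond3)
    {
        delete daData;            // release the allocation on the early error return
        return false;             // the leaked_storage warning no longer applies
    }
    // ... normal processing of daData ...
    delete daData;                // simplified; the real code would hand over ownership instead
    return true;
}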
4. Data Analysis

In this experiment a group of 34 developers have individually classified 8 (272 in total) false positives, 8 (272 in total) true positives and 6 (204 in total) security warnings. In Figures 2 to 6 the average chance rate of 33% is shown in every graph for reference, but note that the actual chance rate is 36% for false positives and true positives and 27% for security warnings, due to the different number of items in each category. We calculate chance based on the number of correct answers a developer would get if he assumed all warnings were true.
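As a worked check, these rates follow from the composition of the form, which contains 8 + 8 + 6 = 22 warnings:

8 / 22 ≈ 36% – chance rate for the false positive and true positive categories (8 items each)
6 / 22 ≈ 27% – chance rate for the security category (6 items)
(36% + 36% + 27%) / 3 = 33% – the average chance rate drawn in the graphs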
Figure 2 shows what percentage of the warnings is correctly classified by the developers. In Section 3.1 the three warning groups are explained. Only the true positives are identified better than chance, while both false positive and security warnings are within the chance rate of 27-36% (depending on whether the category contains 6 or 8 warnings).

Figure 2. Correct classifications per warning type.

In Figure 3 the first results are divided into groups depending on how confident the developers were in their classification of the warnings. In this division four groups are created: 157 warnings are classified as very confident, 325 as confident, 175 as not confident and 97 as guessing.

True positives behave almost as expected, i.e. less confident answers have fewer correct answers. False positives, on the other hand, behaved in exactly the opposite way. That is, the results are better when the developers are uncertain in their answers. Security vulnerabilities vary within the chance range and no improvement claims can be made with confidence.

Figure 3. Correct answers per confidence group.

Because Figure 3 did not show that confidence improves the identification of security vulnerabilities, we have excluded confidence from further analysis and all answers, from very confident to guessing, are used.

In Figure 4 the results are grouped by the developers' general experience. We get three groups with 10 developers in the low group, 14 in the medium and 10 in the high experience group. The results do not vary noticeably between the three groups. The security vulnerabilities also stay within the chance range.

Figure 4. Classifications grouped by the developers' general experience.

Figure 5 examines the two most important specific experiences. Here, 15 developers with security experience are compared to 19 without. Furthermore, 14 developers with prior knowledge of SATs are compared to the remaining 20 developers with no knowledge of SATs.

While the security group is better than the non-security group in detecting vulnerabilities, the result is only just better than chance. The group with SAT experience is the first group with a substantial improvement in identifying the security vulnerabilities, and the non-SAT group has a worse security detection result.

Figure 5. Classifications based on two specific experience groups.

The best security detection group is created by combining developers with both security experience and SAT experience. This group consists of only 7 developers, while the remaining 27 developers form the comparison group. The security (SEC) & SAT group correctly classified 67% of the security vulnerabilities, which is an improvement over the SAT-only group. Figure 6 shows developers with both SAT and security experience compared to the developers with no specific or only one specific experience.

Figure 6. Correct classifications from developers with a combined specific experience.

5. Discussion
When software tools are evaluated for industry or written about in articles, the focus is almost always on the tools' capacity. Several studies of static analysis tools have shown the tools' ability to detect coding faults, failures and even security vulnerabilities. There are several different tools to choose from that have different behaviors and characteristics. Different characteristics might be preferred depending on where in the company or in the development cycle the tool is deployed. If the static analysis tool is used by a security assurance team, the soundness and capability of the tool might be more important than its false positive rate. But if the tool is intended to be used as an early fault detector on a daily basis, then the tool's actual capacity might, due to inexperienced developers, be lower than its full potential.

In Figure 2 we examined the average result of a developer in classifying warnings. Without creating any groups regarding experience, or removing any answers due to low confidence, we clearly see that true positives have mostly been classified in the correct way.

Dividing the answers based on the developers' confidence in their classification does not show the expected results. Relatively few answers are guessed and the majority of developers are confident in their classification. A positive effect of confidence is expected for both the security and the false positive warnings. Instead, the judged security warnings are randomized and the false positive result improved the more uncertain the developers were, i.e. the outcome improves with increased uncertainty. The difference in results indicates that developers do not correctly judge how correct their classification is. We therefore do not believe that developers can judge when they need assistance in classifying a SAT warning.

The general development experience did not have the expected impact on the number of correctly identified security vulnerabilities or any other warnings. It could be assumed that developers who have spent more time in development would be better than novice developers in correctly classifying all types of warnings. While we see a small improvement from low to medium experienced developers, this improvement stagnates within the high experience group. These results are positive for the SAT as they show that every developer within the project can use the tool and get similar results; how good these results are is another issue. A cause of the lack of improvement could be that the output from the tool is so obscure and hard to read that even more experienced developers have problems understanding the warnings.

For the specific experiences we see the first improvement in detecting security vulnerabilities. Tool experience is the single most important factor. Security experience, on the other hand, did not improve the result substantially. But for the best results the developers should have both security and SAT experience. No other specific experience showed better results, indicating that educating developers is not the best way to get better SAT classifications. Instead, they need to get practical experience from projects combining security and the use of a SAT.

The tool's user friendliness comes into question because tool experience was the most important factor when classifying warnings. Because all tools use the same method to present their results, it can be considered an industry accepted standard that everyone follows. More specific studies have to be conducted to determine if the output can be presented in a better way and if the classification result can be improved. But a better solution would be to reduce or eliminate false positives entirely so that developers could always assume the tool was correct.

So, depending on whether the tool is used by a security assurance team or as an early fault detection tool, the outcome may vary. For a security assurance team a sound tool with a higher false positive rate and a minimum of un-reported warnings is acceptable and even preferable. An early fault detection tool is used daily during implementation and therefore a low false positive rate with a slightly increased rate of un-reported warnings may be preferred. Combined with our results, the importance of a low false positive rate in the SAT is further strengthened.

6. Conclusion
Overall, all developers in the experiment are good at identifying true positives, while false positives and security vulnerabilities are much harder to classify correctly. Neither false positives nor security vulnerabilities are identified better than chance. Also, developers are bad at judging how correct their own classifications are.

We see no improvement in the classification with increased development experience, nor any individual experience that helps the developers in detecting the false positives. For security vulnerabilities, a combination of security experience and SAT experience leads to the best results, where using the tool seems more important than having general experience in security. The combination of SAT and security experience almost tripled the number of correct security answers. But just deploying a SAT to all developers will not initially result in better than random classifications of security vulnerabilities. With time, as the developers' understanding of the tool increases, their security results will improve (more than doubling the number of correct answers).

Another observation of this study is that developers trust the tool and often assume that the tool is correct in its detection. At the same time no experience improved the false positive rate, and the developers could not determine how well they had classified the warnings. These observations show that for a SAT to be an effective early fault or vulnerability detector its false positive rate has to be very low, which is an important factor in the tool's usefulness.

7. References

[1] M.T. Hunter, R.J. Clark, and F.S. Park, "Security issues with the IP multimedia subsystem (IMS)," Proceedings of the 2007 Workshop on Middleware for Next-Generation Converged Networks and Applications, Port Beach, California: ACM, 2007, pp. 1-6.
[2] M. Daran and P. Thévenod-Fosse, "Software error analysis: a real case study involving real faults and mutations," SIGSOFT Softw. Eng. Notes, vol. 21, 1996, pp. 158-171.
[3] B.W. Boehm, "Software engineering economics," Software Pioneers: Contributions to Software Engineering, Springer-Verlag New York, Inc., 2002, pp. 641-686.
[4] B. De Win et al., "On the secure software development process: CLASP, SDL and Touchpoints compared," Information and Software Technology, in press, corrected proof.
[5] B. Boehm and V. Basili, "Top 10 list [software development]," Computer, vol. 34, 2001, pp. 135-137.
[6] J. Viega et al., "ITS4: a static vulnerability scanner for C and C++ code," Computer Security Applications, ACSAC '00, 16th Annual Conference, 2000, pp. 257-267.
[7] S. Hallem, D. Park, and D. Engler, "Uprooting Software Defects at the Source," Queue, vol. 1, 2003, pp. 64-71.
[8] N. Mead and G. McGraw, "A Portal for Software Security," Security & Privacy, IEEE, vol. 3, 2005, pp. 75-79.
[9] B. Chess and G. McGraw, "Static analysis for security," Security & Privacy, IEEE, vol. 2, 2004, pp. 76-79.
[10] N. Ayewah et al., "Using Static Analysis to Find Bugs," Software, IEEE, vol. 25, 2008, pp. 22-29.
[11] M. Zitser, R. Lippmann, and T. Leek, "Testing static analysis tools using exploitable buffer overflows from open source code," SIGSOFT Softw. Eng. Notes, vol. 29, 2004, pp. 97-106.
[12] D. Baca, B. Carlsson, and L. Lundberg, "Evaluating the cost reduction of static code analysis for software security," Proceedings of the Third ACM SIGPLAN Workshop on Programming Languages and Analysis for Security, Tucson, AZ, USA: ACM, 2008, pp. 79-88.
[13] P. Emanuelsson and U. Nilsson, "A Comparative Study of Industrial Static Analysis Tools," Electronic Notes in Theoretical Computer Science, vol. 217, Jul. 2008, pp. 5-21.
[14] L. Damm, L. Lundberg, and C. Wohlin, "Faults-slip-through - a concept for measuring the efficiency of the test process," Software Process: Improvement and Practice, vol. 11, 2006, pp. 47-59.
[15] J. Zheng et al., "On the value of static analysis for fault detection in software," Software Engineering, IEEE Transactions on, vol. 32, 2006, pp. 240-253.
[16] I.V. Krsul, "Software vulnerability analysis," Ph.D. Dissertation, Computer Sciences Department, Purdue University, Lafayette, IN, May 1998.
