arXiv:2302.04012v1 [cs.CR] 8 Feb 2023

ABSTRACT
Recently, large language models for code generation have achieved breakthroughs in several programming language tasks. Their advances in competition-level programming problems have made them an emerging pillar in AI-assisted pair programming. Tools such as GitHub Copilot are already part of the daily programming workflow and are used by more than a million developers [50]. The training data for these models is usually collected from open-source repositories (e.g., GitHub) that contain software faults and security vulnerabilities. This unsanitized training data can lead language models to learn these vulnerabilities and propagate them in the code generation procedure. Given the wide use of these models in the daily workflow of developers, it is crucial to study the security aspects of these models systematically.

In this work, we propose the first approach to automatically finding security vulnerabilities in black-box code generation models. To achieve this, we propose a novel black-box inversion approach based on few-shot prompting. We evaluate the effectiveness of our approach by examining code generation models in the generation of high-risk security weaknesses. We show that our approach automatically and systematically finds 1000s of security vulnerabilities in various code generation models, including the commercial black-box model GitHub Copilot.

Figure 1: We systematically and automatically find vulnerabilities and associated prompts by approximating the inverse of the black-box code generation model F via few-shot prompting. Given a vulnerability, we use the black-box code generation model itself to find relevant prompts and check if they indeed produce vulnerable code.

KEYWORDS
Language Models, Machine Learning Security, Software Security

1 INTRODUCTION
Large language models represent a major advancement in current deep learning developments. With increasing size, their learning capacity allows them to be applied to a wide range of tasks, such as text translation [8, 13] and summarization [13, 33], chatbots like ChatGPT [32], and most recently code generation and code understanding tasks [11, 20, 26, 31]. A prominent example is GitHub Copilot [16], an AI pair programmer based on OpenAI Codex [11, 24] that is already used by more than a million developers [50]. Codex [11] and other models such as CodeGen [31] and InCoder [20] are trained on a large-scale corpus of open-source code data and enable powerful and effortless code generation. Given a text prompt describing a desired function and a function header, Copilot generates suitable code in various programming languages and automatically completes the code based on the user-provided context description. According to GitHub, developers who use GitHub Copilot implement the desired programs 55% faster [50], and nearly 40% of the code written by programmers who use Copilot as support is generated by the model [16].

Like any other deep learning model, large language models such as Codex, CodeGen, and InCoder exhibit undesirable behavior in some edge cases due to inherent properties of the model itself and the massive amount of unsanitized training data [29, 46]. In fact, Codex is trained on unmodified source code hosted on GitHub. While the model is trained, it also learns the training data's coding styles and—even more critical—bugs that can lead to security-related vulnerabilities [34, 35]. Pearce et al. [34] have shown that minor changes in the text prompt (i.e., the inputs of the model) can lead to software faults that can cause potential harm if the provided code is used unaltered. The authors use manually modified prompts and do not provide a way to find the vulnerabilities of the code generation models automatically.

In this work, we propose an automated approach that finds prompts that systematically lead to different kinds of vulnerable codes and enables us to examine the models' behavior on a large scale. More specifically, we formulate the problem of finding a set of prompts that cause the code generation models to generate vulnerable codes as a model inversion. We can find the potential input scenarios using the inverse of a language model and a generated vulnerable code. However, we do not have access to the true distribution of the vulnerable codes. More crucially, it is unclear how we can access the inverse of the target model in the black-box setting. Recently, large language models have shown a surprising ability to generalize to novel tasks when provided with few-shot prompts (in-context examples) [8]. A few-shot prompt contains a few examples of a specific task to teach the pre-trained model to generate the desired output. In this work, we use few-shot prompting to guide the target black-box model to act as its inverse. In other words, we direct the model to generate the desired outputs by providing a few examples of vulnerable codes and their corresponding prompts.
Hossein Hajipour, Thorsten Holz, Lea Schönherr, Mario Fritz
By using few-shot prompting, we approximate the inverse of the code generation model in the black-box setting. We use the approximated inversion of the code generation model to generate prompts that potentially lead the models to reveal their security vulnerability issues. Figure 1 provides an overview of our black-box model inversion approach. Using our method, we have found 1000s of vulnerabilities in state-of-the-art code generation models. These vulnerabilities cover twelve different Common Weakness Enumeration (CWE) types.

In summary, we make the following contributions in this paper:
(1) We propose an approach for automatically finding security vulnerabilities in code generation models. We achieve this by proposing a novel black-box model inversion approach via few-shot prompting.
(2) We discover 1000s of vulnerabilities in state-of-the-art code generation models—including the widely used GitHub Copilot.
(3) At the time of publication, we will publish a set of promising security prompts to investigate the security vulnerabilities of the models and compare them in various security scenarios. We generate these prompts automatically by applying our approach to finding security issues of different state-of-the-art and commercial models.
(4) At the time of publication, we will release our approach as an open-source tool that can be used to evaluate the security issues of black-box code generation models. This tool can be easily extended to newly discovered potential security vulnerabilities.
We provide the generated prompts and codes with the security analysis of the generated codes as additional material.

2 RELATED WORK
In the following, we briefly introduce existing work on large language models and discuss how this work relates to our approach.

2.1 Large Language Models and Prompting
Large language models have advanced the natural language processing field in various tasks, including question answering, translation, and reading comprehension [8, 36]. These milestones were achieved by scaling the model size from hundreds of millions [15] to hundreds of billions [8] of parameters, by self-supervised objective functions, and by huge corpora of text data. Many of these models are trained by large companies and then released as pretrained models. Brown et al. [8] show that these models can be used to tackle a variety of tasks by providing only a few examples as input – without any changes in the parameters of the models. The end user can use a template as a few-shot prompt to guide the models to generate the desired output for a specific task. In this work, we show how a few-shot prompting approach can be used to invert black-box code generation models.

2.2 Large Language Models of Source Code
There is a growing interest in using large language models for source code understanding and generation tasks [11, 20, 45]. Feng et al. [17] and Guo et al. [23] propose encoder-only models with a variant of objective functions. These models [17, 23] primarily focus on code classification, code retrieval, and program repair. Ahmad et al. [4] and Wang et al. [45] employ encoder-decoder architectures to tackle code-to-code and code-to-text generation tasks, including program translation, program repair, and code summarization. Recently, decoder-only models have shown promising results in generating programs in a left-to-right fashion [11, 31]. These models can be applied to zero-shot and few-shot program generation tasks [11, 20, 31], including code completion, code infilling, and text-to-code tasks. Large language models of code have mainly been evaluated based on the functional correctness of the generated codes, without considering potential security vulnerability issues (see subsection 2.3 for a discussion). In this work, we propose an approach to automatically find security vulnerabilities of these models by employing a novel black-box model inversion method via few-shot prompting.

2.3 Security Vulnerability Issues of Code Generation Models
Large language code generation models have been pre-trained using vast corpora of open-source code data [11, 20]. These open-source codes can contain a variety of different security vulnerability issues, including memory safety violations [41], deprecated APIs and algorithms (e.g., the MD5 hash algorithm [34, 37]), or SQL injection and cross-site scripting [34, 40] vulnerabilities. Large language models can learn these security patterns and potentially generate vulnerable codes given the users' inputs. Recently, Pearce et al. [34] and Siddiq and Santos [40] showed that the codes generated by code generation models can contain various security issues.

Pearce et al. [34] use a set of manually designed scenarios to investigate potential security vulnerability issues of GitHub Copilot [16]. These scenarios are curated using a limited set of vulnerable codes. Each scenario contains the first few lines of the potentially vulnerable code, and the models are queried to complete the scenarios. These scenarios were designed based on MITRE's Common Weakness Enumeration (CWE) [1]. Pearce et al. [34] evaluate the generated codes' vulnerabilities by employing the GitHub CodeQL static analysis tool. Previous works [34, 39, 40] investigated the security issues of code generation models using a set of limited, manually designed scenarios. In contrast, in our work, we propose a systematic approach to finding security vulnerabilities by automatically generating various scenarios at scale.

2.4 Model Inversion and Training Data Extraction
Deep model inversion has been applied to model explanation [28], model distillation [47], and, more commonly, to reconstructing private training data [19, 30, 43, 49]. The general goal in model inversion is to reconstruct a representative view of the input data based on the models' outputs [43]. Recently, Carlini et al. [9] showed that it is possible to extract memorized data from large language models. These data include personal information such as e-mail addresses, URLs, and phone numbers. In this work, we use few-shot prompting to invert black-box code models. Using the inverse of the code generation models, we automatically find the scenarios (prompts) that lead the models to generate vulnerable codes.
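To make this inversion idea concrete, a few-shot inversion prompt can be assembled by pairing code snippets with the prompts that produced them, ending with the code whose prompt is to be recovered. The following is a minimal sketch under our own assumptions about the prompt layout and separator; it is not the exact format used in this paper:

```python
# Sketch: build a few-shot prompt that asks a code generation model to
# act as its own inverse, i.e., to map a (vulnerable) code snippet back
# to a natural-language prompt. Layout and separator are illustrative
# assumptions, not the authors' exact format.

def build_inversion_prompt(examples, query_code, sep="###"):
    """examples: list of (prompt_text, code_text) pairs."""
    parts = []
    for prompt_text, code_text in examples:
        parts.append(f"# Code:\n{code_text}\n# Prompt:\n{prompt_text}")
    # The final block contains only the code; the model is expected to
    # complete it with a matching prompt.
    parts.append(f"# Code:\n{query_code}\n# Prompt:\n")
    return f"\n{sep}\n".join(parts)

few_shot = [
    ("Deserialize the auth token from the request headers.",
     "token = cPickle.loads(base64.b64decode(headers['AuthToken']))"),
    ("Build an SQL query from the user-supplied name.",
     "cursor.execute(\"SELECT * FROM users WHERE name = '%s'\" % name)"),
]
prompt = build_inversion_prompt(few_shot, "html = html.replace('name', param)")
n_sep = prompt.count("###")  # two separators join the three blocks
```

Sending such a prompt to the model and sampling completions yields candidate natural-language prompts for the query code.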
Systematically Finding Security Vulnerabilities in Black-Box Code Generation Models
also includes model-generated code, especially given the black-box nature and complexity of such models. In the following, we elaborate on recent analysis methods and classification schemes for code vulnerabilities and provide an overview of the evaluated code generation models.

[Figure excerpt: a vulnerable code example showing unsafe deserialization]
5 def confirmAuth(self, headers):
6     try:
7         token = cPickle.loads(base64.b64decode(headers['AuthToken']))
8         if not check_hmac(token['signature'], token['data'], getSecretKey()):
9             raise AuthFail

3.1 Evaluating Security Issues
Table 1: List of evaluated CWEs. Eight of the twelve CWEs are in the top 25 list. The description is from [1].
CWE Description
CWE-020 Improper Input Validation
CWE-022 Improper Limitation of a Pathname to a Restricted Directory (“Path Traversal”)
CWE-078 Improper Neutralization of Special Elements used in an OS Command (“OS Command Injection”)
CWE-079 Improper Neutralization of Input During Web Page Generation (“Cross-site Scripting”)
CWE-089 Improper Neutralization of Special Elements used in an SQL Command (“SQL Injection”)
CWE-094 Improper Control of Generation of Code (“Code Injection”)
CWE-117 Improper Output Neutralization for Logs
CWE-327 Use of a Broken or Risky Cryptographic Algorithm
CWE-502 Deserialization of Untrusted Data
CWE-601 URL Redirection to Untrusted Site (“Open Redirect”)
CWE-611 Improper Restriction of XML External Entity Reference
CWE-732 Incorrect Permission Assignment for Critical Resource
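For bookkeeping, the twelve evaluated CWEs in Table 1 can be held in a small lookup table that separates them from the remaining CodeQL findings grouped under Other; the code below is our own illustrative sketch, not tooling from the paper:

```python
# The twelve CWEs evaluated in Table 1; any other CWE reported by
# CodeQL is grouped under "Other" in the analysis.
EVALUATED_CWES = {
    "CWE-020": "Improper Input Validation",
    "CWE-022": "Path Traversal",
    "CWE-078": "OS Command Injection",
    "CWE-079": "Cross-site Scripting",
    "CWE-089": "SQL Injection",
    "CWE-094": "Code Injection",
    "CWE-117": "Improper Output Neutralization for Logs",
    "CWE-327": "Use of a Broken or Risky Cryptographic Algorithm",
    "CWE-502": "Deserialization of Untrusted Data",
    "CWE-601": "Open Redirect",
    "CWE-611": "Improper Restriction of XML External Entity Reference",
    "CWE-732": "Incorrect Permission Assignment for Critical Resource",
}

def bucket(cwe_id):
    """Map a CodeQL-reported CWE id to one of the evaluated ids or 'Other'."""
    return cwe_id if cwe_id in EVALUATED_CWES else "Other"
```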
[Figure: panels (I)–(III) showing the black-box code generation model used to approximate F−1, the black-box code generation model completing the sampled prompts, and the security analyzer reporting security issues.]
Figure 4: Overview of our proposed approach to automatically finding security vulnerability issues of the code generation models.
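Read as pseudocode, the three stages of Figure 4 can be organized as follows; `generate` and `analyze` are illustrative stand-ins for the black-box model API and the static analyzer, not real interfaces:

```python
# Sketch of the three-stage pipeline in Figure 4. `generate` stands in
# for a black-box code generation model and `analyze` for a static
# security analyzer such as CodeQL; both are illustrative stubs.

def find_vulnerabilities(generate, analyze, few_shot_prompt, k=5, k_prime=5):
    findings = []
    # (I) Invert the model: sample k candidate non-secure prompts.
    prompts = generate(few_shot_prompt, n=k)
    for p in prompts:
        # (II) Complete each non-secure prompt k' times.
        for code in generate(p, n=k_prime):
            completed = p + code
            # (III) Confirm vulnerabilities with the security analyzer.
            cwes = analyze(completed)
            if cwes:
                findings.append((p, completed, cwes))
    return findings

# Tiny stubs to show the control flow:
fake_generate = lambda prompt, n: [f"{prompt}|s{i}" for i in range(n)]
fake_analyze = lambda code: ["CWE-502"] if "s0" in code else []
results = find_vulnerabilities(fake_generate, fake_analyze, "FEWSHOT", k=2, k_prime=2)
```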
4.1.2 FS-Prompt. We investigate two other variants of our few-shot prompting approach. In Equation 4, we introduce FS-Prompt (Few-Shot-Prompt).

FS-Prompt:  = F−1 ( ) ≈ F ( , ..., )    (4)

We investigate the effectiveness of each approach in approximating F−1 to generate non-secure prompts by conducting a set of different experiments.

4.2 Sampling Non-secure Prompts and Finding Vulnerable Codes
Using the proposed approximation of F−1, we generate non-secure prompts that potentially lead the model F to generate codes with particular security vulnerabilities. Given the output distribution of F, we sample multiple different non-secure prompts using a beam search algorithm [14, 44] and random sampling. Sampling multiple non-secure prompts allows us to find the models' security vulnerabilities at a large scale. Lu et al. [27] show that the order of examples in few-shot prompting affects the output of the models. Therefore, to increase the diversity of the generated non-secure prompts, in FS-Code and FS-Prompt we use a set of few-shot prompts with permuted orders. We provide the details of the different few-shot prompt sets in Section 5.

Given a large set of generated non-secure prompts and the model F, we generate multiple potentially vulnerable code samples and spot security vulnerabilities of the target model via static analysis. To generate potentially vulnerable code using the generated non-secure prompts, we employ different strategies (e.g., the beam search algorithm) to sample a large set of different codes.

4.3 Confirming Security Vulnerability Issues of Identified Samples
We employ our approach to sample a large set of non-secure prompts, which can be used to generate a large set of code completions from the targeted model. Using the sampled non-secure prompts and their completions, we can construct the completed codes. To analyze the security vulnerabilities of the generated codes, we query the constructed codes via CodeQL [25] to obtain a list of potential vulnerabilities.

In the process of generating non-secure prompts which lead to a specific type of vulnerability, we provide the few-shot input from the targeted CWE type. Specifically, if we want to sample "SQL Injection" (CWE-089) non-secure prompts, we provide a few-shot input with "SQL Injection" security vulnerabilities.

5.1.1 Code Generation Models. In our experiments, we focus on CodeGen [31] with 6 billion parameters and the Codex [11] model with 12 billion parameters. We provide the details of each model in Section 3.3. In addition to these two models, we also provide results for the GitHub Copilot AI programming assistant [16].

We conduct the experiments for the CodeGen model using two NVIDIA 40GB Ampere A100 GPUs. To run the experiments on Codex, we use the OpenAI API [2] to query the model. In the generation process, we consider generating up to 25 and 150 tokens for non-secure prompts and code, respectively. We use beam search to sample 𝑘 non-secure prompts from CodeGen. Using each of the 𝑘 sampled non-secure prompts, we sample 𝑘′ completions of the given input non-secure prompt. For the Codex model, we also set the number of samples for generating non-secure prompts and code to 𝑘 and 𝑘′, respectively. In total, we sample 𝑘 × 𝑘′ completed codes. For both models, we set the sampling temperature to 0.7, where the temperature describes the randomness of the model's output and, therefore, its variance. The higher the temperature, the more random the output, while 0.0 would always output the most likely prediction. For the few-shot prompts, we use three different sources: the examples provided in the dataset published by Siddiq and Santos [40], examples provided by CodeQL [25], and vulnerable code examples published by Pearce et al. [34].

5.1.2 Constructing Few-shot Prompts. We use the few-shot setting in FS-Code and FS-Prompt to guide the models to generate the desired output. Previous work has shown that the optimal number of examples for few-shot prompting is between two and ten [6, 8]. Due to the difficulty of accessing potential security vulnerability code examples, we set the number to four in all of our experiments for FS-Code and FS-Prompt.

To construct each few-shot prompt, we use a set of four examples for each CWE in Table 1. The examples in the few-shot prompts are separated using a special tag (###). It has been shown that the order of examples affects the output [27]. To generate a diverse set of non-secure prompts, we construct five few-shot prompts with four
examples by randomly shuffling the order of examples. In total, for each type of CWE, we use four examples. Using these four examples, we construct five different few-shot prompts. Note that each of the four examples contains at least one security vulnerability of the targeted CWE. Using the five constructed few-shot prompts, we can sample 5 × 𝑘 × 𝑘′ completed codes from each model.

5.1.3 CWEs and CodeQL Settings. By default, CodeQL provides queries to discover 29 different CWEs in Python code. Here, we generate non-secure prompts and codes for 12 different CWEs, listed in Table 1. However, we analyzed the generated codes to detect all 29 different CWEs. We summarize all CWEs that are not in the list in Table 1 but are found during the analysis as Other.

5.2 Evaluation
In the following, we present the evaluation results and discuss the main insights of these results.

5.2.1 Generating Codes with Security Vulnerabilities. We evaluate our different approaches for finding vulnerable codes generated by the CodeGen and Codex models. We examine the performance of our FS-Code, FS-Prompt, and OS-Prompt approaches in terms of quality and quantity. For this evaluation, we use five different few-shot prompts constructed by permuting the input order. We provide the details of constructing these five few-shot prompts from four code examples in subsection 5.1. Note that in the one-shot prompts for OS-Prompt, we use one example in each one-shot prompt, followed by importing relevant libraries. In total, using each few-shot prompt or one-shot prompt, we sample the top-5 non-secure prompts, and each sampled non-secure prompt is used as input to sample the top-5 code completions. Therefore, using five few-shot or one-shot prompts, we sample 5 × 5 × 5 (125) complete codes from the CodeGen and Codex models.

Effectiveness in Generating Specific Vulnerabilities. Figure 6 shows the percentage of vulnerable codes generated by CodeGen (Figure 6a, Figure 6b, and Figure 6c) and Codex (Figure 6d, Figure 6e, and Figure 6f) using our three few-shot prompting approaches. We removed duplicates and codes with syntax errors. The x-axis refers to the CWEs that have been detected in the sampled code, and the y-axis refers to the CWEs that have been used to generate the non-secure prompts. These non-secure prompts are used to generate the analyzed code. Other refers to detected CWEs that are not listed in Table 1 and are not considered in our evaluation. The results in Figure 6 show the percentage of the generated code samples that contain at least one security vulnerability. The high numbers on the diagonal show our approaches' effectiveness in finding code with targeted vulnerabilities, especially for Codex. For CodeGen, the diagonal is less distinct. However, we can also find a reasonably large number of vulnerabilities for all three few-shot sampling approaches. Overall, we find that our FS-Code approach (Figure 6a and Figure 6d) performs better in comparison to FS-Prompt (Figure 6b and Figure 6e) and OS-Prompt (Figure 6c and Figure 6f). For example, Figure 6d shows that FS-Code finds higher percentages of CWE-020, CWE-078, and CWE-327 vulnerabilities for the Codex model in comparison to our other approaches (FS-Prompt and OS-Prompt).

Quantitative Comparison of Different Prompting Techniques. Table 2 and Table 3 provide the quantitative results of our approaches. The tables show the absolute numbers of vulnerable codes found by FS-Code, FS-Prompt, and OS-Prompt for both models. Table 2 presents the results for the codes generated by CodeGen, and Table 3 for the codes generated by Codex. Columns 2 to 13 provide the number of vulnerable codes that contain specific CWEs, and column 14 (Other) provides the number of codes that contain other CWEs that we do not consider during our evaluation (CodeQL queried sixteen other CWEs). The last column provides the sum of all vulnerable codes.

In Table 2 and Table 3, we observe that our best-performing method (FS-Code) found 186 and 481 vulnerable code samples generated by CodeGen and Codex, respectively. In general, we observe that our approaches found more vulnerable codes generated by Codex in comparison to CodeGen. One reason could be the capability of the Codex model to generate more complex codes compared to CodeGen [31]. Another reason might be related to the code datasets used in the models' training procedures. Furthermore, Table 2 and Table 3 show that FS-Code performs better in finding codes with different CWEs in comparison to FS-Prompt and OS-Prompt. For example, in Table 3, we can observe that FS-Code finds more vulnerable codes that contain CWE-020, CWE-022, and CWE-089. This again shows the advantage of employing vulnerable codes in our few-shot prompting approach. For the remaining experiments, we use FS-Code as our best-performing approach.

5.2.2 Finding Security Vulnerabilities of Models at a Large Scale. Next, we evaluate the scalability of our FS-Code approach in finding vulnerable codes that could be generated by the CodeGen and Codex models. We investigate if our approach can find a larger number of vulnerable codes by increasing the number of sampled non-secure prompts and code completions. To evaluate this, we set 𝑘 = 15 (number of sampled non-secure prompts) and 𝑘′ = 15 (number of sampled codes given each non-secure prompt). Using five few-shot prompts, we generate 1125 (15 × 15 × 5) codes with each model and then remove all duplicate codes. Figure 7 provides the results for the number of codes with different CWEs versus the number of samples. Figure 7a and Figure 7b provide results for the twelve different CWEs. Note that in Figure 7a and Figure 7b, Other indicates the other sixteen CWEs that CodeQL queries.

Figure 7 shows that, in general, by sampling more code samples, we can find more vulnerable codes generated by the CodeGen and Codex models. For example, Figure 7a shows that by sampling more codes, CodeGen generates a significant number of vulnerable codes for CWE-022 and CWE-079. In Figure 7a and Figure 7b, we also observe that generating more codes has less effect in finding more codes with specific vulnerabilities (e.g., CWE-020 and CWE-732). Furthermore, Figure 7 shows an almost linear growth for CWE-022, CWE-079, and CWE-327. This is mainly due to the nature of these CWEs. For example, CWE-327 is related to using a broken or risky cryptographic algorithm. A group of broken cryptographic algorithms can be used in various codes where the model needs to encrypt a string or an input. We also qualified the provided results in Figure 7 by employing fuzzy matching to drop near-duplicate codes. However, we did not observe a significant change in the
Figure 6: Percentage of the discovered vulnerable codes using the non-secure prompts that are generated for specific CWEs. (a), (b), and (c) provide the results of the code generated by the CodeGen model using FS-Code, FS-Prompt, and OS-Prompt, respectively. (d), (e), and (f) provide the results of the code generated by the Codex model using FS-Code, FS-Prompt, and OS-Prompt, respectively.
Table 2: The number of discovered vulnerable codes that are generated by the CodeGen model using FS-Code, FS-Prompt, and
OS-Prompt methods. Columns two to thirteen provide results for different CWEs (see Table 1). Column fourteen provides the
number of found vulnerable codes with the other sixteen CWEs that are queried by CodeQL. The last column provides the
sum of all codes with at least one security vulnerability.
Methods CWE-020 CWE-022 CWE-078 CWE-079 CWE-089 CWE-094 CWE-117 CWE-327 CWE-502 CWE-601 CWE-611 CWE-732 Other Total
FS-Code 4 19 4 25 3 0 15 45 4 11 12 12 32 186
FS-Prompt 0 22 1 27 4 0 7 45 6 6 3 4 16 141
OS-Prompt 10 28 2 40 1 0 6 20 2 1 7 1 27 145
Table 3: The number of discovered vulnerable codes that are generated by the Codex model using FS-Code, FS-Prompt, and
OS-Prompt methods. Columns two to thirteen provide results for different CWEs (see Table 1). Column fourteen provides the
number of found vulnerable codes with the other sixteen CWEs that are queried by CodeQL. The last column provides the
sum of all codes with at least one security vulnerability.
Methods CWE-020 CWE-022 CWE-078 CWE-079 CWE-089 CWE-094 CWE-117 CWE-327 CWE-502 CWE-601 CWE-611 CWE-732 Other Total
FS-Code 8 43 21 81 25 20 39 82 23 29 41 22 47 481
FS-Prompt 1 40 8 79 10 4 41 50 49 28 18 1 39 370
OS-Prompt 0 47 3 49 5 3 16 55 12 13 5 9 35 252
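As a quick sanity check on the tables above, the per-CWE counts reported for FS-Code sum to the totals in the last column:

```python
# Per-CWE counts for FS-Code from Table 2 (CodeGen) and Table 3 (Codex),
# in column order CWE-020 … CWE-732 plus "Other"; the final value of
# each table row is the reported total.
fs_code_codegen = [4, 19, 4, 25, 3, 0, 15, 45, 4, 11, 12, 12, 32]
fs_code_codex = [8, 43, 21, 81, 25, 20, 39, 82, 23, 29, 41, 22, 47]

total_codegen = sum(fs_code_codegen)  # 186, as reported in Table 2
total_codex = sum(fs_code_codex)      # 481, as reported in Table 3
```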
effect of sampling more codes on the number of vulnerable codes found. We provide more details and results in Appendix A.

Qualitative Examples. Figure 8 and Figure 9 provide two examples of vulnerable codes generated by CodeGen and Codex, respectively. These two codes contain a security vulnerability of type CWE-502 (deserialization of untrusted data). In Figure 8, lines 1 to 8 are used as the non-secure prompt, and the rest of the code example is the completion for the given non-secure prompt. The code contains a vulnerability in line 14, where the code deserializes data without sufficiently verifying it. In Figure 9, lines 1 to 9 are the non-secure prompt, and the rest of the code is the output of Codex given the non-secure prompt. This code contains a vulnerability of type CWE-502 in line 14.
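The CWE-502 pattern in these examples comes down to deserializing attacker-controlled bytes with pickle. The snippet below (our own illustration, not the figures' exact code) shows why this is dangerous and sketches a data-only alternative using JSON:

```python
import base64
import json
import pickle

# CWE-502 in miniature: pickle.loads will execute code embedded in
# attacker-supplied bytes (via __reduce__), so deserializing an
# untrusted "AuthToken" header is dangerous.

class Marker:
    # Stands in for an attacker payload: unpickling calls the callable
    # returned by __reduce__ instead of restoring plain data.
    def __reduce__(self):
        return (print, ("side effect during unpickling!",))

attacker_bytes = base64.b64encode(pickle.dumps(Marker()))
result = pickle.loads(base64.b64decode(attacker_bytes))  # print runs here

# Safer: parse untrusted input with a data-only format such as JSON.
def load_token(header_value):
    return json.loads(base64.b64decode(header_value))

safe = base64.b64encode(json.dumps({"user": "alice"}).encode())
token = load_token(safe)
```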
Figure 7: The number of discovered vulnerable codes versus the number of sampled codes generated by (a) CodeGen and (b) Codex. The non-secure prompts and codes are generated using our FS-Code method.
Figure 8: A vulnerable code example generated by CodeGen. The code contains a CWE-502 vulnerability in line 14. In this example, the first eight lines are the non-secure prompt, and the rest of the code is the completion of the given non-secure prompt.

[Figure 9, continued: tail of the Codex-generated code example]
18 # use latitude/longitude to determine temperature
19 if latitude is None or longitude is None:
20     temperature = "unknown location"
21 elif abs(latitude - 35) < 10:
22     temperature = 'hot'
23 else:
24     temperature = "cold"
25
26 # return temperature
27 return temperature
Figure 9: A vulnerable code example generated by Codex. The code contains a CWE-502 vulnerability in line 14. In this example, the first nine lines are the non-secure prompt, and the rest of the code is the completion of the given non-secure prompt.

5.2.3 Transferability of the Generated Non-secure Prompts. In the previous experiments, we generated the non-secure prompts and completed codes using the same model. Now we investigate if the generated non-secure prompts are transferable among different models. For example, we want to answer the question of whether non-secure prompts generated by Codex can lead the CodeGen model to generate vulnerable codes. For this experiment, we collect a set of "promising" non-secure prompts that we have generated with the CodeGen and Codex models in Section 5.2.2. We consider a non-secure prompt promising if it leads the model to generate at least one vulnerable code. After de-duplication, we collected 381 of the non-secure prompts that were generated by the CodeGen
Table 5: The number of discovered vulnerable codes that are generated by the Codex model using FS-Code. Columns two to
four provide results for different CWEs (see Table 1). Column five provides the number of discovered vulnerable codes with the
other CWEs that are queried by CodeQL. The last column provides the sum of all codes with at least one security vulnerability.
1 from flask import Flask, request, make_response
2 def no_template_engine_replace():
3     '''
4     get the param data from request
5     open html file and replace the name with the param data
6     return the html file as reponse
7     '''
8     param = request.args.get('param', '')
9     with open('index.html', 'r') as f:
10         html = f.read()
11         html = html.replace('name', param)
12     response = make_response(html)
13     response.headers['Content-Type'] = 'text/html'
14     return response

generate a significant number of codes with CWEs if they are used with another model. Specifically, we have found that non-secure prompts sampled via Codex can even find a higher fraction of vulnerabilities generated via CodeGen.

In our experiments with GitHub Copilot, we have shown that our few-shot prompting approach also works for commercial black-box models, and specifically for a model that is already used by millions of developers. This also indicates that vulnerabilities in automatically generated codes are not solely an academic problem but already an issue that needs to be considered during the development of AI-assisted pair programming tools. Even though a developer will take care that their code is as secure as possible, they cannot check all cases, and utilizing a model that will not generate or that suppresses vulnerable code can already prevent a lot of potential harm and unpredictable costs.
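A common fix for the CWE-079 (cross-site scripting) pattern in the listing above is to escape the user-controlled parameter before splicing it into HTML. The sketch below is our own illustration using the standard library's html.escape; `render_page` is a hypothetical stand-in, not the paper's code:

```python
import html

# Sketch of a mitigation for the CWE-079 pattern: escape the
# user-controlled parameter before inserting it into an HTML page.
# `render_page` is an illustrative stand-in for the vulnerable
# replace-based rendering shown above.
def render_page(template_html, param):
    return template_html.replace("name", html.escape(param))

template = "<h1>Hello name</h1>"
attack = "<script>alert(1)</script>"
page = render_page(template, attack)
```

After escaping, the injected markup is rendered as inert text rather than executed by the browser.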
REFERENCES
[1] 2022. CWE - Common Weakness Enumeration. (2022). https://cwe.mitre.org
[2] 2022. OpenAI APIs. (2022). https://beta.openai.com/docs/introduction, as of February 9, 2023.
[3] 2022. TheFuzz. (2022). https://github.com/seatgeek/thefuzz, as of February 9, 2023.
[4] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. In NAACL.
[5] Nathaniel Ayewah, William Pugh, David Hovemeyer, J. David Morgenthaler, and John Penix. 2008. Using Static Analysis to Find Bugs. IEEE Software 25, 5 (2008), 22–29.
[6] Patrick Bareiß, Beatriz Souza, Marcelo d'Amorim, and Michael Pradel. 2022. Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code. arXiv (2022).
[7] Moritz Beller, Radjino Bholanath, Shane McIntosh, and Andy Zaidman. 2016. Analyzing the State of Static Analysis: A Large-Scale Evaluation in Open Source Software. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Vol. 1. 470–481.
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In NeurIPS.
[9] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from Large Language Models. In USENIX Security Symposium.
[10] George Chatzieleftheriou and Panagiotis Katsaros. 2011. Test-Driving Static Analysis Tools in Search of C Code Vulnerabilities. In 2011 IEEE 35th Annual Computer Software and Applications Conference Workshops. 96–103. https://doi.org/10.1109/COMPSACW.2011.26
[11] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, David W. Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William H. Guss, Alex Nichol, Igor Babuschkin, S. Arun Balaji, Shantanu Jain, Andrew Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew M. Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv (2021).
[12] Maria Christakis and Christian Bird. 2016. What Developers Want and Need from Program Analysis: An Empirical Study. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE '16). Association for Computing Machinery, New York, NY, USA, 332–343. https://doi.org/10.1145/2970276.2970347
[13] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv (2022).
[14] Aditya Deshpande, Jyoti Aneja, Liwei Wang, Alexander G Schwing, and David Forsyth. 2019. Fast, Diverse and Accurate Image Captioning Guided By Part-of-Speech. In CVPR.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.
[16] Thomas Dohmke. 2022. GitHub Copilot is generally available to all developers. (June 2022). https://github.blog/2022-06-21-github-copilot-is-generally-available-to-all-developers/, as of February 9, 2023.
[17] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In EMNLP.
[18] Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. AFL++: Combining Incremental Steps of Fuzzing Research. In USENIX Workshop on Offensive Technologies (WOOT).
[19] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model inversion attacks that exploit confidence information and basic countermeasures. In ACM CCS.
[20] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. InCoder: A generative model for code infilling and synthesis. arXiv (2022).
[21] Anjana Gosain and Ganga Sharma. 2015. Static analysis: A survey of techniques and tools. In Intelligent Computing and Applications. Springer, 581–591.
[22] Katerina Goseva-Popstojanova and Andrei Perhinschi. 2015. On the capability of static code analysis to detect security vulnerabilities. Information and Software Technology 68 (2015), 18–33.
[23] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In ICLR.
[24] Saki Imai. 2022. Is GitHub copilot a substitute for human pair-programming? An empirical study. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 319–321.
[25] GitHub Inc. 2022. GitHub CodeQL. (2022). https://codeql.github.com/
[26] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-level code generation with AlphaCode. Science 378, 6624 (2022), 1092–1097. https://doi.org/10.1126/science.abq1158
[27] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In ACL.
[28] Aravindh Mahendran and Andrea Vedaldi. 2021. Understanding deep image representations by inverting them. In CVPR.
[29] Spyridon Mouselinos, Mateusz Malinowski, and Henryk Michalewski. 2022. A Simple, Yet Effective Approach to Finding Biases in Code Generation. arXiv preprint arXiv:2211.00609 (2022).
[30] Yuta Nakamura, Shouhei Hanaoka, Yukihiro Nomura, Naoto Hayashi, Osamu Abe, Shuntaro Yada, Shoko Wakamiya, and Eiji Aramaki. 2020. KART: Privacy leakage framework of language models pre-trained with clinical records. arXiv (2020).
[31] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. arXiv (2022).
[32] OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. (Nov. 2022). https://openai.com/blog/chatgpt/, as of February 9, 2023.
[33] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv (2022).
[34] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. In IEEE Symposium on Security and Privacy (S&P). https://doi.org/10.1109/SP46214.2022.9833571
[35] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2022. Examining Zero-Shot Vulnerability Repair with Large Language Models. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 1–18.
[36] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR (2020).
[37] Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Brendan Dolan-Gavitt, and Siddharth Garg. 2022. Security Implications of Large Language Model Code Assistants: A User Study. arXiv preprint arXiv:2208.09727 (2022).
[38] Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. 2016. SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In IEEE Symposium on Security and Privacy (SP).
[39] Mohammed Latif Siddiq, Shafayat Hossain Majumder, Maisha Rahman Mim, Sourov Jajodia, and Joanna CS Santos. 2022. An Empirical Study of Code Smells in Transformer-based Code Generation Techniques. In SCAM.
[40] Mohammed Latif Siddiq and Joanna CS Santos. 2022. SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques. In MSR4P&S.
[41] László Szekeres, Mathias Payer, Tao Wei, and Dawn Song. 2013. SoK: Eternal War in Memory. In IEEE Symposium on Security and Privacy.
[42] Parth Thakkar. 2022. Copilot Internals. (2022). https://thakkarparth007.github.io/copilot-explorer/posts/copilot-internals, as of February 9, 2023.
[43] Kuan-Chieh Wang, Yan Fu, Ke Li, Ashish Khisti, Richard Zemel, and Alireza Makhzani. 2021. Variational Model Inversion Attacks. In NeurIPS.
[44] Liwei Wang, Alexander Schwing, and Svetlana Lazebnik. 2017. Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space. In NeurIPS.
[45] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In EMNLP.
[46] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A Systematic Evaluation of Large Language Models of Code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS 2022). Association for Computing Machinery, New York, NY, USA, 1–10. https://doi.org/10.1145/3520312.3534862
Systematically Finding Security Vulnerabilities in Black-Box Code Generation Models
[47] Hongxu Yin, Pavlo Molchanov, Jose M. Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K. Jha, and Jan Kautz. 2020. Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion. In CVPR.
[48] Li Yujian and Liu Bo. 2007. A normalized Levenshtein distance metric. TPAMI (2007).
[49] Ruisi Zhang, Seira Hidano, and Farinaz Koushanfar. 2022. Text Revealer: Private Text Reconstruction via Model Inversion Attacks against Transformers. arXiv (2022).
[50] Shuyin Zhao. 2022. GitHub Copilot is generally available for businesses. (Dec. 2022). https://github.blog/2022-12-07-github-copilot-is-generally-available-for-businesses/, as of February 9, 2023.

A SECURITY VULNERABILITY RESULTS AFTER FUZZY CODE DE-DUPLICATION
We employ the TheFuzz [3] Python library to find near-duplicate codes. This library uses the Levenshtein distance to calculate the differences between sequences [48] and outputs the similarity ratio of two strings as a number between 0 and 100. We consider two codes duplicates if they have a similarity ratio greater than 80. Figure 12 provides the results of our FS-Code approach in finding vulnerable codes that could be generated by the CodeGen and Codex models; these results follow the setting of Section 5.2.2. Here, we also observe a general, almost-linear growth pattern across the different vulnerability types generated by the CodeGen and Codex models.
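The de-duplication step described above can be sketched with a dependency-free stand-in for TheFuzz. The helper names `similarity` and `deduplicate` are ours, and this normalized Levenshtein similarity only approximates TheFuzz's `fuzz.ratio`, so exact scores may differ slightly:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> int:
    # Normalized similarity on a 0-100 scale, mirroring the library's
    # output range (approximation of TheFuzz's fuzz.ratio).
    if not a and not b:
        return 100
    return round(100 * (1 - levenshtein(a, b) / max(len(a), len(b))))

def deduplicate(codes, threshold=80):
    # Keep a code only if it is not more than `threshold` similar
    # to any already-kept code (the paper's threshold is 80).
    kept = []
    for code in codes:
        if all(similarity(code, k) <= threshold for k in kept):
            kept.append(code)
    return kept
```

With the threshold of 80 used above, two generated snippets that differ only in whitespace or identifier spelling collapse into a single representative, while structurally different codes are both retained.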
Hossein Hajipour, Thorsten Holz, Lea Schönherr, Mario Fritz
Figure 12: The number of discovered vulnerable codes versus the number of sampled codes generated by (a) CodeGen and (b) Codex. The non-secure prompts and codes are generated using our FS-Code method. While Figure 7 already removed exact matches, here we use fuzzy matching for code de-duplication.