Security Response

Portable Document Format Malware
Kazumasa Itabashi

Introduction
Approximately two years ago a vulnerability in Adobe Reader’s JavaScript API was discovered, and malware authors continue to produce malicious PDF files that exploit this flaw. This vulnerability has been patched, though a number of other vulnerabilities have been found and used in active exploits before being patched themselves. There are numerous reasons why malware authors might use vulnerabilities in Adobe Reader and Acrobat as an attack vector. First, the PDF format is widely used throughout the world for sharing documents, and Adobe Reader is the most popular PDF viewer; many OEMs ship PCs with the software preinstalled. Second, the PDF file format specification and the properties of the viewer allow malware authors a significant degree of freedom when designing and developing a threat. Third, the nature of the PDF format provides malware authors with some useful tricks that help to avoid detection by AV scanners, and the support for JavaScript further extends this capability. Obfuscation, encryption, and misdirection are techniques often employed in a similar manner to how they may be seen in HTML and other environments that support JavaScript. This paper aims to detail the different paths malware authors have taken and point out how attack techniques via PDF have evolved. It is hoped that it will aid AV vendors and PC users alike in better understanding the problems posed by malicious PDFs, as well as the importance of staying up-to-date with patches.

Contents
Introduction
Background Information
Obfuscation Using Features of JavaScript
Obfuscation Using Features of PDF Format
Encryption
JavaScript Features Unique to PDF
Conclusion
Bibliography

Background Information
The first JavaScript-based PDF malware came to light in February 2008. A vulnerability in one of Adobe’s JavaScript API functions, collectEmailInfo(), was discovered and used in conjunction with a heap-spray attack. The malicious code copied shellcode to heap memory and subsequently called the vulnerable API, thus exploiting the vulnerability. This was fairly simple JavaScript in and of itself. In November 2008, nine months later, another vulnerability was found, this time in the printf() API. Several further vulnerabilities emerged over time, and to date malware exploiting vulnerabilities in the following functions has been found:
• collectEmailInfo()
• printf()
• getIcon()
• customDictionaryOpen()
• getAnnots()
• newPlayer()

The exploits are very similar to those that might be found in attacks on Web browsers. JavaScript-based malware is typically used to trigger drive-by downloads on the Web and cause further malware to be downloaded on to users’ computers. The downloaded malware may be categorized as a Trojan horse, Backdoor, or Infostealer, for example. In order to perform these kinds of exploits, malware authors must periodically update the JavaScript used or risk detection and hence the failure of the drive-by download. The aims of the exploits and the update-release cycle are echoed in the PDF world, with AV vendors forced to be aware of the ever-changing face of PDF-based malware. Unfortunately both Web- and PDF-based attacks continue because of the myriad methods that may be used to obfuscate the code and hence evade detection by security software.

Obfuscation Using Features of JavaScript
Most JavaScript can be easily obfuscated courtesy of features of the language. While there are legitimate uses of code obfuscation, the same techniques may be used to craft malicious code that evades detection. Web-based JavaScript attacks commonly make use of HTML features to obfuscate the code effectively. By using <iframe> or <script> tags, for example, malware authors can make it more difficult for AV products to detect the malicious code. Similar techniques can be used within PDF files, and as such the following sections detail some ways in which malicious JavaScript may be obfuscated by malware authors.

Simple Obfuscation
Even beginner or unsophisticated JavaScript programmers can make use of simple string obfuscation techniques.

Split Strings
Dynamic string manipulation is as easy in JavaScript as it is in other interpreted script-based languages. Often, strings (or strings and numbers) may be concatenated using the “+” character. While this is useful for the easy manipulation of textual content, it also makes it easy for malware authors to create simply obfuscated code. Figure 1 shows a shellcode block that has been split into many shorter strings, some of which are defined as variables. The string literals and variables are concatenated and evaluated as arguments to the unescape() function, which transforms the string into the final binary form. Although it is broken down into substring “components” it is evaluated as a single string. For an AV scanner to detect this shellcode it must include both a lexical and a structural parser.

A string can also be used to call an object method by means of the bracket notation. Figure 2 shows an example of this.

Regular Expressions
JavaScript supports regular expressions as a built-in language feature. The JavaScript object used to represent regular expressions is called RegExp. Regular expressions can be used for pattern-matching and programmatic text manipulation, for example detecting line breaks or validating input characters.
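The split-string and bracket-notation techniques described above can be sketched with a harmless payload. The following is an illustrative example only, not code taken from any sample: the escaped text decodes to a greeting rather than shellcode, and foldLiterals() is a hypothetical helper showing one simple way a scanner might rejoin adjacent string literals before inspecting them.

```javascript
// Benign stand-in for the kind of split string shown in figure 1: the
// escaped text is cut into fragments, some of which are held in variables.
const partA = "%48%65";
const partB = "%6c%6c";
const encoded = partA + partB + "%6f" + "%20" + "%57%6f" + "%72%6c%64";
const decoded = unescape(encoded);

// Bracket notation: the method name is itself an assembled string.
const methodName = "to" + "Upper" + "Case";
const shouted = decoded[methodName]();

// Hypothetical analyst-side helper: merge adjacent double-quoted literals
// joined by "+" until the source text stops changing.
function foldLiterals(source) {
  const pair = /"([^"\\]*)"\s*\+\s*"([^"\\]*)"/;
  let previous;
  do {
    previous = source;
    source = source.replace(pair, '"$1$2"');
  } while (source !== previous);
  return source;
}

const folded = foldLiterals('unescape("%48" + "%65" + "%6c%6c" + "%6f")');
console.log(decoded, shouted, folded);
```

The helper only handles literal-to-literal concatenation. Fragments held in variables, as in the first half of the sketch, would require the structural parsing that the text mentions.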

Figure 1 Simple string concatenation

Figure 2 Property access using the bracket notation

Regular expressions are an effective method of string obfuscation used by malware authors. The characters that comprise the string to be obfuscated can be “scattered” throughout a longer string and retrieved using a regular expression when they are to be used. Figure 3 shows this technique in use. Each instance of l, k, u and d in the obfuscated string is replaced with the % character. This is a simple example in which a single character was added to each 2-byte hexadecimal number. The technique can be made more complex, however, for example by adding more characters, using a more complex sequence, or by replacing parts of each hexadecimal number.

Figure 3 Obfuscation using a regular expression

The use of regular expressions can yield more complex obfuscation than simple split strings, as in the expression "%25%34%35%3Z%3Z%3Z%66".replace(new RegExp(/Z/g), "0"), yielding %25%34%35%30%30%30%66. This string is then evaluated using the unescape() function, giving the final result of %45000f.

The eval Function
JavaScript provides a global function called eval() that may be used to evaluate a string as though it were an expression. A legitimate use of this function is when dynamically generated code is to be used. This function is, however, one of the most effective ways through which malware authors can produce obfuscated code. The following two JavaScript statements produce the same result, a message box that displays the text “Hello World”:
• app.alert("Hello World");
• eval('app.alert("Hello World");');

The function also tends to be used to hide strings in conjunction with other techniques. In conjunction with the use of split strings and regular expressions most recognizable JavaScript code can be obfuscated, producing results similar to the example in figure 4.
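The regular-expression substitution quoted above can be reproduced directly. The first half of this sketch uses the exact expression from the text; the second half is a hypothetical stand-in for the lost figure 3, in which several filler letters all map to “%” and the hidden text is a harmless greeting.

```javascript
// The expression quoted in the text: every "Z" stands in for "0".
const scattered = "%25%34%35%3Z%3Z%3Z%66";
const restored = scattered.replace(new RegExp("Z", "g"), "0");
const unescaped = unescape(restored);

// Hypothetical stand-in for figure 3: the letters l, k, u and d are all
// replaced with "%", revealing an ordinary escaped string.
const noisy = "l48k65u6cd6cl6f";
const cleaned = noisy.replace(/[lkud]/g, "%");
const text = unescape(cleaned);

console.log(restored, unescaped, cleaned, text);
```

A scanner that matches only on literal "%xx" sequences would miss both inputs, which is why the substitution has to be modelled before the string can be examined.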

Figure 4 How many eval()s?

One downside to using the eval() technique for code obfuscation is that most malcode researchers would likely begin their search for malicious code by looking for this keyword. In the PDF format, however, alternatives to the eval() function are available. The function app.setTimeOut(statement, timeout) executes the statement given as its first argument after the time (in milliseconds) given as its second. In the Web-based world, the first argument must be a reference to a function but the PDF format allows any code to be specified. This allows for further obfuscation.

Figure 5 shows an example of a split eval(). Arrays are evaluated from left to right [2] so the array is evaluated last and the statement is equivalent to qkgd=("yeid", … ,"ngir")["eval"], which can be evaluated as method calling using the bracket notation. The string is the final element of the array whose first element is "oibj". This means that the variable qkgd is equivalent to eval and can be called as such.

Figure 5 eval() in an array

An alternative method is to use a numeric representation to produce the desired string. Figure 6 shows how this can be achieved. Following the addition, the variable ikhircrro has the value 693741. The next line converts this to a string by treating it as a radix-36 (i.e. 7 + 29) representation:

693741 = 14 × 36³ + 31 × 36² + 10 × 36 + 21

Figure 6 Numeric eval()
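The radix-36 arithmetic above can be checked in a few lines. This is a sketch of the idea rather than a reproduction of the sample in figure 6: the variable names are illustrative, and the lookup on globalThis is an assumption standing in for whatever object the original code indexed.

```javascript
const value = 693741;

// Horner evaluation of the base-36 digits 14, 31, 10 and 21.
const recomputed = [14, 31, 10, 21].reduce((total, digit) => total * 36 + digit, 0);

// Number.prototype.toString(36) uses 0-9 followed by a-z as digit symbols,
// so digit 14 is "e", 31 is "v", 10 is "a" and 21 is "l".
const name = value.toString(36);

// Bracket notation then turns the recovered name into a callable reference.
const resolved = globalThis[name];

console.log(recomputed === value, name, typeof resolved);
```

No occurrence of the keyword appears in the source text, so a search for it finds nothing even though the function is reachable at run time.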

If radix-36 is used to represent from 0 to 9 and A to Z inclusive, 14 is “e”, 31 is “v”, 10 is “a” and 21 is “l”, and thus the variable lfbhmy represents “eval”. This mode of operation is unlikely to appear in non-malicious code.

The unescape() function
As previously mentioned, unescape() can decode from a hexadecimal representation to raw binary data. The unescape() function is able to deal with strings that decode to non-ASCII values and therefore is commonly used in heap-spray attacks, but it can also be used as a method of obfuscation when the decoded results are another ASCII string.

eval(), unescape(), and replace()
Malware authors often use the unescape() function in conjunction with the replace() method to obfuscate code, as in the example in figure 7, an example discovered in October 2008.

Figure 7

Obfuscation Using Packers
The eval() function is commonly used by packer designers. PDF-based malware makes use of many kinds of packer. Following operations to inflate, deflate, encrypt, or multiplex a combination of different transformations, the original JavaScript is represented in a different form and with a different code size.

Base64
Base64 encoding [3] is used in numerous places on the Internet to represent arbitrary binary data using only the US-ASCII character set. Malware authors have produced a JavaScript Base64 decode implementation in order to decode Base64 representations of malicious code on the fly, as seen in figure 8.

Figure 8 Implementation of Base64
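A hand-written Base64 decoder of the kind described above is short. The sketch below is an independent illustration, not the code from figure 8; the function name, parameters, and the reversed table used in the second call are all assumptions. The index table is passed as a parameter because a packer is free to use a table other than the standard one.

```javascript
const STANDARD_TABLE =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Decode Base64 text using the supplied 64-character index table.
function decodeBase64(text, table = STANDARD_TABLE, pad = "=") {
  let bits = 0;
  let bitCount = 0;
  let out = "";
  for (const ch of text) {
    if (ch === pad) break;              // padding marks the end of the data
    const index = table.indexOf(ch);
    if (index === -1) continue;         // skip whitespace and unknown symbols
    bits = (bits << 6) | index;
    bitCount += 6;
    if (bitCount >= 8) {
      bitCount -= 8;
      out += String.fromCharCode((bits >> bitCount) & 0xff);
      bits &= (1 << bitCount) - 1;      // keep only the bits not yet emitted
    }
  }
  return out;
}

// Standard table.
const plain = decodeBase64("SGVsbG8gV29ybGQ=");

// Illustrative non-standard table: the standard one reversed. The same data
// re-expressed in that table no longer decodes with a stock decoder.
const reordered = STANDARD_TABLE.split("").reverse().join("");
const disguised = [..."SGVsbG8gV29ybGQ="]
  .map((ch) => (ch === "=" ? ch : reordered[STANDARD_TABLE.indexOf(ch)]))
  .join("");
const recovered = decodeBase64(disguised, reordered);

console.log(plain, disguised, recovered);
```

For a scanner, the practical consequence is that the table has to be recovered from the decoder itself before the payload can be decoded.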

The original Base64 specification has a fixed index table that includes the alphabet, numerical digits, and the characters “+”, “/” and “=”. The order of the index table is fixed as a result of the need to be able to encode and decode consistently. Malware authors, however, often alter these standards in their implementations, either reordering the index table or changing the characters. Packers also exist that use Base64 in conjunction with XOR or ADD operations.

RC4 Encryption
RC4, a powerful stream cipher, is one of the methods of encryption that has been used within packers. One such packer has been discovered in use within a sample of PDF malware. Operating on a previously encrypted block of ciphertext, the decryption code uses the RC4 decryption algorithm to decrypt and subsequently execute the malicious code body. An example implementation of RC4 appears in figure 9; this example was first discovered in March 2009. Note that the decryption key must appear in the decryption code, a potential area of weakness.

Figure 9 JavaScript RC4 implementation

An Example Packer from Neosploit
There are many toolkits available to perform Web browser exploits using JavaScript, and the toolkits commonly use packers to prevent their code from being detected by AV scanners. Example code from the Neosploit packer appears in figure 10. This code was first discovered in January 2009. The packer uses a fairly simple substitution encrypt/decrypt algorithm but uses a method of key generation that had not previously been seen. While most packers include the decryption key within the code in plaintext, the Neosploit packer generates the key from the decryption function itself using arguments.callee in order to complicate the process of analysis.
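For reference, the RC4 algorithm mentioned above fits in a few lines of JavaScript. This is a generic textbook sketch, not the code from figure 9; the function names are illustrative, and the key and plaintext are the widely published "Key" / "Plaintext" test vector rather than anything taken from a sample.

```javascript
// RC4 over "binary strings" (one character per byte). Encryption and
// decryption are the same operation.
function rc4(key, data) {
  // Key-scheduling algorithm: permute the state array using the key.
  const state = Array.from({ length: 256 }, (_, index) => index);
  let j = 0;
  for (let i = 0; i < 256; i++) {
    j = (j + state[i] + key.charCodeAt(i % key.length)) & 0xff;
    [state[i], state[j]] = [state[j], state[i]];
  }
  // Pseudo-random generation: XOR each data byte with a keystream byte.
  let a = 0;
  let b = 0;
  let out = "";
  for (let n = 0; n < data.length; n++) {
    a = (a + 1) & 0xff;
    b = (b + state[a]) & 0xff;
    [state[a], state[b]] = [state[b], state[a]];
    const keystream = state[(state[a] + state[b]) & 0xff];
    out += String.fromCharCode(data.charCodeAt(n) ^ keystream);
  }
  return out;
}

// Render a binary string as lowercase hexadecimal.
function toHex(binary) {
  return Array.from(binary, (ch) => ch.charCodeAt(0).toString(16).padStart(2, "0")).join("");
}

const ciphertext = rc4("Key", "Plaintext");
const roundTrip = rc4("Key", ciphertext);
console.log(toHex(ciphertext), roundTrip);
```

Because the cipher is symmetric, an analyst who locates the key in the decryption code can recover the packed body with the same routine.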

Figure 10 Example code from Neosploit packer

Overpacking using Multiple Packers
There are currently over 30 known types of JavaScript packer. There are many examples of JavaScript code that has been packed multiple times using different packers.

Obfuscation Using Features of the PDF Format
The previous section outlined various methods of obfuscating JavaScript using packers. Malware authors may also take advantage of PDF file format features in order to obfuscate malicious code. These methods will be outlined in this section.

Using the File Header
Many file formats make use of a file header or “magic number” to identify the file type, which usually is simply a few bytes at the beginning of the file. Windows executables, for example, use “MZ”, bitmap files have “BM”, and so on. Although PDF files commonly have “%PDF” at the beginning of the file, this need not always be the case. The PDF specification contains the following description:

The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7. [1]

This statement can be seen to be somewhat ambiguous. Some samples of PDF malware have been observed to have malformed file headers in which, for example, “%PDF” is not at the beginning of the file:

• “%PDF” appears on the third line, not the first.
• “%PDF” appears following “MZ”, similar to a Windows executable.

The “%” character delimits the beginning of a comment in a PDF file. Adobe Reader is able to load and parse the contents of files even if the first four bytes are not “%PDF”, and additionally copes with files whose first byte is not “%”. This flexibility creates the possibility of a performance issue for AV vendors: merely scanning the first four bytes is not sufficient to identify a file as a PDF that may be opened using Adobe Reader.

Cross-reference Table
Many legitimate PDF files contain cross-references. The specification contains the following text:

The cross-reference table contains information that permits random access to indirect objects within the file so that the entire file need not be read to locate any particular object.

It was thought that random access to PDF objects did not present a problem for AV scanners but the discovery of malicious PDFs with invalid offsets changed this perception. While PDF readers can parse PDF files from the beginning even if invalid offsets are present, this is a problem for those developing AV scanners, as it may be necessary to parse an entire file to ascertain whether or not it contains malicious code.

Stream Filters
In order to encapsulate large objects such as images, font data, and so on, the PDF format supports the inclusion of stream data with encoding and/or compression, as seen in figure 11.

Figure 11 PDF stream encoding and compression

To date, PDF malware has been found using the following filters to hide malicious JavaScript:
• ASCIIHexDecode
• ASCII85Decode
• LZWDecode
• FlateDecode
• RunLengthDecode
• JBIG2Decode

Thus, six of the ten standard filters defined in the PDF specification have been used for malicious purposes.
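Returning to the file header discussion earlier in this section, a scanner that tolerates a displaced header might search a window at the start of the file instead of comparing the first four bytes. The sketch below is illustrative only. The function name is hypothetical, and the 1024-byte window is an assumption taken from Adobe's published implementation notes for its viewers, not from the samples described in this paper.

```javascript
// Look for "%PDF-" anywhere in the first `windowSize` bytes of a file that
// has been read as a binary string (one character per byte).
function findPdfHeader(bytes, windowSize = 1024) {
  const head = bytes.slice(0, windowSize);
  const offset = head.indexOf("%PDF-");
  if (offset === -1) return null;
  const match = /^%PDF-(\d)\.(\d)/.exec(head.slice(offset));
  return { offset, version: match ? `${match[1]}.${match[2]}` : null };
}

const wellFormed = findPdfHeader("%PDF-1.4\n1 0 obj\n");
const thirdLine = findPdfHeader("\n\n%PDF-1.6\n1 0 obj\n");
const afterMz = findPdfHeader("MZ\x90\x00%PDF-1.3\n1 0 obj\n");
const notPdf = findPdfHeader("GIF89a");

console.log(wellFormed, thirdLine, afterMz, notPdf);
```

The trade-off is the performance cost noted above: every file must have its opening window searched, not just those whose first bytes match.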

Combining Filters
Initially only malware that made use of one filter type was found. In March 2010, malware that made use of multiple PDF filters was discovered. Figure 12 shows an example of the use of multiple stream filters, in which decompress and decode filters were used in conjunction. This raised the bar for AV vendors who found themselves requiring scanners that could decompress/decode all filters in order to scan the content of the streams.

Figure 12 Multiple stream filters

Stream Length
Each stream includes a length field that holds the number of bytes that comprise the stream. As detailed in the PDF specification, PDF reader applications may read the stream contents based on this field, although if it is missing they may use the stream and endstream keywords that mark the beginning and end of a stream respectively. Adobe Reader is able to read stream content even if the length field is incorrect. Malicious PDFs often have invalid length values, although this may not be intentional on the malware authors’ parts; many such files are dynamically produced using server-side polymorphic techniques when PDF vulnerabilities are targeted for drive-by downloads. The malicious JavaScript is modified whereas the other PDF content is not, which results in an invalid length field.

Figure 13 is an example taken from a malicious PDF discovered in April 2008. An invalid length value of 0000 can be seen. The stream contained malicious JavaScript and was compressed using zlib. The FlateDecode filter decompresses data that has been compressed using the zlib/deflate algorithm. This method can both shorten file length and hide JavaScript content, and minor changes to the code will result in major differences in the compressed version.

Figure 13 Invalid “length” value

Endstream or Endstrebm?
As described above, stream and endstream respectively mark the beginning and end of a stream. In September 2008 a sample was found that used endstrebm instead of endstream. Perhaps surprisingly Adobe Reader was able to recognize the stream data and perform the decompression before falling foul of the exploit contained within.

Case Sensitivity
According to the PDF specification, all entries such as “/Type” or “/Action” should be identified in a case-sensitive manner:

PDF is case-sensitive; corresponding uppercase and lowercase letters shall be considered distinct.
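The filter chaining described under Combining Filters can be sketched with two of the simpler filters. This is a minimal illustration under stated assumptions: the function names and the FILTERS table are hypothetical, only ASCIIHexDecode and RunLengthDecode are implemented (following their definitions in the PDF specification), and the stream content is a harmless string, not data from figure 12.

```javascript
// ASCIIHexDecode: pairs of hex digits, whitespace ignored, ">" ends the data.
function asciiHexDecode(bytes) {
  const text = String.fromCharCode(...bytes);
  const hex = text.split(">")[0].replace(/\s+/g, "");
  const padded = hex.length % 2 ? hex + "0" : hex; // odd final digit implies a trailing 0
  const out = [];
  for (let i = 0; i < padded.length; i += 2) {
    out.push(parseInt(padded.slice(i, i + 2), 16));
  }
  return out;
}

// RunLengthDecode: a length byte of 0-127 copies the next length+1 bytes,
// 129-255 repeats the next byte 257-length times, and 128 ends the data.
function runLengthDecode(bytes) {
  const out = [];
  let i = 0;
  while (i < bytes.length) {
    const length = bytes[i++];
    if (length === 128) break;
    if (length < 128) {
      for (let n = 0; n <= length && i < bytes.length; n++) out.push(bytes[i++]);
    } else {
      if (i >= bytes.length) break;
      const value = bytes[i++];
      for (let n = 0; n < 257 - length; n++) out.push(value);
    }
  }
  return out;
}

// Hypothetical dispatch table keyed by filter name.
const FILTERS = { ASCIIHexDecode: asciiHexDecode, RunLengthDecode: runLengthDecode };

// Apply the filters in the order in which they are listed.
function applyFilters(names, bytes) {
  return names.reduce((data, name) => {
    const filter = FILTERS[name.replace(/^\//, "")];
    if (!filter) throw new Error("unsupported filter: " + name);
    return filter(data);
  }, bytes);
}

const raw = Array.from("FD61 0148 6980>", (ch) => ch.charCodeAt(0));
const decodedBytes = applyFilters(["/ASCIIHexDecode", "/RunLengthDecode"], raw);
const decodedText = String.fromCharCode(...decodedBytes);
console.log(decodedText);
```

A production scanner would need every standard filter, including FlateDecode and LZWDecode, since an unsupported filter anywhere in the chain leaves the stream content unreadable.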

In January 2010, some samples were discovered in which entries such as “/Type” appeared in a seemingly random mixture of upper and lower case characters. Figure 14 shows an example taken from one such sample. Any PDF parser that, in accordance with the spec, identified these entries in a case-sensitive manner required an update to use case-insensitive identification.

Figure 14 Variations in case

Encryption
The PDF format has supported encryption since version 1.1, which provides document creators control over their work. Either the RC4 or AES algorithm may be used, and two forms of password are available: user password and owner password. User passwords are mainly used as a way to prevent PDF content from being displayed, whereas owner passwords are used to prevent content modification. Users must enter the correct password to perform either operation. Most PDF files have no user password so they can be opened and read by anyone, but when a file that has a non-blank user password is opened the PDF reader will display a dialog box to allow the password to be entered.

Having the ability to encrypt PDFs would initially seem to be something that could be leveraged by malware authors to evade detection by AV scanners, but most malicious PDFs are served via the Web and aim to download files without users’ knowledge of what is happening. With stealth being of primary importance, malware authors typically only have two options: a plain PDF file or an encrypted PDF file with no password. The latter option leaves AV vendors at a disadvantage performance-wise as any such PDF files must be decrypted before the content is scanned.

RC4 and AES
When a PDF is to be encrypted using RC4 or AES, the encryption key is constructed using the following parameters:
• 32-byte string based on user password
• 32-byte string based on owner password
• User access permission flag
• Document ID
• Object ID
• Generation number

Within a PDF, keys differ between objects because object ID and generation number are used for key generation. All strings and/or streams in the same object are encrypted using the same key. Once a PDF is encrypted, all strings and streams in the file will be in ciphertext.

Password Validation Errors
Both user and owner passwords are represented as 32-byte strings in the U(ser) and O(wner) dictionaries respectively. Empty strings are acceptable passwords and as such an encrypted PDF with an empty password string may be displayed in any reader. Malicious PDFs tend to be produced with empty owner passwords, which does not affect the display of a file or its potential ability to deliver an exploit. This encryption does, however, affect how a file may open in an application that allows PDFs to be modified and also necessitates decryption when analysis is required.
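The way object ID and generation number enter the key, as described under RC4 and AES, can be sketched as follows. This example assumes the per-object derivation given in the PDF specification for the RC4 and AES-128 security handlers (the file key is extended with the low-order bytes of the object and generation numbers, hashed with MD5, and truncated). It also assumes a Node.js environment for the MD5 call; objectKey and the sample file key are illustrative names and values.

```javascript
const nodeCrypto = require("crypto");

// Derive the key used for one indirect object from the document-wide file key.
// The file key itself comes from the password strings, permission flags and
// document ID, and is taken as given here.
function objectKey(fileKey, objectNumber, generation, useAes = false) {
  const suffix = Buffer.from([
    objectNumber & 0xff,          // low-order 3 bytes of the object number,
    (objectNumber >> 8) & 0xff,   // least significant byte first
    (objectNumber >> 16) & 0xff,
    generation & 0xff,            // low-order 2 bytes of the generation number
    (generation >> 8) & 0xff,
  ]);
  const parts = [fileKey, suffix];
  if (useAes) parts.push(Buffer.from("sAlT", "latin1")); // extra constant for AES
  const digest = nodeCrypto.createHash("md5").update(Buffer.concat(parts)).digest();
  return digest.subarray(0, Math.min(fileKey.length + 5, 16));
}

const fileKey = Buffer.from("0102030405", "hex"); // illustrative 40-bit file key
const keyFor35 = objectKey(fileKey, 35, 0);
const keyFor36 = objectKey(fileKey, 36, 0);
console.log(keyFor35.toString("hex"), keyFor36.toString("hex"));
```

Since each object has its own key, a scanner cannot decrypt one stream and reuse the key for the next; the derivation has to be repeated for every object it inspects.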

Fragmental JavaScript
A JavaScript object in a PDF file can be split up or fragmented using the name dictionary. The original functionality of the name dictionary was to allow an object to be referred to by name rather than by object reference. A number of different object types can be referred to in this way. An entry can be set as follows:

/Names [ (name1) 35 0 R (name2) 36 0 R …]

The above text defines a reference to object 35, named as “name1”. The name dictionary can contain multiple entries, each of which defines a similar relationship; the name “name2” is also associated with object 36 in the example above. The name dictionary has the following functionality:

When the document is opened, all of the actions in this name tree shall be executed, defining JavaScript functions for use by other scripts in the document.

This description from the PDF specification describes the mechanism through which fragmental JavaScript objects can be executed when a PDF file is opened. JavaScript fragments must be gathered together and evaluated together. Each JavaScript fragment may additionally be compressed or encrypted, which means that AV scanners must perform the inverse of these operations in order to check for malicious content.

The first sample found that used this fragmental JavaScript technique was discovered in August 2009. An excerpt from the file appears in figure 15. Two JavaScript objects appear in the file, one to perform heap-spraying and deliver the exploit (figure 16) and the other to decrypt shellcode (figure 17).

Figure 15 Fragmental JavaScript
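The gathering step described above can be sketched from the scanner's side. This is a simplified illustration: parseNameArray() and gatherScripts() are hypothetical helpers, the regular expression handles only the flat form of the /Names array quoted in the text, and the object contents are harmless placeholders assumed to be already decompressed and decrypted.

```javascript
// Extract (name, object, generation) triples from a flat /Names array.
function parseNameArray(text) {
  const entries = [];
  const pattern = /\(([^)]*)\)\s+(\d+)\s+(\d+)\s+R/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    entries.push({
      name: match[1],
      object: Number(match[2]),
      generation: Number(match[3]),
    });
  }
  return entries;
}

// Concatenate the referenced fragments in name-tree order so that they can be
// scanned as a single script.
function gatherScripts(entries, objects) {
  return entries.map((entry) => objects.get(entry.object) || "").join("\n");
}

const entries = parseNameArray("/Names [ (name1) 35 0 R (name2) 36 0 R ]");

// Placeholder object bodies keyed by object number.
const objects = new Map([
  [35, "function greet() { return 'hi'; }"],
  [36, "greet();"],
]);

const combined = gatherScripts(entries, objects);
console.log(entries, combined);
```

Neither fragment is meaningful on its own, which is the point of the technique: the call in object 36 only makes sense once the definition in object 35 has been joined to it.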

Figure 16 Heap spray and exploit

Figure 17 Shell Decryption

JavaScript Features Unique to PDF
JavaScript that may be used within a PDF file has a number of unique features, such as the ability to make use of Acrobat forms for user input. This section details how such features may be used by a malware author.

Malicious JavaScript can be split up in a PDF file with the malicious code body being placed inside a PDF object or objects. This may be further encrypted JavaScript, shellcode, or any other malicious code. The JavaScript that makes use of this technique appears to be legitimate because the main malicious code body is not visible on the surface; only PDF object references are evident. In order to scan for malicious code, AV vendors must develop scanners that gather together all related objects and reconstruct them. This increases the time and amount of memory it takes to scan a particular file.

Use of this.getField()
The PDF JavaScript API has a built-in function called getField(), the main purpose of which is to retrieve data from the Field object of an individual widget, as in the following example:

var firstName = this.getField("Name.First").value;
app.alert("Your first name is " + firstName);

This example shows how JavaScript can retrieve user input from a text entry widget.

In November 2008 a sample was discovered that hides a segment of code in a Field object and later uses getField() to retrieve it. The example in figure 18 shows the use of getField() in the sample, in which getField() takes the string “data” as an argument.

Figure 18 Use of getField()

Figure 19 shows the target object referenced in the example above. The type of the object is “/Widget” and its text label is “data”, which matches the string used in the JavaScript object. The malicious JavaScript content exists in the “/DV” entry as a string, and clearly is a string of escaped characters. The hidden JavaScript code is packed as escaped characters but can be executed by using unescape() and eval(), as detailed earlier in this document (see figure 7).

Figure 19 Field widget

Use of app.doc.getAnnots()
The app.doc.getAnnots() function is built-in to the PDF JavaScript API and operates in a similar manner to getField(), outlined above. This function allows data to be retrieved from a ScreenAnnot object. An example of how this function may be used to hide malicious code appears in figure 20.

Figure 20 Use of app.doc.getAnnots()
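The getField() case described above suggests what a scanner has to do: find the widget whose label matches the name passed to getField(), pull out its “/DV” string, and decode it. The sketch below is a simplified illustration; extractFieldDefault() is a hypothetical helper, the widget dictionary is invented for the example, and the regular expressions ignore nested parentheses, escape sequences, hexadecimal strings, and indirect references.

```javascript
// Return the /DV string of a widget whose /T label equals fieldName,
// or null when the object does not match.
function extractFieldDefault(objectText, fieldName) {
  const label = /\/T\s*\(([^)]*)\)/.exec(objectText);
  if (!label || label[1] !== fieldName) return null;
  const value = /\/DV\s*\(([^)]*)\)/.exec(objectText);
  return value ? value[1] : null;
}

// Invented widget dictionary holding a harmless escaped string.
const widget = "<< /Type /Annot /Subtype /Widget /T (data) /DV (%48%65%6c%6c%6f) >>";

const hidden = extractFieldDefault(widget, "data");
const revealed = unescape(hidden);
const unrelated = extractFieldDefault(widget, "other");

console.log(hidden, revealed, unrelated);
```

The script object and the widget have to be examined together, since the escaped string sits in a form field and not in any JavaScript stream.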

The 6th object in the document, visible in figure 21, is a ScreenAnnot that contains a reference to a further 7th object.

Figure 21 ScreenAnnot object

Finally, the 7th object in the document, visible in figure 22, is a stream object that contains escaped malicious JavaScript.

Figure 22 Stream data referred to by screenAnnot object

Use of this.info.Producer and this.info.Title
The PDF Info object contains document meta-data such as the title, producer, and so on. Malware authors can use the document information dictionary to store hidden malicious JavaScript in a similar way to the methods detailed above.

A sample making use of this technique was discovered in November 2009. A 70448-byte string masquerading as the document title was present in the file; this string contained obfuscated and escaped JavaScript code. An excerpt from the code appears in figure 23. To execute the hidden code the threat retrieves the title string, replaces all instances of “j866p886a39” with “%” and then uses the now-familiar unescape() and eval() operations on the document “title”.

Figure 23 Use of this.info.title to hold malicious JavaScript

Conclusion
PDF-based malware can harbor malicious JavaScript in a similar manner to how it may exist on the Web, but the features and specification of the PDF file format mean that a number of additional tricks are available to the malware author. The cat-and-mouse game between AV vendors and malware authors continues; the complexity and flexibility of the PDF file format mean that malware authors are continually pushing the envelope and as such AV vendors must continue to improve and refine their PDF parsing technology.

The possibility of false positives exists as a result of toolkits that may be used to craft both legitimate and malicious PDFs alike. It is crucial for AV vendors to exercise caution when adding definitions so as to avoid the disruption that may be caused when a legitimate file is falsely convicted.

Some good news is that Adobe has introduced sandboxing functionality into Reader during 2010. This may help to contain even malware that uses new or previously unknown techniques. Sandboxing technology is not the perfect solution to all problems, however; the introduction of such sandbox technology may also bring with it new vulnerabilities to be exploited, and time will tell how successful such an approach may be. For now it is essential to keep software patches and virus definitions up-to-date and for antivirus vendors to strive to keep pace with the tricks and techniques deployed by the malware authors.

Bibliography
1. Adobe, “PDF Reference and Adobe Extensions to the PDF Specification,” http://www.adobe.com/devnet/pdf/pdf_reference.html
2. ECMA, “Standard ECMA-262 ECMAScript Language Specification”
3. RFC3548, “The Base16, Base32, and Base64 Data Encodings” http://tools.ietf.org/html/rfc3548

Any technical information that is made available by Symantec Corporation is the copyrighted work of Symantec Corporation and is owned by Symantec Corporation.

NO WARRANTY. The technical information is being delivered to you as is and Symantec Corporation makes no warranty as to its accuracy or use. Any use of the technical documentation or the information contained herein is at the risk of the user. Documentation may include technical or other inaccuracies or typographical errors. Symantec reserves the right to make changes without prior notice.

About the author
Kazumasa Itabashi is a Principal Software Engineer at Symantec Security Response in Tokyo specializing in PDF malware.

About Symantec
Symantec is a global leader in providing security, storage and systems management solutions to help businesses and consumers secure and manage their information. Headquartered in Mountain View, Calif., Symantec has operations in more than 40 countries. More information is available at www.symantec.com.

For specific country offices and contact numbers, please visit our Web site. For product information in the U.S., call toll-free 1 (800) 745 6054.

Symantec Corporation World Headquarters
350 Ellis Street
Mountain View, CA 94043 USA
+1 (650) 527-8000
www.symantec.com

Copyright © 2010 Symantec Corporation. All rights reserved. Symantec and the Symantec logo are trademarks or registered trademarks of Symantec Corporation or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners.
