PDF investigation with parser differentials and ontology
Denley Lam, Letitia Li, Cory Anderson
FAST Labs, BAE Systems
Arlington, VA, USA
{denley.lam, letitia.li, cory.s.anderson}@baesystems.com

Abstract—This paper describes the Verifiable Automatic Language Analysis and Recognition for Inputs (VALARIN) system to process, evaluate, and flag unsafe PDFs. The system extracts error features from a collection of PDF parsers, and organizes the different types of error messages by how they impact file safety. An ontology was designed to describe the relationships between parsers, error messages, safety, and PDF properties to support PDF-based malware classification efforts. Our domain-specific PDF ontology shows that PDF parsers exhibit mutual biases when recovering from specification ambiguities. Consensus on extracted error features among parsers had a direct relationship to the safety of the PDF. The PDF OWL ontology serves as a shareable method for information security and forensics efforts to highlight discrepancies and aid understanding by standardizing and describing the hierarchical relationship of diverse parsers, PDF structure, and validity.

Keywords—PDF, ontology, malicious, parsing, information security

Approved for Public Release; Distribution Unlimited.
Not export controlled per ES-FL-052522-0080.

I. INTRODUCTION
The Verifiable Automatic Language Analysis and Recognition for Inputs (VALARIN) system was designed for classifying PDFs under DARPA SafeDocs, a program to research the security of electronic documents, including Portable Document Format (PDF) files [1]. In 2018, 250 billion PDF documents were opened in Adobe products and 8 billion were signed documents [2]. PDF is the de facto standard for document exchange, has wide cross-platform support, and is widely trusted. Yet, in 2021, PDF was the second most common file format in web attacks and the third most common in email attacks [3], a major change from previous years, when Microsoft Office documents were the most common malicious file type. Typical malware campaigns start their attack with a phishing email containing a malicious PDF, luring the user with social engineering tactics to open the infected file [4]. From there, the compromised computer can infect connected machines in order to steal sensitive information.

The PDF specification has evolved tremendously from its early days as a print format to a web format. It now encompasses archiving, signing, and encryption, with the latest standard exceeding 1000 pages. Fig. 1 is a high-level static view of the PDF file structure, showing the header, body (containing displayed elements in a hierarchical tree), xref (containing references to objects in the body), and footer [5]. Additionally, dynamic actions with the use of JavaScript and event actions can cause denial of service, information disclosure, data manipulation, and code execution [6].

Fig. 1. PDF file format overview

Parser development is complicated by the complexity of the PDF standard: it relies on a lossy translation from a human-written specification, which is potentially open to interpretation, into code. Furthermore, without a reference implementation, numerous PDF parsers exist that share various lineages and take varying command-line arguments. Each release that adds new features is a source of ambiguities and errors. Over time, different parsers develop different approaches to implementing the specification, resulting in disagreement on what errors each parser finds within each PDF.

VALARIN leverages the parser differential produced by those ambiguities. We extract the return status, standard error, and output for each parser, and then merge the separate errors reported by individual parsers into a single comprehensive PDF ontology. Our ontology then provides a hierarchical framework for classifying reported errors during investigation of a potentially malicious PDF.

Our contributions are as follows: (1) an OWL ontology describing 15 PDF parsers, with generated errors labeled with an error classification; (2) VALARIN, a system to analyze parser differentials based on error reporting; and (3) successful verification of the safety of a PDF file based on parser differentials.

In this paper, we first present a known set of PDF flaws and security risks in Section II. Section III describes our parser pipeline architecture, which generates the software artifacts used to develop our ontology, and Section IV describes ontology generation. Section V describes our results on applying the ontology to classify a test set of PDFs, and Section VI describes two case studies of publicly available malicious files. Section VII presents the related work, and Section VIII concludes the paper and presents avenues for future work.


II. COMMON PDF PROBLEMS
Investigating a PDF is difficult because the format is complex and evolving, and independent parsers implement the specification differently. A lack of consensus exists among parsers on what it means to parse and validate successfully. These differences introduce unintended consequences that malicious actors can exploit. In this section, we describe six main types of PDF parsing errors, though the list is not exhaustive.

A. Redaction
Redaction is the act of removing sensitive information from documents. Typically, PDF redaction is accomplished by adding a black box to hide the text. However, PDFs contain additional metadata or layered data that cannot be removed with a simple black box. Multiple instances of bad redaction have been reported in the news, costing the companies millions [7] or leaking sensitive government information [8]. Redaction or sanitization is very software dependent. One cause of concern is that using two different software packages, one to create the document and another to redact it, may not be successful. In one reported example, CutePDF/LibreOffice prevented Adobe Reader from redacting the sensitive information [9]. VALARIN does not check for visual differences when displaying PDFs. Therefore, issues related to redaction are ignored by VALARIN, unless they generate implementation errors.

B. Shadow Attacks
Shadow Attacks are the opposite of redaction: instead of blocking sensitive information, the attacker deliberately falsifies information to trick the user. The attack leverages the concept of “view layers” [10]. By changing the layer displayed, the attacker shows the signed trusted layer, but manipulates the visible artifacts for nefarious purposes. Three variants of the attack exist: (1) hide, (2) replace, and (3) hide and replace [11]. Our ontology-based classifier detects attacks occurring when parsers report an error parsing the cross reference table and the PDF contains incremental updates, indicative of a shadow attack. Security researchers have continued to uncover layer flaws in the certification and encryption process related to PDF incremental updates [12]. See Section VI for a case study.

C. Hidden Data
Hidden Data is the superset of redaction and shadow attack. Underscoring its seriousness, the National Security Agency (NSA) identified 11 types of hidden data in PDFs that should be checked and removed before release, especially by security agencies [8]. Metadata embedded in documents, even after sanitization, may reveal sensitive information on agency staff and software used. One may need specialized tools to remove the data correctly [13].

D. Actions
PDFs contain JavaScript actions that can download malicious data or redirect to a URL. Examples of these calls are “getURL()” and “launchURL()”. These actions can be hidden by using “/OpenAction”, which is performed when the document is viewed. Additional complexity occurs when the JavaScript action takes advantage of a JavaScript vulnerability or Adobe Reader API vulnerability [14] [15].

E. CVEs
The PDF standard, ISO 32000-1:2008, runs to more than 700 pages [16]. PDF is a complex format and has a large attack surface. Over the course of a year, a security researcher discovered over 150 vulnerabilities across a broad range of PDF software, in particular Adobe. The top four surface areas containing vulnerabilities are PDF converter, JPEG2000, XML Forms Architecture (XFA), and rendering [17]. For instance, CVE-2020-9715 exposed a vulnerability in Adobe Reader due to a use-after-free in a JavaScript object, whereby an end user opening a specially crafted file allows the attacker to gain code execution on the user's machine [18].

F. Semantic Misunderstanding
The PDF standard, currently version 2.0, is evolving with new features being added. Misunderstandings occur when reading the specification, when translating it to code, and when writing the parsers. The PDF Association maintains a git repository dedicated to reporting ambiguities found in the PDF specification; further review is performed by industry and ISO experts [19]. Ambiguities can occur as a byproduct of the permissiveness of the specification or lack of guidance for error handling. For example, PDF parsers recover from a missing cross reference table by walking the object list sequentially and manually rebuilding the table, which is not an authorized method.

Fig. 2. VALARIN system diagram
[Diagram labels: PDF File(s); Parser(s); Stderr files; Ontology generator; Feature generator; Error Ontology; File:error feature map; Training Examples; Classifier; Dowker Complex; File classifications. Stages: Ingestion, Information Extraction, Analysis.]

III. SOFTWARE DESIGN
The software architecture diagram for VALARIN, as shown in Fig. 2, includes three main stages: ingestion, information extraction (IE), and analysis. During the Ingestion stage, PDFs are processed through our parallel pipeline, which is composed of RabbitMQ orchestration and Docker PDF parser containers.
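As a minimal sketch of what a single ingestion worker might do, the following Python captures the return status, standard output, and standard error of each parser command for one file. It is not the VALARIN implementation: the names `ParserResult`, `run_parser`, `run_all`, and `FAKE_PARSER` are hypothetical, RabbitMQ and Docker are left out entirely, and the exit code -1 is this sketch's own marker for a parser that did not run to completion.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ParserResult:
    parser: str
    exit_code: int  # -1 is this sketch's marker for "did not run to completion"
    stdout: str
    stderr: str


def run_parser(name, argv, timeout=60):
    """Run one parser command; capture its exit status, stdout, and stderr."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              errors="replace", timeout=timeout)
    except subprocess.TimeoutExpired:
        return ParserResult(name, -1, "", "timeout")
    except FileNotFoundError:
        return ParserResult(name, -1, "", "parser not installed")
    return ParserResult(name, proc.returncode, proc.stdout, proc.stderr)


def run_all(commands, pdf_path):
    """Run every configured parser command against one PDF."""
    return [run_parser(name, argv + [pdf_path]) for name, argv in commands.items()]


# Stand-in "parser" so the sketch runs without any PDF tool installed:
# it writes one warning to stderr and exits with status 3.
FAKE_PARSER = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('warning: trying to repair broken xref\\n'); sys.exit(3)",
]

if __name__ == "__main__":
    for result in run_all({"fake": FAKE_PARSER}, "example.pdf"):
        print(result.parser, result.exit_code, result.stderr.strip())
```

With real tools installed, the `commands` mapping would hold entries built from the command lines in Table I, for example `{"qpdf": ["qpdf", "--check"]}`.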

Continuous progress and performance monitoring are integrated into each step of the pipeline. At the Information Extraction (IE) stage, error feature extraction occurs, and the output is merged with our ontology to generate a list of all error types. Processed files are associated with a collection of error features for further processing. In the final analysis stage, the generated results are analyzed, and the full corpus or individual files are classified to produce a final status report for each PDF.

A. Ingestion
A large set of parsers is necessary for accurately recognizing parser differentials and classifying the safety of a file. We selected open-source parser projects that support a Linux environment. In total, we had 31 PDF parsers, including configuration options and utilities. Table I shows the parsers selected, as well as the command-line arguments used. Without formalisms, developing a complete parser from just a natural language specification leads to shotgun parsers, which are insecure and lack basic checks [20].

TABLE I
LIST OF PARSERS
Parser | Version | Command-Line
Caradoc | Jun 12, 2017 | extract --xref AA --dump BB --types CC --dot DD / stats / stats --strict
Hammer | Mar 29, 2022 | pdf
Mutool | 1.17.0 | clean -s / draw -F / draw -o / show -o
Origami (pdfcop) | 2.1.0 | --policy paranoid
Pdfium | Jul 9, 2021 | V8,V8_EXTERNAL,XFA
Pdfminer | 20201018 | dumppdf.py -a / pdf2txt.py
Pdftk | 2.02-5 | dont_ask verbose
Pdftools | 0.2.5 / 0.7.2 | pdfid.py -a -e / pdf-parser.py --stats
Peepdf | 0.3 r275 | -j
Poppler | 0.86.1 | pdffonts / pdfinfo / pdftocairo -ps / pdftoppm / pdftops / pdftotext
Qpdf | 10.1.0 | --check
Verapdf | 1.17.55 / 2.0.17 | (greenfield) verapdf --nonpdfext / (pdfbox) verapdf --nonpdfext
Xpdf | 4.02 | pdffonts / pdfinfo / pdftoppm / pdftops / pdftotext
Pdfcpu | 0.3.12 | pdfcpu validate
Pdfresurrect | 0.23b | pdfresurrect

B. Information Extraction
VALARIN’s feature extractor extracts the standard error and standard output reported by each of the parsers. Thousands of lines of standard error must be filtered to identify only the unique lines. For example, errors such as “bad character: 'A'” and “bad character: 'B'” are of the same error type, and should be clustered together. Through a normalization procedure, the output strings are transformed into representative error regex strings, as shown in Fig. 3. First, we cluster all regexes by their edit distance. Second, we identify errors of the same type, and generate a single regex to cover that cluster. Third, a pre-defined ruleset is applied to normalize the strings into error regex strings. The ruleset instructs the Rule Based Error Extractor (RBEE) how to find/replace or filter text matching provided regexes. The rules are stored as JSON and should aim to group errors that are of the same type but carry differing values. Lastly, the ontology error vector labeling occurs on the error regex. The error vector contains three types that help explain the extracted error string in a more comprehensible way, e.g., whether the error occurred in the cross reference table, file header, or trailer because of a syntax omission; further detail is provided in Section IV.

Table II shows the results after information extraction occurred on one file: the standard error was extracted and labeled for each parser. Ultimately, the file was classified as valid and non-malicious, but containing ambiguities.

Fig. 3. Error string extraction

C. Analysis
During the analysis stage, we classify and explain the structure of the labeled data from the extraction stage. Depending on the requirements of the user, we can also identify PDFs of interest whose errors match known malicious error vectors.

TABLE II
EXAMPLE OF OUTPUT IN FILE LABELED VALID-AMBIGUOUS
Parser | Exit Code | Stderr Output | Error Vector
Mutool clean | 0 | error: cannot find startxref; warning: trying to repair broken xref; warning: repairing PDF document | Syntax_error, Omission, PDF_xref_table
Poppler pdfinfo | 0 | N/A | N/A
Xpdf pdfinfo | 0 | Syntax Error: Couldn't read xref table; Syntax Warning: PDF file is damaged - attempting to reconstruct xref table... | Syntax_error, Omission, PDF_xref_table
Qpdf | 3 | WARNING:: file is damaged; WARNING:: can't find startxref; WARNING::Attempting to reconstruct cross-reference table | Syntax_warning, Omission, PDF_xref_table
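The rule-based part of the extraction and the resulting parser differential can be illustrated with the Table II data. The sketch below is an assumption-laden simplification rather than the RBEE itself: it skips the edit-distance clustering, uses “*” wildcards in place of generated regexes, and the names `normalize`, `LABELS`, and `differential` are hypothetical. Table II gives one error vector per parser; attaching that vector to one specific stderr line is this sketch's own choice.

```python
import re
from collections import defaultdict

QUOTED = re.compile(r"(?<!\w)'[^']*'")  # matches 'A' but not the apostrophe in "can't"
NUMBER = re.compile(r"\d+")


def normalize(line):
    """Collapse value-specific parts of an error line into '*' wildcards."""
    line = QUOTED.sub("'*'", line.strip())
    line = NUMBER.sub("*", line)
    return re.sub(r"\s+", " ", line)


# Exit codes and stderr lines copied from Table II.
TABLE_II = {
    "mutool clean": (0, ["error: cannot find startxref",
                         "warning: trying to repair broken xref",
                         "warning: repairing PDF document"]),
    "poppler pdfinfo": (0, []),
    "xpdf pdfinfo": (0, ["Syntax Error: Couldn't read xref table",
                         "Syntax Warning: PDF file is damaged - attempting to reconstruct xref table..."]),
    "qpdf": (3, ["WARNING:: file is damaged",
                 "WARNING:: can't find startxref",
                 "WARNING::Attempting to reconstruct cross-reference table"]),
}

# Assumed mapping from a normalized line to its (error class, error type,
# PDF descriptor) vector; the vectors themselves are the ones in Table II.
LABELS = {
    "error: cannot find startxref": ("Syntax_error", "Omission", "PDF_xref_table"),
    "Syntax Error: Couldn't read xref table": ("Syntax_error", "Omission", "PDF_xref_table"),
    "WARNING:: can't find startxref": ("Syntax_warning", "Omission", "PDF_xref_table"),
}


def differential(results, labels):
    """Map each error vector to the parsers reporting it; also list silent parsers."""
    by_vector = defaultdict(set)
    silent = set()
    for parser, (_exit_code, lines) in results.items():
        vectors = {labels[normalize(l)] for l in lines if normalize(l) in labels}
        if not vectors:
            silent.add(parser)
        for vector in vectors:
            by_vector[vector].add(parser)
    return dict(by_vector), silent


if __name__ == "__main__":
    by_vector, silent = differential(TABLE_II, LABELS)
    for vector, parsers in by_vector.items():
        print(vector, sorted(parsers))
    print("no labeled errors:", sorted(silent))
```

Here three parsers point at an omission in the cross reference table while Poppler's pdfinfo reports nothing, which is the kind of disagreement that leads to the valid-ambiguous label.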


VALARIN classifies files based on valid and invalid training examples and statistical methods. PDFs are either “accepted”, “accepted-ambiguous”, or “rejected” [21], where “accepted-ambiguous” files may need further manual inspection. The ontology characterizes files for potential malicious behavior based on pre-existing label types. VALARIN has also been used to identify groups of similar PDFs, or “dialects”, based on parser errors and differentials [22].

IV. ONTOLOGY CONSTRUCTION
The hierarchical labeling of error regexes is derived from the PDF ontology classes. Fig. 4 represents a snapshot of the ontology. The ontology continues to evolve as we gather more data and model different relationships. We have added additional classes, such as warnings and flags, to reflect the knowledge gained as we understand which errors are reported for benign documents that do not follow the standard exactly, and which tend to indicate potentially malicious documents or files that are not valid, readable PDFs.

Fig. 4. Dendrogram view of subset of ontology

Fig. 5 describes the methodology to develop a PDF ontology using the Web Ontology Language (OWL). Using Protégé, an ontology editor, we combined domain-specific knowledge, the PDF specification, and the set of parser output results into an ontology [23]. Each individual parser result is labeled with its respective PDF descriptor and error classification.

Fig. 5. PDF ontology design

We selected an OWL ontology because it is a part of the W3C standards for the Semantic Web. It is well suited for enabling automated reasoning about data, building on essential concepts, and evolving the learning model with the growth of new data [24] [25]. OWL allows us flexibility in building a knowledge construct, modeling the PDF parser differential, and representing a hierarchical relationship composed of error classification, parser error regexes, and file format characteristics. Furthermore, an ontology is more descriptive than other knowledge representations, such as maps and taxonomies, because it links relationships and multiple concepts to other concepts through well-defined properties [24] [26].

We gathered 3,021 error regexes by running our set of PDF parsers through Common Crawl [28] and Govdocs1 [27], a collection of nearly one million freely distributable document files. Each individual parser output error regex was labeled with an error vector containing the “error type”, “error class”, and “PDF file format descriptor”.

Error types originated in studies involving error analysis of grammar errors made by students learning English. One such study [29] points to the use of the surface structure taxonomy, and the resultant error types are:
1) Addition
2) Omission
3) Misordering
We apply these same concepts to reading the PDF specification, which is dense, difficult to understand, and open to human interpretation. Error analysis is the process of analyzing,


observing, interpreting, and classifying a learner’s errors to give an indication of the learner’s process. The three causes of error in learning are carelessness, first-language interference, and translation. Another study [30] lists error statistics for each error type: misformation (54.17%), addition (19.44%), omission (13.19%), and misordering (12.50%). Our PDF classification results closely match these figures: a statistical count of parser error types shows that syntax errors related to misformation indeed dominate.

Error class was derived by performing empirical studies of the error strings reported by the parsers. Through manual curation of errors, we settled on five error classes:
1) Index
2) Reference
3) Syntax
4) Type
5) Value

PDF file format descriptor is a hierarchical class model of the PDF file structure. It was limited to three layers of subclasses, in order to keep the level of information general and easy to understand. The following are some use cases:
• Generating a frequency count of the PDF file format structure per parser’s error output can be a useful guide for analysts who are interested in finding a subset of problematic issues in PDFs.
• “Mutool”, with fifteen reportable errors for the cross reference table (the highest among all parser classes), is better suited for discovery related to cross references.
• Poppler’s “pdfinfo” is more likely to report issues related to PDF annotations than “mutool”.
• The “pdfinfo” tool may only report cursory high-level detail, while the “mutool clean” tool dives deeper into images and streams to verify the integrity of the entire PDF file.

By constructing the OWL ontology as a hierarchical collection of individually labeled classes with properties, we can create a model similar to a knowledge graph. In addition, OWL ontologies are easy to share, are reusable, and can add new knowledge to a domain. An OWL ontology can formally specify components (individuals, classes, attributes, and relationships) and apply restrictions, rules, and axioms.

V. EVALUATION RESULTS
As part of the evaluation process, we used our VALARIN system to process and classify files from three different sets (named universes in the SafeDocs program), A, B, and C, totaling one million files. Universe A contains eight hundred thousand files, labeled valid or invalid as defined by the parsers “xpdf”, “qpdf”, “pdfbox”, and “mutool”. Universe B is a superset of Universe A with an additional one hundred thousand files and the addition of the “poppler” parser. Universe C is a superset of Universe A with an additional one hundred thousand files and the addition of the “pdfminer” parser. The three universes were generated with PDF malformations tuned to the specific parsers within each universe.

The purpose of the universes was to emulate a scenario in which a new parser is added to a set of known parsers, and to test whether our tools could discern how the new parser affected the classification of PDFs. Our key understanding is that each parser has a unique code base with its own interpretation of the specification. Disagreements naturally arise between parsers on PDF validity. The results of our analysis of the three universes demonstrated that our ontology successfully guided VALARIN in summarizing and explaining differences across universes.

Our evaluation results have shown that even among known valid files, there is no universal consensus among parsers. Fig. 6 shows the frequency of error classes among the valid files across universes. Because Universe A is a subset of both B and C, we anticipated that the histogram of error vectors extracted from the parsers would show the fewest errors for Universe A. The outcome is indeed as expected.

Each additional parser introduced and detected many new errors, in particular: (1) additions to the cross reference table, from which parsers typically recover silently without notifying the user; and (2) misformation of fonts, where even valid files contain potential syntactic errors in PDF fonts. Ambiguous font parsing may still result in the parser reporting a valid PDF. Fonts/glyphs and the associated font processing engines are known to be rife with vulnerabilities. Most font engines are large, complex, and share large code bases. Fonts are divided between two sizes: simple fonts that are one byte and composite fonts that can be one or two bytes. The PDF specification [16] lists five font types: (1) Type 0, (2) Type 1, (3) Type 3, (4) TrueType/OpenType, and (5) CIDFont, with some intermingling between types. Type 1 fonts include 14 standard fonts that are expected to be available to the PDF processor. TrueType/OpenType font files can be external or embedded in the PDF. All the relevant font information should be captured in a font dictionary. However, each parser takes independent action depending on whether a font is unavailable, misspelled, or syntactically incorrect. Fig. 6 highlights that malformed fonts are still an ongoing concern in recent parsers [2] [31].

Fig. 6. PDF fonts are a cause of parser errors even in valid PDF files
[Chart: “PDF Error Vectors for Valid Files”, comparing Universe A, Universe B, and Universe C for the error vectors ('Syntax_error', 'Misformation', 'PDF_resources'); ('Syntax_error', 'Misformation', 'PDF_indirect_objects'); ('Syntax_error', 'Addition', 'PDF_xref_table'); ('Syntax_error', 'Misformation', 'PDF_annotations'); ('Syntax_error', 'Omission', 'PDF_indirect_objects'); ('Syntax_error', 'Misformation', 'PDF_fonts').]
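The kind of count behind Fig. 6 can be sketched as follows. The parser sets per universe come from the text above, but the observations are invented toy data, not results from the evaluation, and the function names are hypothetical. The sketch also models only the added parser in Universes B and C; it ignores the additional one hundred thousand files each of them contains.

```python
from collections import Counter

# Parser sets per universe, as described in Section V.
UNIVERSE_PARSERS = {
    "A": {"xpdf", "qpdf", "pdfbox", "mutool"},
    "B": {"xpdf", "qpdf", "pdfbox", "mutool", "poppler"},
    "C": {"xpdf", "qpdf", "pdfbox", "mutool", "pdfminer"},
}

# Hypothetical (file, parser, error vector) observations for valid files.
OBSERVATIONS = [
    ("f1.pdf", "mutool", ("Syntax_error", "Misformation", "PDF_fonts")),
    ("f1.pdf", "poppler", ("Syntax_error", "Misformation", "PDF_fonts")),
    ("f2.pdf", "qpdf", ("Syntax_error", "Omission", "PDF_indirect_objects")),
    ("f2.pdf", "poppler", ("Syntax_error", "Addition", "PDF_xref_table")),
    ("f3.pdf", "pdfminer", ("Syntax_error", "Misformation", "PDF_annotations")),
]


def vector_counts(observations, parsers):
    """Count error vectors reported by the parsers of one universe."""
    return Counter(vec for _file, parser, vec in observations if parser in parsers)


def new_vectors(observations, base, extended):
    """Error vectors seen with the extended parser set but not with the base set."""
    return set(vector_counts(observations, extended)) - set(vector_counts(observations, base))


if __name__ == "__main__":
    for name, parsers in UNIVERSE_PARSERS.items():
        print(name, dict(vector_counts(OBSERVATIONS, parsers)))
    print("new in B:", new_vectors(OBSERVATIONS, UNIVERSE_PARSERS["A"], UNIVERSE_PARSERS["B"]))
```

Because the parsers of Universe A are a subset of those in B and C, every count for A is at most the corresponding count for B or C, which mirrors the expectation stated for the histogram.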

Fig. 7. Malformed cross-reference tables are a common issue for invalid PDF files
[Chart: “PDF Error Vectors for Rejected Files”, comparing Universe A, Universe B, and Universe C for the error vectors ('Syntax_error', 'Misformation', 'PDF_fonts'); ('Value_error', 'Omission', 'PDF_trailer'); ('Syntax_error', 'Misformation', 'PDF_file_structure'); (None, None, 'General_file_rwo'); ('Type_error', 'Misformation', 'PDF_objects'); ('Syntax_error', 'Omission', 'PDF_trailer'); ('Syntax_error', 'Omission', 'PDF_xref_table'); ('Syntax_error', 'Misformation', 'PDF_xref_table').]

Comparing Fig. 6 against Fig. 7, we see very little overlap between valid and invalid files. We note that the addition of more parsers results in more errors of the same types. For the valid files, there was added information in the cross reference table that did not affect parsing. However, in the invalid files, if the cross reference table contained missing data or was malformed, the parser likely failed to recover. The second highest error vector count in invalid files, (Syntax_error, Omission, PDF_xref_table), indicates the trailer was missing or it was malformed enough to prevent successful parsing.

Our results are summarized as follows:
• A valid PDF file may still exhibit ambiguous parsing or contain errors. A successful parse by one parser is not a guarantee of similar behavior in another parser.
• Using our ontology, we correctly identified malicious files exhibiting missing data or extra data.
• Implementing error recovery or backward compatibility is open to interpretation by the parser, if not explicitly stated in the specification.
• We can discover verification flaws or file issues by comparing parser error differentials using an ontology.
• As additional PDF-based cybersecurity threats are reported, we can apply our ontology to categorize the malicious files used, and suggest which parsers are most accurate in detecting those malicious files.

VI. CASE STUDY OF MALICIOUS FILES
To further illustrate our classification approach, we also ran publicly available PDFs through VALARIN and show two results. VALARIN successfully identified the malicious features in the two files.

The first PDF, pdf-doc-vba-eicar-dropper.pdf, accessible from [37], uses JavaScript to automatically extract an embedded Word document and then save the file to our system. This example file is benign, but demonstrates the capability of PDFs to potentially extract and save malware on a victim’s machine. Table III shows the results for each parser, with errors detected and their corresponding ontology tags. We can see that the file itself has no malformations, but a few parsers are capable of detecting the embedded document, as well as the OpenAction and JavaScript external actions.

Another PDF contains a Shadow Attack, as described in Section II.B, where the signed document has been manipulated to replace the original font descriptor object. The file used is variant-2_replace-via-overwrite/4_original-document-shadowed-signed-manipulated.pdf, provided at [38]. Table IV again shows the errors detected by each parser, and the ontology tags classifying each error. We can see that there are a few malformations in the file, such as extraneous bytes and xref issues. The shadow attack is then detected by looking for an object reference error in pdfcpu and the presence of incremental updates in pdfresurrect (by checking that the number of versions is greater than 1).

TABLE III
PDF JAVASCRIPT DROPPER CLASSIFICATION
Parser | Error regex | Ontology Tags
Origami | “OpenAction entry = YES” | “External_action”, “OpenAction”
Pdfminer | “Embedded File” | “PDF_Embedded_Object”
Pdftools | “/OpenAction” | “External_action”, “OpenAction”
Pdftools | “JavaScript” | “External_action”, “JavaScript”
Peepdf | “Actions:/JavaScript” | “External_action”, “JavaScript”
Peepdf | “Triggers:/OpenAction” | “External_action”, “OpenAction”
Peepdf | “elements:/EmbeddedFile” | “PDF_Embedded_Object”
Pdfcpu | “validation error (obj#:*): JavaScript: unsupported in version *” | “External_action”, “JavaScript”

VII. RELATED WORK
The PDF Association, an organization dedicated to the promotion and education of PDF technologies, created PDF/A, which replaces normal PDFs for archival purposes. PDF/A addresses the shortcomings of the PDF format by requiring that all content necessary for rendering be self-contained and by prohibiting dynamic content [32].

The PDF format does not have a reference implementation. However, the “Arlington PDF DOM” model is a specification-derived, machine-readable definition of the full PDF document object model (DOM). Tab Separated Value (TSV) files and a “TestGrammar” tool can be used to validate PDFs [33]. Our PDF ontology does not aim to represent the complete PDF specification, but produces a simplified model for easier understanding, including parser error information.


TABLE IV Significant research has gone into identifying maliciousness


PDF SHADOW ATTACK CLASSIFICATION embedded in PDFs. In particular, PDF malware parsers first
extract the structure of PDF for features, and then applying
Parser Error regex Ontology Tags machine learning on the feature set to train a classifier to
determine if a PDF is malicious [34]. Our approach primarily
Caradoc “PDF error : Lexing “Addition”,
error : unexpected word “PDF_file_structure”,
utilizes the error output of the parsers as features to verify
at offset * in file !” ‘Syntax_error” safety.
In [35], a taxonomy describes the features and malware
Hammer “error parsing xref “Misformation”, approach to identifying malicious PDF files. Our approach is
section at position *” “PDF_xref_table”, different; our ontology is composed of hierarchical classes
“Syntax_error”
containing error messages, return code, and error vectors. Our
“VIOLATION[1]@*: “Addition”, collection of PDF parsers in the VALARIN system determines
Greater-than-2-byte WS “PDF_xref_table”, the safety of the PDF. This is markedly different to malware
at end of xref entry “Syntax_error” PDF tools, which are interested in identifying maliciousness
(severity=1)”
alone, as defined by behavior with code execution [36].
Mutool “error: cannot recognize “Misformation”, While using different parsers to parse PDFs can be construed
xref format” “PDF_xref_table”, as another form of PDF malware detection, VALARIN
“Syntax error”
additionally uses the differential between parsers itself as an
warning: trying to repair “Misformation”, identifying feature. By further enriching the error differentials
broken xref “PDF_xref_table”, of each parser with addition error properties in the PDF
“Syntax wrn” ontology, we gain a richer knowledge of PDF semantic
Origami “AcroForm = YES” “External_action”, “AcroForm” differences, more human understanding, and potentially
Peepdf “Triggers:/AcroForm” “External_action”, “AcroForm” detection of new novel attacks on PDFs.

Poppler “Syntax Error: Couldn't “Omission”, VIII. CONCLUSION


find trailer dictionary” “PDF_trailer”, Our VALARIN system for format analysis has successfully
“Syntax_error”
classified PDF files and error messages based on a hierarchical
ontology, but continues to evolve. Additional formats enrich our
awareness of new parsers and help us gain insight into new
parser differentials. Acquired knowledge is integrated back into
our knowledge graph, which comprises our error ontology. It has
expanded to account for flagging various levels of safety,
warnings, and attack behaviors.

A limitation of domain-specific ontology generation is that
the base class hierarchy performs better when initially designed
by a knowledgeable user of the file format. However, as our
framework matures, we are exploring semi-automated methods
of ontology generation and merging, such as natural language
processing that applies frequency analysis with word vectors, or
ontology matching with keywords. Subsequent successful
classification of JPEG and MPEG files demonstrated VALARIN’s
resiliency and usefulness in adapting to new file formats.

As the PDF specification changes or adapts to extant parser
differentials, it becomes increasingly difficult to determine what
is a safe PDF. VALARIN’s adoption of a corpus of parsers has
shown efficacy in identifying problem areas in the PDF format
and classifying unsafe PDFs. Additionally, the PDF ontology
provides an analyst with an explainable summary of the outcome of
examining a PDF. For malware analysis, we have shown that
a set of PDF parser output differentials is a viable feature
for identifying potentially malicious behavior. Our future work
includes transitioning VALARIN into a document-scanning
pipeline to alert on unwanted behavior, and further maturing our
ontology.

              “Syntax Error: Invalid XRef entry \\d+”                                                                   “Misformation”, “PDF_xref_table”, “Syntax_error”
              “Internal Error: xref num \\d+ not found but needed, try to reconstruct<\\d+a>”                           “Omission”, “PDF_xref_table”, “Syntax_error”
Qpdf          “WARNING: *: file is damaged”                                                                             “General_file_rwo”
Qpdf          “WARNING: * (offset \\d+): xref not found”                                                                “Omission”, “PDF_xref_table”, “Syntax_error”
Qpdf          “WARNING: *: Attempting to reconstruct cross-reference table”                                             “General_informative”
Xpdf          “Syntax Error: Couldn't read xref table”                                                                  “Omission”, “PDF_xref_table”, “Syntax_error”
Xpdf          “Syntax Warning: PDF file is damaged - attempting to reconstruct xref table.”                             “Misformation”, “PDF_xref_table”, “Syntax_error”
Pdfcpu        “dereferenceObject: problem dereferencing object \\d+: pdfcpu: ParseObjectAttributes: can't find "obj"”   “Syntax_error”, “Misformation”, “PDF_file_structure”
Pdfresurrect  “Versions: ^[2-9]|[1-9]\d+$”                                                                              “Addition”, “Value_wrn”, “PDF_incremental_updates”

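The mapping from parser error messages to ontology classes listed above can be illustrated with a minimal Python sketch. It is not the VALARIN implementation: the RULES layout and the classify() helper are hypothetical, only a few Qpdf and Xpdf rows are transcribed, "\\d+" is assumed to stand for the regex \d+, the bare "*" is assumed to be a wildcard, and the sample log lines are synthetic strings written to match the patterns rather than captured parser output.

```python
import re

# Hypothetical rule table: parser name -> list of (regex, ontology classes).
# Assumptions: "\\d+" in the table means the regex \d+, and "*" means ".*".
RULES = {
    "qpdf": [
        (r"WARNING: .*: file is damaged", {"General_file_rwo"}),
        (r"WARNING: .* \(offset \d+\): xref not found",
         {"Omission", "PDF_xref_table", "Syntax_error"}),
        (r"WARNING: .*: Attempting to reconstruct cross-reference table",
         {"General_informative"}),
    ],
    "xpdf": [
        (r"Syntax Error: Couldn't read xref table",
         {"Omission", "PDF_xref_table", "Syntax_error"}),
        (r"Syntax Warning: PDF file is damaged - attempting to reconstruct xref table\.",
         {"Misformation", "PDF_xref_table", "Syntax_error"}),
    ],
}


def classify(parser, output):
    """Return the union of ontology classes whose pattern matches any output line."""
    classes = set()
    for line in output.splitlines():
        for pattern, labels in RULES.get(parser, []):
            if re.search(pattern, line):
                classes |= labels
    return classes


# Synthetic log lines constructed to match the patterns (not real parser logs).
qpdf_log = (
    "WARNING: sample.pdf: file is damaged\n"
    "WARNING: sample.pdf (offset 1234): xref not found\n"
)
xpdf_log = "Syntax Error: Couldn't read xref table\n"

per_parser = {
    "qpdf": classify("qpdf", qpdf_log),
    "xpdf": classify("xpdf", xpdf_log),
}

# Classes reported by every parser: one simple way to express consensus.
shared = set.intersection(*per_parser.values())
print(sorted(shared))
```

Taking the intersection is only one possible notion of consensus among parsers; a weighted vote or a per-class count would fit the same data structure.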
Approved for Public Release; Distribution Unlimited.


Not export controlled per ES-FL-052522-0080.

ACKNOWLEDGMENT
The authors would like to thank the SafeDocs community,
especially Dan Becker and his team at Kudu Dynamics, for
support and providing test data, and Sergey Bratus for his
support and feedback on this work. This material is based upon
work supported by the Defense Advanced Research Projects
Agency (DARPA) under Contract No. HR001119C0072. Any
opinions, findings and conclusions or recommendations
expressed in this material are those of the author(s) and do not
necessarily reflect the views of DARPA.

REFERENCES
[1] “DARPA SafeDocs Program,” https://www.darpa.mil/program/safe-documents, [Accessed April 2022].
[2] M. Jurczyk, “One font vulnerability to rule them all,” https://j00ru.vexillium.org/slides/2015/recon.pdf, REcon, 2015.
[3] Check Point, “Cyber security report,” https://pages.checkpoint.com/cyber-security-report-2022, [Accessed October 2022].
[4] HP Threat Research Blog, “PDF malware is not yet dead,” https://threatresearch.ext.hp.com/pdf-malware-is-not-yet-dead, [Accessed October 2022].
[5] D. Lukan, “PDF File Format: Basic Structure [updated 2020],” Infosec Resources, 27 May 2021, resources.infosecinstitute.com/topic/pdf-file-format-basic-structure/, [Accessed 9 Nov. 2022].
[6] N. Fleury, T. Dubrunquez, and I. Alouani, “PDF-Malware: An overview on threats, detection, and evasion attacks,” https://arxiv.org/pdf/2107.12873.pdf, Jul. 27, 2021.
[7] R.J. Fedor, “Botched Redaction: IRS audit reveals Bristol Myers offshore tax fight,” https://www.fedortax.com/blog/botched-redaction-irs-audit-reveals-bristol-myers-offshore-tax-fight, [Accessed April 2022].
[8] S. Adhatarao and C. Lauradoux, “Exploitation and sanitization of hidden data in PDF files,” https://arxiv.org/abs/2103.02707, Mar. 2021.
[9] “An examination of the redaction functionality of Adobe Acrobat Pro DC 2017,” https://www.cyber.gov.au/acsc/view-all-content/publications/examination-redaction-functionality-adobe-acrobat-pro-dc-2017, [Accessed April 2022, Last Updated Oct. 2021].
[10] C. Cimpanu, “New 'Shadow Attack' can replace content in digitally signed PDF files,” https://www.zdnet.com/article/new-shadow-attack-can-replace-content-in-digitally-signed-pdf-files/, [Accessed April 2022].
[11] “PDF insecurity,” https://pdf-insecurity.org/, [Accessed April 2022].
[12] S. Rohlmann, V. Mladenov, C. Mainka, and J. Schwenk, “Breaking the specification: PDF certification,” 2021 IEEE Symposium on Security, May 2021.
[13] L. Schroeder, “Is the information you just redacted really gone?” PDF Association, https://www.pdfa.org/is-the-information-you-just-redacted-really-gone/, [Accessed April 2022].
[14] A. Blonce, E. Filiol, and L. Frayssignes, “Portable document format (pdf) security analysis and malware threats,” in Presentations of Europe BlackHat 2008 Conference, 2008.
[15] P. Laskov and N. Šrndić, “Static detection of malicious JavaScript-bearing PDF documents,” in Proceedings of the 27th annual computer security applications conference, 2011.
[16] “ISO - ISO 32000-1:2008,” https://www.iso.org/standard/51502.html, [Accessed April 2022].
[17] K. Liu, “Dig into the attack surface of PDF and gain 100 CVEs in 1 Year,” https://www.blackhat.com/docs/asia-17/materials/asia-17-Liu-Dig-Into-The-Attack-Surface-Of-PDF-And-Gain-100-CVEs-In-1-Year-wp.pdf, Blackhat 2017, [Accessed April 2022].
[18] S. Martinez, “Analysis of a use-after-free vulnerability in Adobe Acrobat Reader DC,” https://blog.exodusintel.com/2021/04/20/analysis-of-a-use-after-free-vulnerability-in-adobe-acrobat-reader-dc/, [Accessed April 2022].
[19] PDF Association, “PDF-issues,” https://github.com/pdf-association/pdf-issues, [Accessed April 2022].
[20] F. Momot, S. Bratus, S. M. Hallberg, and M. L. Patterson, “The seven turrets of babel: A taxonomy of langsec errors and how to expunge them,” in 2016 IEEE Cybersecurity Development (SecDev). IEEE, 2016.
[21] M. Robinson, “Looking for non-compliant documents using error messages from multiple parsers,” 2021 LangSec Workshop, May 2021.
[22] M. Robinson, L. Li, C. Anderson, S. Huntsman, “Statistical detection of format dialects using weighted Dowker complex,” https://arxiv.org/pdf/2201.08267.pdf, Jan. 2022.
[23] Stanford University, “Protégé,” https://protege.stanford.edu/, [Accessed May 2022].
[24] “What are ontologies?” https://www.ontotext.com/knowledgehub/fundamentals/what-are-ontologies, [Accessed April 2022].
[25] “OWL 101,” https://cambridgesemantics.com/blog/semantic-university/learn-owl-rdfs/owl-101, [Accessed April 2022].
[26] T. Gruber, “What is an ontology?” http://www-ksl.stanford.edu/kst/what-is-an-ontology.html, [Accessed April 2022].
[27] “Govdocs,” https://digitalcorpora.org/corpora/files, [Accessed April 2022].
[28] “Common Crawl,” http://commoncrawl.org, [Accessed April 2022].
[29] N. Ma’mun, “The grammatical errors on the paragraph writings,” Jurnal Vision, Vol. 5 Num 1, April 2016.
[30] J. Pardosi, R. Veronika Karo, O. Sijabat, et al., “An error analysis of students in writing narrative text,” Linguistic, English, Education, and Art Journal, Vol 3 Num 1, Dec. 2019.
[31] NIST National Vulnerability Database, “CVE-2022-24091 detail,” https://nvd.nist.gov/vuln/detail/CVE-2022-24091, [Accessed October 2022].
[32] PDF Association, “PDF/A FAQ,” https://www.pdfa.org/pdfa-faq/, [Accessed April 2022].
[33] P. Wyatt, “Arlington PDF model,” https://github.com/pdf-association/arlington-pdf-model, [Accessed April 2022].
[34] J. Zhang, “MLPdf: An effective machine learning based approach for PDF malware detection,” https://arxiv.org/pdf/1808.06991.pdf, Aug. 21, 2018.
[35] M. Elingiusti, L. Aniello, L. Querzoni, R. Baldoni, “PDF-malware detection: a survey and taxonomy of current techniques,” https://core.ac.uk/download/pdf/188824539.pdf, 2018.
[36] J. Muller, D. Noss, C. Mainka, V. Mladenov, and J. Schwenk, “Portable document flaws 101,” https://i.blackhat.com/USA-20/Thursday/us-20-Mueller-Portable-Document-Flaws-101.pdf, Blackhat USA, 2020.
[37] D. Stevens, “Test file: PDF with embedded DOC dropping EICAR,” 28 Aug. 2015, blog.didierstevens.com/2015/08/28/test-file-pdf-with-embedded-doc-dropping-eicar/.
[38] “Attacks on PDF signatures,” https://pdf-insecurity.org/download/exploits-shadow/replace.zip, [Accessed November 2022].

