VII presents the related work, and Section VIII concludes the paper and presents avenues for future work.
II. COMMON PDF PROBLEMS
Investigating a PDF is difficult because the format is complex and evolving, and independent parsers implement the specification differently. No consensus exists among parsers on what it means to parse and validate successfully. These differences introduce unintended consequences that malicious actors can exploit. In the following subsections, we describe six main types of PDF parsing errors, though the list is not exhaustive.
A. Redaction
Redaction is the act of removing sensitive information from documents. Typically, PDF redaction is accomplished by adding a black box to hide the text. However, PDFs contain additional metadata or layered data that cannot be removed with a simple black box. Multiple instances of bad redaction have been reported in the news, costing the companies millions [7] or leaking sensitive government information [8]. Redaction or sanitization is very software dependent. One cause of concern is that using two different software packages, one to create the document and another to redact it, may not be successful. In one reported example, CutePDF/LibreOffice prevented Adobe Reader from redacting the sensitive information [9]. VALARIN does not check for visual differences when PDFs are displayed. Therefore, issues related to redaction are ignored by VALARIN unless they generate implementation errors.
B. Shadow Attacks
Shadow Attacks are the opposite of redaction: instead of blocking sensitive information, the attacker deliberately falsifies information to trick the user. The attack leverages the concept of “view layers” [10]. By changing the layer displayed, the attacker shows the signed trusted layer but manipulates the visible artifacts for nefarious purposes. Three variants of the attack exist: (1) hide, (2) replace, and (3) hide and replace [11]. Our ontology-based classifier detects attacks occurring when parsers report an error parsing the cross reference table and the PDF contains incremental updates, indicative of a shadow attack. Security researchers have continued to uncover layer flaws in the certification and encryption process related to PDF incremental updates [12]. See Section VI for a case study.
C. Hidden Data
Hidden Data is a superset of redaction and shadow attacks. Underscoring its seriousness, the National Security Agency (NSA) identified 11 types of hidden data in PDFs that should be checked and removed before release, especially by security agencies [8]. Metadata embedded in documents, even after sanitization, may reveal sensitive information about agency staff and the software used. One may need specialized tools to remove the data correctly [13].
D. Actions
PDFs contain JavaScript actions that can download malicious data or redirect to a URL. Examples of these calls are “getURL()” and “launchURL()”. These actions can be hidden by using “/OpenAction”, which is performed when the document is viewed. Additional complexity occurs when the JavaScript action takes advantage of a JavaScript vulnerability or an Adobe Reader API vulnerability [14] [15].
E. CVEs
The PDF specification, ISO 32000-1:2008, is 700+ pages [16]. PDF is a complex format and has a large attack surface. Over the course of a year, a security researcher discovered over 150 vulnerabilities across a broad range of PDF software, in particular Adobe. The top four surface areas containing vulnerabilities are PDF converter, JPEG2000, XML Forms Architecture (XFA), and rendering [17]. For instance, CVE-2020-9715 exposed a vulnerability in Adobe Reader due to a use-after-free in a JavaScript object, whereby an end-user opening the specially crafted file allows the attacker to gain code execution on their machine [18].
F. Semantic Misunderstanding
The PDF standard, currently version 2.0, is evolving with new features being added. Misunderstandings occur when reading the specification, when translating it to code, and when writing the parsers. The PDF Association maintains a git repository dedicated to reporting ambiguities found in the PDF specification; further review is performed by industry and ISO experts [19]. Ambiguities can occur as a byproduct of the permissiveness of the specification or a lack of guidance for error handling. For example, PDF parsers recover from a missing cross reference table by walking the object list sequentially and manually rebuilding the table, which is not an authorized method.
III. SOFTWARE DESIGN
The software architecture diagram for VALARIN, as shown in Fig. 2, includes three main stages: ingestion, information extraction (IE), and analysis. During the Ingestion stage, PDFs are processed through our parallel pipeline, which is composed of RabbitMQ orchestration and Docker PDF parser containers.
Fig. 2. VALARIN system diagram. [Diagram labels: PDF File(s); Ingestion; Parser(s); Stderr files; Information Extraction; Ontology generator; Error Ontology; Feature generator; File:error feature map; Training Examples; Analysis; Dowker Complex; Classifier; File classifications.]
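As a rough illustration of the ingestion stage described in Section III, the sketch below runs several parser commands over one PDF in parallel and collects each parser's stderr. It is a minimal sketch only: a local thread pool stands in for the RabbitMQ orchestration and Docker parser containers, and the `ParserResult` type, the function names, and the demo commands are assumptions made for illustration, not VALARIN's actual interface.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ParserResult:
    parser: str      # label for the parser (hypothetical naming)
    returncode: int  # exit status reported by the parser process
    stderr: str      # raw error text, the input to information extraction


def run_parser(name, argv, pdf_path, timeout=60):
    """Run one parser command on one file and capture its stderr."""
    proc = subprocess.run(
        argv + [pdf_path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return ParserResult(name, proc.returncode, proc.stderr)


def ingest(pdf_path, parsers):
    """Run every parser in `parsers` (name -> argv list) on `pdf_path` in parallel."""
    with ThreadPoolExecutor(max_workers=len(parsers) or 1) as pool:
        futures = {
            name: pool.submit(run_parser, name, argv, pdf_path)
            for name, argv in parsers.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Stand-in "parsers" built from the Python interpreter so the sketch runs
# without any PDF tools installed; real parser commands would replace these.
DEMO_PARSERS = {
    "quiet": [sys.executable, "-c", "import sys"],
    "noisy": [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('error: bad xref in ' + sys.argv[1]); sys.exit(1)",
    ],
}

if __name__ == "__main__":
    for name, result in ingest("sample.pdf", DEMO_PARSERS).items():
        print(name, result.returncode, repr(result.stderr))
```

Keeping the stderr text per parser, rather than only the exit status, matters here because the later stages work from the error strings themselves.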
Approved for Public Release; Distribution Unlimited.
Not export controlled per ES-FL-052522-0080.
observing, interpreting, and classifying learners’ errors to give an indication of the learner’s process. The three causes of error in learning are: carelessness, first language interference, and translation. One study [30] listed error statistics for each error type: misformation (54.17%), addition (19.44%), omission (13.19%), and misordering (12.50%). Our results of PDF classification closely match those results. Conducting a statistical count of parser error types, we observe that syntax errors related to misformation indeed dominate parser error types.
Error class was derived by performing empirical studies of the error strings reported by the parsers. Through a manual curation of errors, we settled on five error classes:
1) Index
2) Reference
3) Syntax
4) Type
5) Value
The PDF file format descriptor is a hierarchical class model of the PDF file structure. It was limited to three layers of subclasses in order to keep the level of information general and easy to understand. The following are some use cases:
• Generating a frequency count of the PDF file format structure per parser’s error output can be a useful guide for analysts who are interested in finding a subset of problematic issues in PDFs.
• “Mutool”, with fifteen reportable errors for the cross reference table (the highest among all parser classes), is better suited for discovery related to cross references.
• Poppler’s “pdfinfo” is more likely to report issues related to PDF annotations than “mutool”.
• The “pdfinfo” tool may only report cursory high-level detail, while the “mutool clean” tool dives deeper into images and streams to verify the integrity of the entire PDF file.
By constructing the OWL ontology as a hierarchical collection of individually labeled classes with properties, we can create a model similar to a knowledge graph. In addition, OWL ontologies are easy to share, reusable, and can add new knowledge to a domain. An ontology can formally specify components (individuals, classes, attributes, and relationships) and apply restrictions, rules, and axioms.
V. EVALUATION RESULTS
As part of the evaluation process, we used our VALARIN system to process and classify files from three different sets (named universes in the SafeDocs program), A, B, and C, totaling one million files. Universe A contains eight hundred thousand files, labeled valid or invalid as defined by the parsers “xpdf”, “qpdf”, “pdfbox”, and “mutool”. Universe B is a superset of Universe A with an additional one hundred thousand files and the addition of the “poppler” parser. Universe C is a superset of Universe A with an additional one hundred thousand files and the addition of the “pdfminer” parser. The three universes were generated with PDF malformations tuned to the specific parsers within each universe.
The purpose of each universe was to emulate a scenario in which a new parser is added to a set of known parsers, and to test whether our tools could discern how the new parser affected the classification of PDFs. Our key understanding is that each parser has a unique code base with its own interpretation of the specification. Disagreements naturally arise between parsers on PDF validity. The results of our analysis of the three universes demonstrated that our ontology successfully guided VALARIN in summarizing and explaining differences across universes.
Our evaluation results have shown that even among known valid files, there is no universal consensus among parsers. Fig. 6 shows the frequency of error classes among the valid files across universes. Because Universe A is a subset of both B and C, we anticipated that the histogram of error vectors extracted from the parsers would show Universe A with the fewest errors. The outcome is as expected.
Each additional parser introduced and detected many new errors, in particular: (1) Addition to the cross reference table, which parsers typically recover from silently without notifying the user; and (2) Misformation of fonts: even in valid files, there are potential syntactic errors in parsing PDF fonts. Ambivalent font parsing may still result in the parser reporting a valid PDF. Fonts/glyphs and the associated font processing engines are known to be rife with vulnerabilities. Most font engines are large, complex, and share large code bases. Fonts are divided between two sizes: simple fonts, which are one byte, and composite fonts, which can be one or two bytes. The PDF specification [16] lists five font types: (1) Type 0, (2) Type 1, (3) Type 3, (4) TrueType/OpenType, and (5) CIDFont, with some intermingling between types. Type 1 fonts include the 14 standard fonts that are expected to be available to the PDF processor. TrueType/OpenType font files can be external or embedded in the PDF. All the relevant font information should be captured in a font dictionary. However, each parser takes independent action when a font is not available, misspelled, or syntactically incorrect. Fig. 6 highlights that malformed fonts are still an ongoing concern in recent parsers [2] [31].
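Fig. 6 counts error vectors, each a triple of error class, error type, and PDF structure. As a rough illustration of how a parser's error string might be mapped to such a triple, the sketch below applies simple keyword rules. The rules, the sample messages, and the function name `to_error_vector` are hypothetical and chosen only for illustration; VALARIN's ontology-based mapping is not reproduced here.

```python
# Hypothetical keyword rules: every keyword in the tuple must appear in the
# lowercased message. The first matching rule wins, so specific rules come first.
RULES = [
    (("xref", "extra"), ("Syntax_error", "Addition", "PDF_xref_table")),
    (("object", "missing"), ("Syntax_error", "Omission", "PDF_indirect_objects")),
    (("font",), ("Syntax_error", "Misformation", "PDF_fonts")),
    (("annot",), ("Syntax_error", "Misformation", "PDF_annotations")),
]


def to_error_vector(message):
    """Return an (error class, error type, PDF structure) triple, or None."""
    text = message.lower()
    for keywords, vector in RULES:
        if all(keyword in text for keyword in keywords):
            return vector
    return None  # no rule matched; a real system would need a fallback class


if __name__ == "__main__":
    # Invented messages, not real parser output.
    for msg in ["warning: extra entries in xref table",
                "Invalid Font descriptor",
                "object 12 0 missing"]:
        print(msg, "->", to_error_vector(msg))
```

A flat rule list like this is easy to audit but loses the hierarchy that an ontology provides, which is one reason to treat it only as a stand-in.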
Fig. 6. PDF fonts are a cause of parser errors even in valid PDF files. [Chart: “PDF Error Vectors for Valid Files”; series: Universe A, Universe B, Universe C; error vectors shown: ('Syntax_error', 'Misformation', 'PDF_resources'), ('Syntax_error', 'Misformation', 'PDF_indirect_objects'), ('Syntax_error', 'Addition', 'PDF_xref_table'), ('Syntax_error', 'Misformation', 'PDF_annotations'), ('Syntax_error', 'Omission', 'PDF_indirect_objects'), ('Syntax_error', 'Misformation', 'PDF_fonts').]
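The frequency comparison behind Fig. 6 can be approximated with a simple tally. The following is a minimal sketch, assuming each file's parser output has already been reduced to error-vector triples; the toy observations and the helper names `tally` and `growth` are illustrative assumptions, not VALARIN data or code.

```python
from collections import Counter

FONTS = ("Syntax_error", "Misformation", "PDF_fonts")
XREF = ("Syntax_error", "Addition", "PDF_xref_table")


def tally(vectors_by_universe):
    """Count how often each error vector occurs in each universe."""
    return {name: Counter(vectors) for name, vectors in vectors_by_universe.items()}


def growth(baseline, extended):
    """Return the vectors whose count is higher in `extended` than in `baseline`."""
    return {
        vector: count - baseline.get(vector, 0)
        for vector, count in extended.items()
        if count > baseline.get(vector, 0)
    }


# Toy observations (invented): universe B repeats A's vectors and adds two
# more font errors, mimicking the effect of adding one parser.
observations = {
    "A": [FONTS, XREF],
    "B": [FONTS, XREF, FONTS, FONTS],
}

counts = tally(observations)
delta = growth(counts["A"], counts["B"])

if __name__ == "__main__":
    for vector, extra in delta.items():
        print(vector, "+", extra)
```

Comparing each extended universe against the shared baseline isolates the vectors attributable to the added files and parser, which is the kind of difference the universes were designed to expose.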
ACKNOWLEDGMENT
The authors would like to thank the SafeDocs community, especially Dan Becker and his team at Kudu Dynamics, for support and providing test data, and Sergey Bratus for his support and feedback on this work. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0072. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.
REFERENCES
[1] “DARPA SafeDocs Program,” https://www.darpa.mil/program/safe-documents, [Accessed April 2022].
[2] M. Jurczyk, “One font vulnerability to rule them all,” https://j00ru.vexillium.org/slides/2015/recon.pdf, REcon, 2015.
[3] Check Point, “Cyber security report,” https://pages.checkpoint.com/cyber-security-report-2022, [Accessed October 2022].
[4] HP Threat Research Blog, “PDF malware is not yet dead,” https://threatresearch.ext.hp.com/pdf-malware-is-not-yet-dead, [Accessed October 2022].
[5] D. Lukan, “PDF File Format: Basic Structure [updated 2020],” Infosec Resources, 27 May 2021, resources.infosecinstitute.com/topic/pdf-file-format-basic-structure/, [Accessed 9 Nov. 2022].
[6] N. Fleury, T. Dubrunquez, and I. Alouani, “PDF-Malware: An overview on threats, detection, and evasion attacks,” https://arxiv.org/pdf/2107.12873.pdf, Jul. 27, 2021.
[7] R.J. Fedor, “Botched Redaction: IRS audit reveals Bristol Myers offshore tax fight,” https://www.fedortax.com/blog/botched-redaction-irs-audit-reveals-bristol-myers-offshore-tax-fight, [Accessed April 2022].
[8] S. Adhatarao and C. Lauradoux, “Exploitation and sanitization of hidden data in PDF files,” https://arxiv.org/abs/2103.02707, Mar. 2021.
[9] “An examination of the redaction functionality of Adobe Acrobat Pro DC 2017,” https://www.cyber.gov.au/acsc/view-all-content/publications/examination-redaction-functionality-adobe-acrobat-pro-dc-2017, [Accessed April 2022, Last Updated Oct. 2021].
[10] C. Cimpanu, “New 'Shadow Attack' can replace content in digitally signed PDF files,” https://www.zdnet.com/article/new-shadow-attack-can-replace-content-in-digitally-signed-pdf-files/, [Accessed April 2022].
[11] “PDF insecurity,” https://pdf-insecurity.org/, [Accessed April 2022].
[12] S. Rohlmann, V. Mladenov, C. Mainka, and J. Schwenk, “Breaking the specification: PDF certification,” 2021 IEEE Symposium on Security, May 2021.
[13] L. Schroeder, “Is the information you just redacted really gone?” PDF Association, https://www.pdfa.org/is-the-information-you-just-redacted-really-gone/, [Accessed April 2022].
[14] A. Blonce, E. Filiol, and L. Frayssignes, “Portable document format (pdf) security analysis and malware threats,” in Presentations of Europe BlackHat 2008 Conference, 2008.
[15] P. Laskov and N. Šrndić, “Static detection of malicious JavaScript-bearing PDF documents,” in Proceedings of the 27th annual computer security applications conference, 2011.
[16] “ISO - ISO 32000-1:2008,” https://www.iso.org/standard/51502.html, [Accessed April 2022].
[17] K. Liu, “Dig into the attack surface of PDF and gain 100 CVEs in 1 Year,” https://www.blackhat.com/docs/asia-17/materials/asia-17-Liu-Dig-Into-The-Attack-Surface-Of-PDF-And-Gain-100-CVEs-In-1-Year-wp.pdf, Blackhat 2017, [Accessed April 2022].
[18] S. Martinez, “Analysis of a use-after-free vulnerability in Adobe Acrobat Reader DC,” https://blog.exodusintel.com/2021/04/20/analysis-of-a-use-after-free-vulnerability-in-adobe-acrobat-reader-dc/, [Accessed April 2022].
[19] PDF Association, “PDF-issues,” https://github.com/pdf-association/pdf-issues, [Accessed April 2022].
[20] F. Momot, S. Bratus, S. M. Hallberg, and M. L. Patterson, “The seven turrets of babel: A taxonomy of langsec errors and how to expunge them,” in 2016 IEEE Cybersecurity Development (SecDev). IEEE, 2016.
[21] M. Robinson, “Looking for non-compliant documents using error messages from multiple parsers,” 2021 LangSec Workshop, May 2021.
[22] M. Robinson, L. Li, C. Anderson, and S. Huntsman, “Statistical detection of format dialects using weighted Dowker complex,” https://arxiv.org/pdf/2201.08267.pdf, Jan. 2022.
[23] Stanford University, “Protégé,” https://protege.stanford.edu/, [Accessed May 2022].
[24] “What are ontologies?” https://www.ontotext.com/knowledgehub/fundamentals/what-are-ontologies, [Accessed April 2022].
[25] “OWL 101,” https://cambridgesemantics.com/blog/semantic-university/learn-owl-rdfs/owl-101, [Accessed April 2022].
[26] T. Gruber, “What is an ontology?” http://www-ksl.stanford.edu/kst/what-is-an-ontology.html, [Accessed April 2022].
[27] “Govdocs,” https://digitalcorpora.org/corpora/files, [Accessed April 2022].
[28] “Common Crawl,” http://commoncrawl.org, [Accessed April 2022].
[29] N. Ma’mun, “The grammatical errors on the paragraph writings,” Jurnal Vision, Vol. 5 Num 1, April 2016.
[30] J. Pardosi, R. Veronika Karo, O. Sijabat, et al., “An error analysis of students in writing narrative text,” Linguistic, English, Education, and Art Journal, Vol 3 Num 1, Dec. 2019.
[31] NIST National Vulnerability Database, “CVE-2022-24091 detail,” https://nvd.nist.gov/vuln/detail/CVE-2022-24091, [Accessed October 2022].
[32] PDF Association, “PDF/A FAQ,” https://www.pdfa.org/pdfa-faq/, [Accessed April 2022].
[33] P. Wyatt, “Arlington PDF model,” https://github.com/pdf-association/arlington-pdf-model, [Accessed April 2022].
[34] J. Zhang, “MLPdf: An effective machine learning based approach for PDF malware detection,” https://arxiv.org/pdf/1808.06991.pdf, Aug. 21, 2018.
[35] M. Elingiusti, L. Aniello, L. Querzoni, and R. Baldoni, “PDF-malware detection: a survey and taxonomy of current techniques,” https://core.ac.uk/download/pdf/188824539.pdf, 2018.
[36] J. Muller, D. Noss, C. Mainka, V. Mladenov, and J. Schwenk, “Portable document flaws 101,” https://i.blackhat.com/USA-20/Thursday/us-20-Mueller-Portable-Document-Flaws-101.pdf, Blackhat USA, 2020.
[37] D. Stevens, “Test file: PDF with embedded DOC dropping EICAR,” 28 Aug. 2015, blog.didierstevens.com/2015/08/28/test-file-pdf-with-embedded-doc-dropping-eicar/.
[38] “Attacks on PDF signatures,” https://pdf-insecurity.org/download/exploits-shadow/replace.zip, [Accessed November 2022].