PDF file format: Internal Document Structure Explained

Posted On: 11/13/2022


By: Dizdar Senad

Filed Under: PDF format (https://www.save-emails-as-pdf.com/news/category/pdf-format/)

The PDF file format has a basic structure that consists of a header, a body, a cross-reference
table, and a trailer. The header states the version of the PDF specification that the file uses.
The body contains the actual content of the file, such as text, images, and other media. The
cross-reference table records where each object is located in the file. The trailer tells a
reader where the cross-reference table starts and points to special objects such as the document
catalog and the information dictionary, which holds metadata such as the author and the
creation date.

PDF supports more than text: a document can include images and other multimedia elements, be
password protected, execute JavaScript, and so on. The basic structure of a PDF file is
presented in the picture below:

PDF Header
The header specifies the version number of the PDF specification used in the document.

This can be found by using a hex editor or the xxd command:

$ xxd temp.pdf | head -n 1


0000000: 2550 4446 2d31 2e33 0a25 c4e5 f2e5 eba7 %PDF-1.3.%……

The temp.pdf document uses PDF specification 1.3. The ‘%’ character introduces a comment in
PDF, so the header line is itself a comment, and the second line of most PDF documents is one too.
The bytes 2550 4446 2d31 2e33 0a25 from the output above correspond to the ASCII text
“%PDF-1.3.%”, where the ‘.’ after the version number stands for the newline byte 0a. They are
followed by bytes with values above 127 (c4e5 f2e5 eba7), which xxd also prints as ‘.’ dots.
These bytes are usually there to tell software that the file contains binary data and shouldn’t
be treated as 7-bit ASCII text. Version numbers in the 1.x series are of the form 1.N, where N
ranges from 0 to 7; PDF 2.0 followed with ISO 32000-2.
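
The same check can be done in a few lines of Python. This is only a minimal sketch: the function name pdf_header_version is made up for illustration, and the sample bytes are the ones from the xxd output above rather than a real file.

```python
import re

def pdf_header_version(data: bytes):
    """Return the version from a '%PDF-x.y' header, or None if absent."""
    # The header is expected at the very start of the file.
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None

# First 16 bytes of temp.pdf, copied from the xxd output above.
sample = bytes.fromhex("255044462d312e330a25c4e5f2e5eba7")
version = pdf_header_version(sample)
print(version)

# The second line is a comment made of bytes above 127, which hints
# that the file holds binary data.
second_line = sample.split(b"\n", 1)[1]
has_binary_marker = second_line.startswith(b"%") and any(
    byte > 127 for byte in second_line[1:]
)
print(has_binary_marker)
```

For a file on disk, reading the first few bytes with open(path, "rb").read(16) and passing them to the function should be enough.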

Body of PDF document


The Body section is used to hold all the document’s data that is being shown to the user.
In other words, the body of the PDF document contains objects such as text streams, images,
other multimedia elements, etc.

xref table
This is the cross-reference table. Each object in the document has an entry in this table, which
allows for quick and random access to objects in the file. There is no need to read through the
entire PDF document just to locate a specific object. Every entry in the table is 20 bytes long.

Here is an example:

xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 4
0000025518 00002 n
0000025632 00000 n
0000000024 00001 f
0000000000 00001 f
36 1
0000026900 00000 n

The cross-reference table of a PDF document is located near the end of the file, just before the
trailer. You can investigate it by opening the PDF in a text editor, such as vi
(https://en.wikipedia.org/wiki/Vi). You will need to scroll to the bottom of the document to see it.

In the example above, there are four subsections (note the four lines that only contain two
numbers). The first number in those lines is the object number of the first object in the
subsection, while the second number states the number of objects in the subsection. Each object
is represented by one entry, 20 bytes in total (including the two-byte end-of-line marker).

The first 10 bytes are a zero-padded number. For an object in use, it is the offset from the
start of the PDF document to the beginning of that object. For a free object, it is the object
number of the next free object.

What follows is a space separator and a five-digit number called the object’s “generation
number”. After that, there is another space separator, followed by the letter “f” or “n” to
indicate whether the object is free or in use.
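
The fixed 20-byte layout makes an entry easy to take apart by position. The sketch below assumes the entry ends with a two-byte end-of-line marker; the names XrefEntry and parse_xref_entry are hypothetical and not part of any PDF library.

```python
from typing import NamedTuple

class XrefEntry(NamedTuple):
    offset: int      # byte offset if in use, next free object number if free
    generation: int  # the object's generation number
    in_use: bool     # True for "n", False for "f"

def parse_xref_entry(raw: bytes) -> XrefEntry:
    """Split one 20-byte cross-reference entry into its three fields."""
    if len(raw) != 20:
        raise ValueError("a cross-reference entry must be exactly 20 bytes")
    offset = int(raw[0:10])        # 10 digits
    generation = int(raw[11:16])   # 5 digits, after one space
    flag = raw[17:18]              # "n" or "f", after one space
    if flag not in (b"n", b"f"):
        raise ValueError("unknown entry flag")
    return XrefEntry(offset, generation, flag == b"n")

# Entry for object 3 from the example above, with an assumed CRLF ending.
entry = parse_xref_entry(b"0000025324 00000 n\r\n")
print(entry)
```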

Object 0 always has an entry with generation number 65535. It represents the head of the list
of free objects (the letter “f” means free). In this example, the last object in the
cross-reference table (object 36) has a generation number equal to 0.

Subsection 2 starts at object number 3 and contains one entry: object 3, which starts at offset
25324 bytes from the beginning of the document. Subsection 3 has four objects. The first has
ID 21 and starts at offset 25518 from the beginning of the file. The remaining objects have
IDs 22, 23, and 24 respectively.

Every object in a file is assigned a flag that indicates whether the object is currently being used
(“n” for “valid and used”) or not (“f” for “free”). Free objects contain references to the next free
object, as well as the generation number that should be applied if the object becomes valid
again. This helps to ensure that every part of the file is accounted for.

Object zero points to the next free object in the table, object 23. Object 23 is also free and
points to the next free object, object 24. Object 24 is the last free object, so it points back
to object zero, which closes the list.
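
The chain of free objects can be followed like a linked list. Here is a rough sketch that assumes the table has already been loaded into a dictionary mapping object numbers to (first field, generation, flag) tuples; that layout and the name free_list are choices made for this illustration only.

```python
def free_list(entries: dict) -> list:
    """Follow the free-object chain from object 0 until it loops back to 0."""
    chain = []
    current = entries[0][0]  # object 0 is the head of the list
    while current != 0:
        if current in chain:
            raise ValueError("free list contains a loop")
        first_field, _generation, flag = entries[current]
        if flag != "f":
            raise ValueError(f"object {current} is in the free list but in use")
        chain.append(current)
        current = first_field  # for a free entry this is the next free object
    return chain

# Entries copied from the example cross-reference table.
entries = {
    0: (23, 65535, "f"),
    3: (25324, 0, "n"),
    21: (25518, 2, "n"),
    22: (25632, 0, "n"),
    23: (24, 1, "f"),
    24: (0, 1, "f"),
    36: (26900, 0, "n"),
}
free_objects = free_list(entries)
print(free_objects)
```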

The cross-reference table would look like this if every number was represented:

xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 1
0000025518 00002 n
22 1
0000025632 00000 n
23 1
0000000024 00001 f
24 1
0000000000 00001 f
36 1
0000026900 00000 n
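
Assigning an object number to every entry, as in the expanded table, can be sketched as follows. The parser below is a simplified illustration that works on the text of one cross-reference section; the name parse_xref_section is hypothetical, and a real reader would work on bytes and validate far more.

```python
def parse_xref_section(text: str) -> dict:
    """Map each object number to its (first field, generation, flag) entry."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("section must start with the xref keyword")
    table = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number and number of entries.
        first_object, count = map(int, lines[i].split())
        i += 1
        for object_number in range(first_object, first_object + count):
            first_field, generation, flag = lines[i].split()
            table[object_number] = (int(first_field), int(generation), flag)
            i += 1
    return table

XREF_TEXT = """
xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 4
0000025518 00002 n
0000025632 00000 n
0000000024 00001 f
0000000000 00001 f
36 1
0000026900 00000 n
"""

table = parse_xref_section(XREF_TEXT)
for object_number in sorted(table):
    print(object_number, table[object_number])
```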

The generation number of an object is incremented when the object is removed (its flag changes
from ‘n’ to ‘f’), and the free entry stores the generation number to use if the object number
is reused. So, if object 23 becomes valid again, the generation number will still be 1.
However, if it is removed again, the generation number would increase to 2.

If a PDF document has been incrementally updated, it will usually contain multiple subsections.
Otherwise, it should only contain one subsection starting with the number zero.

Trailer
All PDF readers should start reading a PDF from the end of the file. This is because the PDF
trailer gives the application reading the PDF document the location of the cross-reference
table and other special objects.

Here is an example of a trailer:

trailer
<< /Size 22 /Root 2 0 R /Info 1 0 R >>
startxref
24212
%%EOF

The last line of the PDF document contains the end-of-file marker “%%EOF”. The offset from the
beginning of the file to the cross-reference table is given on the line after the “startxref”
keyword, which appears just before the end-of-file marker. Our cross-reference table starts at
offset 24212 bytes. This is preceded by the “trailer” keyword, which marks the start of the
trailer section. The contents of the trailer section are enclosed within << and >> characters
(i.e., key-value pairs in dictionary format).

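The trailer section defines several keys, such as “/Size” (the total number of entries in the
cross-reference table), “/Root” (a reference to the document catalog), “/Info” (a reference to
the document information dictionary) and similar.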
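
Reading from the end of the file can be sketched like this. The function name find_startxref and the 1024-byte window are assumptions made for this example; real readers tend to be more tolerant of trailing whitespace and damaged files.

```python
import re

def find_startxref(data: bytes) -> int:
    """Return the cross-reference offset that precedes the final %%EOF."""
    tail = data[-1024:]  # assumed window; the keyword sits near the end
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref keyword found near the end of the data")
    return int(matches[-1])  # the last match belongs to the newest trailer

# The tail of a file, using the trailer from the example above.
tail_of_file = (
    b"trailer\n"
    b"<< /Size 22 /Root 2 0 R /Info 1 0 R >>\n"
    b"startxref\n"
    b"24212\n"
    b"%%EOF\n"
)
xref_offset = find_startxref(tail_of_file)
print(xref_offset)
```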

Incremental updates
PDFs are designed for incremental updates, meaning we can append new objects to the end of
the file without rewriting the whole document. This makes saving changes to a PDF quick and
easy. The new structure of the PDF document is illustrated below:
The PDF document still contains the original header, body, cross-reference table and trailer.
However, there are also additional body, cross-reference and trailer sections present which
contain information on objects that have been changed, replaced or deleted. Deleted objects
remain in the file but are marked with an “f” flag. Each trailer is terminated by the “%%EOF” tag,
and each trailer added by an update includes a /Prev entry pointing to the previous
cross-reference section.
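
The /Prev entries form a chain from the newest cross-reference section back to the original one. The sketch below follows that chain on a tiny hand-built byte string. It is a toy and not a complete, valid PDF: the name xref_chain is hypothetical, and the search for ">>" assumes the trailer dictionary contains no nested dictionaries.

```python
import re

def xref_chain(data: bytes) -> list:
    """Return cross-reference offsets, newest first, by following /Prev."""
    found = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not found:
        raise ValueError("no startxref keyword found")
    offset = int(found[-1])
    chain = []
    while offset is not None and offset not in chain:
        chain.append(offset)
        # The trailer dictionary follows the section at this offset.
        trailer_pos = data.find(b"trailer", offset)
        if trailer_pos < 0:
            break
        dict_end = data.find(b">>", trailer_pos)
        prev = re.search(rb"/Prev\s+(\d+)", data[trailer_pos:dict_end])
        offset = int(prev.group(1)) if prev else None
    return chain

# Original revision: one object, one cross-reference section, one trailer.
original = b"%PDF-1.3\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref1 = len(original)
original += (
    b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n" + str(xref1).encode() + b"\n%%EOF\n"
)

# Incremental update: a new object plus its own xref section and trailer.
new_object_offset = len(original)
update_body = b"2 0 obj\n(added later)\nendobj\n"
xref2 = new_object_offset + len(update_body)
updated = original + update_body + (
    b"xref\n0 1\n0000000000 65535 f \n2 1\n"
    + f"{new_object_offset:010d}".encode() + b" 00000 n \n"
    b"trailer\n<< /Size 3 /Root 1 0 R /Prev " + str(xref1).encode() + b" >>\n"
    b"startxref\n" + str(xref2).encode() + b"\n%%EOF\n"
)

print(xref_chain(original))
print(xref_chain(updated))
```

Note that the original bytes are left untouched at the start of the updated data; the update only appends to them.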

NOTE: PDF documents with versions 1.4 and higher can specify a /Version entry in the
document’s catalog dictionary. If that version is later than the one in the PDF header, it
overrides the header. This allows an incremental update to move the document to a newer
version without rewriting the header at the start of the file.