This document describes how libpdf++ generates a PDF file, the structure of a PDF file, and my interpretation of the PDF specification.
Generally, the tokens in a PDF file are arranged into lines. Lines are separated by an end-of-line marker, which can be a carriage return (0x0d), a line feed (0x0a), or a carriage return followed by a line feed. PDF files can also contain binary data, but normally only in stream objects, which will be explained later.
A PDF file is made up of four parts: the header, the body, the cross-reference table and the trailer.
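The first part is the header. Its second line is a comment containing 4 bytes of binary data. The PDF specification recommends this comment to ensure the file is treated as a binary file rather than a text file. In libpdf++, the values of these bytes are FA, CE, BE, AD (hexadecimal).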
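The two header lines can be sketched as follows. This is only an illustration, not the actual libpdf++ code: write_header is a hypothetical helper, and the version string "1.4" is an assumption (the first line of a PDF file is "%PDF-" followed by the version, as defined by the PDF specification). The four bytes are the ones mentioned above.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of libpdf++): writes the two header lines.
// The default version "1.4" is an assumption made for this sketch.
void write_header(std::ostream& os, const std::string& version = "1.4")
{
    assert(!version.empty());

    // First line: the PDF version marker.
    os << "%PDF-" << version << '\n';

    // Second line: a comment ('%') followed by four bytes with values
    // of 0x80 or above, so that tools treat the file as binary.
    const char binary_marker[] = { '\xFA', '\xCE', '\xBE', '\xAD' };
    os << '%';
    os.write(binary_marker, sizeof(binary_marker));
    os << '\n';
}
```

Calling write_header on a std::ostringstream and inspecting the resulting string is an easy way to check the exact bytes that are produced.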
The body of a PDF file contains a sequence of PDF objects, which describe the contents of the document. Each object is written in the following format:
1 0 obj /Type /Page endobj
The above represents an object with ID 1 (the 0 that follows the ID is the generation number, which is described with the cross-reference table). The content of the object is placed between "1 0 obj" and "endobj". The whole document is made up of a sequence of such objects. The meaning of these PDF objects will be discussed later.
The third part of the file is the cross-reference table. This section stores the byte offsets of all PDF objects in the body. In libpdf++, the offset of an object is obtained by calling ostream::tellp() just before writing the object to the ostream. As a result, if the ostream refers to something that is not a regular file, such as cout, ostream::tellp() will fail. In this case, libpdf++ cannot calculate the object offsets.
The offset of each PDF object in the body appears as a row in the cross-reference table, like this:
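The offset bookkeeping can be sketched like this. It is a minimal illustration under my own assumptions, not the real libpdf++ implementation: write_object is a hypothetical name, the generation number is fixed to 0, and the function simply returns -1 when tellp() cannot report a position.

```cpp
#include <cassert>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not the libpdf++ API): writes one object and
// returns the byte offset at which it starts, or -1 if the stream
// cannot report its position (tellp() failed).
std::streamoff write_object(std::ostream& os, int id, const std::string& content)
{
    assert(id > 0);

    // Ask for the position BEFORE writing; this is the value that
    // later goes into the cross-reference table.
    const std::ostream::pos_type pos = os.tellp();
    if (pos == std::ostream::pos_type(-1))
        return -1;  // no usable offset, so nothing is written

    os << id << " 0 obj\n" << content << "\nendobj\n";
    return std::streamoff(pos);
}
```

The returned offsets would be collected, for example in a std::vector, and written out once all objects are in place. Whether tellp() works on cout depends on the platform and on where standard output is redirected, so the -1 case should always be checked.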
0000000100 00000 n <EOL>
The first column in the table is the offset. It must be 10 digits long and padded with zeros. The second column is the generation number. The PDF specification has a good description of it, but I don't fully understand it. In libpdf++ I have hard-coded it to zero and it works. The third column is a keyword that marks the entry as in use ('n') or free ('f'); it is hard-coded to 'n' in libpdf++. Please note that there must be a space between the 'n' and the "end of line" (EOL) character when the EOL is a single character, because each entry must be exactly 20 bytes long. Otherwise the file cannot be parsed.
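Formatting one row of the table can be sketched as below. The function name format_xref_entry is hypothetical and the code is not taken from libpdf++. It assumes a single line feed as the EOL, so the space after the 'n' is required to bring the row to 20 bytes.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats one in-use cross-reference entry.
// Layout: 10-digit offset, space, 5-digit generation, space, 'n',
// space, line feed. That is 20 bytes in total.
std::string format_xref_entry(unsigned long long offset, unsigned generation = 0)
{
    assert(offset <= 9999999999ULL);  // must fit in 10 digits
    assert(generation <= 99999u);     // must fit in 5 digits

    std::ostringstream os;
    os << std::setfill('0')
       << std::setw(10) << offset << ' '
       << std::setw(5) << generation << ' '
       << "n \n";  // note the space between 'n' and the line feed
    return os.str();
}
```

For example, format_xref_entry(100) yields the row shown above, with the EOL written as a single line feed.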
The fourth part of the file is the trailer. It enables an application reading the file to quickly find the cross-reference table and certain special objects. The trailer has the following form:
trailer
<<
key1 value1
key2 value2
...
keyn valuen
>>
startxref
Byte_offset_of_last_cross-reference_section
%%EOF
The key-value pairs specify the "special objects" in the PDF file. They include:
Key | Type | Value |
Size | integer | The total number of entries in the file's cross-reference table. |
Root | dictionary | The catalog dictionary for the PDF document contained in the file. |
Info | dictionary | The document's information dictionary. |
In libpdf++, the Root and Info fields of the trailer are indirect references to the catalog and info objects in the PDF file body. Their format will be discussed in the document structure section.
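Putting the trailer together can be sketched as follows. Again this is only an illustration under stated assumptions: write_trailer and its parameters are hypothetical names, not the libpdf++ interface, and generation 0 is assumed for both references. An indirect reference is written as the object ID, the generation number and the letter R, and the file ends with the "%%EOF" marker.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes the trailer dictionary, the offset of
// the cross-reference table and the end-of-file marker.
void write_trailer(std::ostream& os, int size, int root_id, int info_id,
                   long long xref_offset)
{
    assert(size > 0 && root_id > 0 && info_id > 0 && xref_offset >= 0);

    os << "trailer\n<<\n"
       << "/Size " << size << '\n'        // entries in the xref table
       << "/Root " << root_id << " 0 R\n" // indirect ref to the catalog
       << "/Info " << info_id << " 0 R\n" // indirect ref to the info dict
       << ">>\nstartxref\n"
       << xref_offset << '\n'             // where the xref table begins
       << "%%EOF\n";
}
```

The xref_offset argument would be obtained the same way as the object offsets, by calling tellp() just before the cross-reference table is written.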