PDF Implementation Note

This document describes how libpdf++ generates a PDF file, the PDF file structure, and my interpretation of the PDF specification.

Generally, the tokens in a PDF file are arranged into lines. Lines are separated by an end-of-line marker: either a carriage return (0x0d) or a line feed (0x0a) can be used as the line separator, or both together. PDF files can also contain binary data, but normally only in stream objects, which will be explained later.
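The line-separator rule above can be sketched as a small splitter. This is only an illustration for text outside stream objects; split_lines is a hypothetical name and is not part of libpdf++.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not a libpdf++ function): splits text into lines,
// accepting CR, LF, or CR LF as the separator.
std::vector<std::string> split_lines(const std::string& text)
{
    std::vector<std::string> lines;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        const char c = text[i];
        if (c == '\r' || c == '\n')
        {
            // Treat the pair CR LF as a single separator.
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
                ++i;
            lines.push_back(current);
            current.clear();
        }
        else
        {
            current += c;
        }
    }
    // Keep a final line that has no trailing separator.
    if (!current.empty())
        lines.push_back(current);
    return lines;
}
```

Binary stream data must not be passed through a splitter like this, because bytes 0x0d and 0x0a may occur there as ordinary data.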

File Structure

A PDF file is made up of four parts:

A one-line header identifying the version of the PDF specification to which the file conforms.
A body containing the objects that make up the document contained in the file.
A cross-reference table containing information about the indirect objects in the file.
A trailer giving the location of the cross-reference table and of certain special objects within the body of the file.

In libpdf++, the header is hard-coded to "%PDF-1.4", indicating that the file uses the PDF 1.4 specification. In a PDF file, a line that starts with a "%" character is considered a comment, so the header is itself a comment line.

The second line is a comment containing four bytes. It is recommended by the PDF specification to ensure the file is treated as a binary file; the bytes should have values of 128 or greater. In libpdf++, the values of the bytes are FA, CE, BE, AD (hexadecimal).
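The two header lines could be written roughly as follows. This is a sketch and not the actual libpdf++ code: write_pdf_header is a hypothetical name, and it assumes the version line carries the leading "%" that the PDF specification requires.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not a libpdf++ function): writes the two header lines.
void write_pdf_header(std::ostream& os)
{
    // Line 1: the version header. The leading '%' makes it a comment line.
    os << "%PDF-1.4\n";

    // Line 2: a comment holding the four bytes FA CE BE AD, all of which
    // have the high bit set, so that tools treat the file as binary.
    const char marker[] = { '%', '\xFA', '\xCE', '\xBE', '\xAD', '\n' };
    os.write(marker, sizeof marker);
}
```

When the target is a file, it should be opened with std::ios::binary so that the platform does not translate the line endings and shift the byte offsets.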

The body of a PDF file contains a sequence of PDF objects. These objects describe the contents of the document. Each object has the form:

	1 0 obj
	<< /Type /Page >>
	endobj

The above represents an object with ID 1. The "0" after the ID is the generation number, and the content of the object is placed between "1 0 obj" and "endobj". The whole document is made up of a sequence of such objects. The meaning of these PDF objects will be discussed later.
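As an illustration of this framing, a writer might wrap already-serialised content like this. The function name write_object is hypothetical and the generation number is assumed to be always 0, matching what libpdf++ hard-codes in the cross-reference table.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not a libpdf++ function): wraps the serialised
// content of one object in its "N 0 obj" / "endobj" frame.
void write_object(std::ostream& os, unsigned id, const std::string& content)
{
    os << id << " 0 obj\n"    // object ID, then generation number 0
       << content << '\n'     // the object's content, supplied by the caller
       << "endobj\n";
}
```

For example, write_object(out, 1, "<< /Type /Page >>") produces the three lines of an object with ID 1 whose content is a dictionary.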

The third part of the file is the cross-reference table. This section stores the byte offset of every PDF object in the body. In libpdf++, the offset of an object is obtained by calling ostream::tellp() before writing the object to the ostream object. As a result, if the ostream object refers to cout and cout is not attached to a regular file (for example, a terminal or a pipe), ostream::tellp() will fail. In this case, libpdf++ cannot calculate the object offsets.
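The offset bookkeeping can be sketched as below. ObjectWriter is a hypothetical class written for this note, not the real libpdf++ interface; it assumes object IDs are handed out as 1, 2, 3 and so on, and it reports a failed tellp() by throwing, which is one possible policy among several.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical writer (not the libpdf++ API): remembers the byte offset
// at which each object starts.
class ObjectWriter
{
public:
    explicit ObjectWriter(std::ostream& os) : m_os(os) {}

    // Writes one object and returns its ID (1, 2, 3, ...).
    std::size_t write(const std::string& content)
    {
        // Ask the stream for the current position before writing anything.
        const std::ostream::pos_type pos = m_os.tellp();

        // tellp() yields pos_type(-1) when the position cannot be determined,
        // for example on a stream that is not seekable.
        if (pos == std::ostream::pos_type(-1))
            throw std::runtime_error("cannot determine the object offset");

        m_offsets.push_back(static_cast<std::streamoff>(pos));
        const std::size_t id = m_offsets.size();
        m_os << id << " 0 obj\n" << content << "\nendobj\n";
        return id;
    }

    // Offsets in ID order: offsets()[0] belongs to object 1.
    const std::vector<std::streamoff>& offsets() const { return m_offsets; }

private:
    std::ostream&               m_os;
    std::vector<std::streamoff> m_offsets;
};
```

An alternative that also works for unseekable output is to count the bytes written instead of calling tellp(), at the cost of routing every write through the counter.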

The offset of each PDF object in the PDF file body will appear as a row in the cross reference table like this:

	0000000100 00000 n <EOL>

The first column in the table is the offset. It must be 10 digits long, padded with leading zeros. The second column is the generation number. The PDF specification describes it well, but I do not fully understand it; in libpdf++ I have hard-coded it to zero and it works. The third column is a keyword, which is hard-coded to 'n' (entry in use) in libpdf++. Please note that there must be a space between the 'n' and the "end of line" (EOL) character, or else the file cannot be parsed.
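The row format can be sketched with the standard stream manipulators. The helper name xref_entry is hypothetical. The PDF specification requires each entry to be exactly 20 bytes including the end-of-line marker, which is why a one-character EOL has to be preceded by a space.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not a libpdf++ function): formats one in-use entry
// of the cross-reference table.
std::string xref_entry(unsigned long offset)
{
    std::ostringstream ss;
    ss << std::setfill('0')
       << std::setw(10) << offset << ' '   // byte offset, 10 digits
       << std::setw(5)  << 0      << ' '   // generation number, always 0 here
       << "n \n";                          // keyword, space, one-character EOL
    return ss.str();
}
```

Called with an offset of 100, this yields the row shown above with a line feed in place of <EOL>.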

The fourth part of the file is the trailer. It enables an application reading the file to quickly find the cross-reference table and certain special objects. The trailer has the form:

	trailer
	<<	key1 value1
		key2 value2 
		...
		keyn valuen
	>>
	startxref
	Byte_offset_of_last_cross-reference_section
	%%EOF

The key-value pairs specify the "special objects" in the PDF file. They include:

Key	Type	Value
Size	integer	The total number of entries in the file's cross-reference table.
Root	dictionary	The catalog dictionary for the PDF document contained in the file.
Info	dictionary	The document's information dictionary.

In libpdf++, the Root and Info fields of the trailer are indirect references to the catalog and info objects in the PDF file body. Their format will be discussed in the document structure section.
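A trailer with these three keys might be emitted as follows. This is a sketch under assumptions: write_trailer and its parameters are hypothetical names, the Root and Info values are written as indirect references of the form "N 0 R", and the file ends with the "%%EOF" marker required by the PDF specification.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not a libpdf++ function): writes the trailer
// dictionary, the offset of the cross-reference table and the end marker.
void write_trailer(std::ostream& os,
                   unsigned size,        // number of cross-reference entries
                   unsigned root_id,     // object ID of the catalog
                   unsigned info_id,     // object ID of the info dictionary
                   unsigned long xref_offset)
{
    os << "trailer\n"
       << "<<\t/Size " << size << '\n'
       << "\t/Root " << root_id << " 0 R\n"   // indirect reference
       << "\t/Info " << info_id << " 0 R\n"   // indirect reference
       << ">>\n"
       << "startxref\n"
       << xref_offset << '\n'
       << "%%EOF\n";
}
```

The value passed as xref_offset has to be the stream position recorded just before the cross-reference table was written.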

Generated on Sun Feb 2 09:17:07 2003 for libpdf++ by Doxygen 1.2.16
