You can use pdftk to fill out PDF forms (thanks for the inspiration, Joe Rothweiler). The syntax is simple:
$ pdftk input.pdf fill_form data.fdf output output.pdf
where input.pdf is the input PDF containing the form, data.fdf is an FDF or XFDF file containing your data, and output.pdf is the name of the PDF you're creating. The tricky part is figuring out what to put in data.fdf. There's a useful comparison of the Forms Data Format (FDF) and its XML version (XFDF) in the XFDF specification. XFDF only covers a subset of FDF, so I won't worry about it here. FDF is defined in section 12.7.7 of ISO 32000-1:2008, the PDF 1.7 specification, and it has been in PDF specifications since version 1.2.
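If you want to drive pdftk from a script, the command above can be wrapped in a few lines of Python. This is only a sketch: the function names fill_form_command and fill_form are my own, and it assumes pdftk is on your PATH.

```python
import subprocess


def fill_form_command(input_pdf, fdf_path, output_pdf):
    """Build the pdftk argument list for the fill_form operation."""
    return ["pdftk", input_pdf, "fill_form", fdf_path, "output", output_pdf]


def fill_form(input_pdf, fdf_path, output_pdf):
    """Run pdftk (assumed to be on PATH); raises CalledProcessError on failure."""
    subprocess.run(fill_form_command(input_pdf, fdf_path, output_pdf), check=True)
```

Calling fill_form("input.pdf", "data.fdf", "output.pdf") should be equivalent to the shell command shown earlier.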
Forms Data Format (FDF)
FDF files are basically stripped down PDFs (§12.7.7.1). A simple FDF file will look something like:
%FDF-1.2
1 0 obj<</FDF<</Fields[
<</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
<</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
…
] >> >>
endobj
trailer
<</Root 1 0 R>>
%%EOF
Broken down into the lingo of ISO 32000, we have a header (§12.7.7.2.2):
%FDF-1.2
followed by a body with a single object (§12.7.7.2.3):
1 0 obj<</FDF<</Fields[
<</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
<</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
…
] >> >>
endobj
followed by a trailer (§12.7.7.2.4):
trailer
<</Root 1 0 R>>
%%EOF
Despite the claims in §12.7.7.2.1 that the trailer is optional, pdftk choked on files without it:
$ cat no-trailer.fdf
%FDF-1.2
1 0 obj<</FDF<</Fields[
<</T(Name)/V(Trevor)>>
<</T(Date)/V(2012-09-20)>>
] >> >>
endobj
$ pdftk input.pdf fill_form no-trailer.fdf output output.pdf
Error: Failed to open form data file:
data.fdf
No output created.
Trailers are easy to add, since all they require is a /Root reference to the FDF catalog dictionary. If you only have one dictionary, you can always use the simple trailer I gave above.
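Putting the header, body, and trailer together, a minimal generator might look like the following sketch. The helper names pdf_literal and build_fdf are hypothetical, and the code assumes plain ASCII field names and values; anything else needs one of the encodings discussed in the FDF Catalog section.

```python
def pdf_literal(text):
    """Wrap text as a PDF literal string, escaping backslashes and parentheses."""
    # Escape the backslash first so the escapes added for parentheses
    # are not themselves doubled.
    for char in ("\\", "(", ")"):
        text = text.replace(char, "\\" + char)
    return "(" + text + ")"


def build_fdf(fields):
    """Return FDF text for a dict mapping field names to values (ASCII assumed)."""
    lines = ["%FDF-1.2", "1 0 obj<</FDF<</Fields["]
    for name, value in fields.items():
        lines.append("<</T{}/V{}>>".format(pdf_literal(name), pdf_literal(value)))
    # Always emit the trailer, since pdftk rejected files without one.
    lines += ["] >> >>", "endobj", "trailer", "<</Root 1 0 R>>", "%%EOF"]
    return "\n".join(lines) + "\n"
```

For example, build_fdf({"Name": "Trevor", "Date": "2012-09-20"}) produces the no-trailer.fdf body from above with the trailer appended.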
FDF Catalog
The meat of the FDF file is the catalog (§12.7.7.3). Let's take a closer look at the catalog structure:
1 0 obj<</FDF<</Fields[
…
] >> >>
This defines a new object (the FDF catalog) which contains one key (the /FDF dictionary). The FDF dictionary contains one key (/Fields) and its associated array of fields. Then we close the /Fields array (]), close the FDF dictionary (>>) and close the FDF catalog (>>).
There are a number of interesting entries that you can add to the FDF dictionary (§12.7.7.3.1, table 243), some of which require a more advanced FDF version. You can use the /Version key in the FDF catalog (§12.7.7.3.1, table 242) to specify the version of the data in the dictionary:
1 0 obj<</Version/1.3/FDF<</Fields[…
Now you can extend the dictionary using table 244. Let's set things up to use UTF-8 for the field values (/V) or options (/Opt):
1 0 obj<</Version/1.3/FDF<</Encoding/utf_8/Fields[
<</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
<</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
…
] >> >>
endobj
pdftk understands raw text in the specified encoding ((…)), raw UTF-16 strings starting with a BOM ((\xFE\xFF…)), or UTF-16BE strings encoded as ASCII hex (<FEFF…>). You can use pdf-merge.py and its --unicode option to find the latter. Support for the /utf_8 encoding in pdftk is new. I mailed a patch to pdftk's Sid Steward and posted a patch request to the underlying iText library. Until those get accepted, you're stuck with the less convenient encodings.
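If you are stuck with the hex form, it is easy to compute by hand. Here is a small sketch of how a value could be turned into a BOM-prefixed UTF-16BE hex string; the function name utf16be_hex is my own, and this is not how pdf-merge.py does it.

```python
import codecs


def utf16be_hex(text):
    """Encode text as a PDF hex string: '<', BOM + UTF-16BE bytes in hex, '>'."""
    data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"
```

The result replaces the parenthesized string in a field entry, e.g. <</T(Name)/V<FEFF00E9>>> for a value of "é".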
Fonts
Say you fill in some Unicode values, but your PDF reader is having trouble rendering some funky glyphs. Maybe it doesn't have access to the right font? You can see which fonts are embedded in a given PDF using pdffonts.
$ pdffonts input.pdf
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
MMXQDQ+UniversalStd-NewswithCommPi CID Type 0C yes yes yes 1738 0
MMXQDQ+ZapfDingbatsStd CID Type 0C yes yes yes 1749 0
MMXQDQ+HelveticaNeueLTStd-Roman Type 1C yes yes no 1737 0
CPZITK+HelveticaNeueLTStd-BlkCn Type 1C yes yes no 1739 0
…
If you don't have the right font for your new data, you can add it using current versions of iText. However, pdftk uses an older version, so I'm not sure how to translate this idea for pdftk.
FDF templates and field names
You can use pdftk itself to create an FDF template, which it writes with embedded UTF-16BE strings (you can see the FE FF BOMs at the start of each string value).
$ pdftk input.pdf generate_fdf output template.fdf
$ hexdump -C template.fdf | head
00000000 25 46 44 46 2d 31 2e 32 0a 25 e2 e3 cf d3 0a 31 |%FDF-1.2.%.....1|
00000010 20 30 20 6f 62 6a 20 0a 3c 3c 0a 2f 46 44 46 20 | 0 obj .<<./FDF |
00000020 0a 3c 3c 0a 2f 46 69 65 6c 64 73 20 5b 0a 3c 3c |.<<./Fields [.<<|
00000030 0a 2f 56 20 28 fe ff 29 0a 2f 54 20 28 fe ff 00 |./V (..)./T (...|
00000040 50 00 6f 00 73 00 74 00 65 00 72 00 4f 00 72 00 |P.o.s.t.e.r.O.r.|
…
You can also dump a more human-friendly version of the PDF's fields (without any default data):
$ pdftk input.pdf dump_data_fields_utf8 output data.txt
$ cat data.txt
---
FieldType: Text
FieldName: Name
FieldNameAlt: Name:
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: Date
FieldNameAlt: Date:
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: Advisor
FieldNameAlt: Advisor:
FieldFlags: 0
FieldJustification: Left
---
…
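The dump is regular enough to parse with a few lines of Python. The sketch below assumes the layout shown above ("---" separators and "Key: Value" lines); the name parse_field_dump is hypothetical, and keys that repeat within one field are collapsed to their last value.

```python
def parse_field_dump(text):
    """Parse dump_data_fields-style output into a list of dicts, one per field."""
    fields = []
    current = None
    for line in text.splitlines():
        if line.strip() == "---":
            # A separator starts a new field record.
            current = {}
            fields.append(current)
        elif current is not None and ": " in line:
            # Split on the first ": " only, so values may contain colons.
            key, value = line.split(": ", 1)
            current[key] = value
    # Drop the empty record created by a trailing separator.
    return [field for field in fields if field]
```

With data.txt from above, [f["FieldName"] for f in parse_field_dump(open("data.txt").read())] gives the list of field names to use in your FDF file.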
If the fields are poorly named, you may have to fill the entire form with unique values and then see which values appeared where in the output PDF (for an example, see codehero's identify_pdf_fields.js).
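A rough Python take on the same idea is to give every field a short, distinct marker. This is a sketch under my own assumptions (the function name unique_values and the F001-style markers are made up, and it only makes sense for text fields), not a port of identify_pdf_fields.js.

```python
def unique_values(field_names):
    """Map each field name to a short, distinct marker such as F001, F002, ..."""
    return {
        name: "F{:03d}".format(index)
        for index, name in enumerate(field_names, 1)
    }
```

Write the resulting mapping out as FDF, run fill_form, and look for each marker in the output PDF to see which field landed where.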
Conclusions
This would be so much easier if people just used YAML or JSON instead of bothering with PDFs ;).