Dates in the Info dictionary are written as UTF-16BE instead of ASCII/PDFDocEncoding, causing interoperability issues
When updating Info-dictionary date fields using update_info or update_info_utf8, pdftk-java writes the date as a UTF-16BE text string with a BOM instead of as an ASCII string. This breaks interoperability with ExifTool, which then does not normalize the date into its standard human-readable form.
Since the PDF date format uses only printable ASCII characters, UTF-16BE encoding is unnecessary and leads to inconsistent encoding within the same Info dictionary.
To reproduce, download https://pdfobject.com/pdf/sample.pdf and run:
$ pdftk sample.pdf update_info <(echo -e "InfoBegin\nInfoKey: CreationDate\nInfoValue: D:199812231952-08'00'") output sample_with_date.pdf
ExifTool shows ModifyDate (which was not changed) in normalized, human-readable form, but shows CreationDate as a raw PDF date string, which is much harder to read:
$ exiftool -a -G sample_with_date.pdf | grep "PDF.*Date"
[PDF] Modify Date : 2008:07:01 05:24:47Z
[PDF] Create Date : D:199812231952-08'00'
mutool prints CreationDate as a hex string whose bytes are UTF-16BE text with a leading BOM:
$ mutool show sample_with_date.pdf trailer/Info | grep "Date"
/ModDate (D:20080701052447Z00'00')
/CreationDate <FEFF0044003A003100390039003800310032003200330031003900350032002D003000380027003000300027>
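As a quick cross-check, the hex string printed by mutool can be decoded by hand. The following is a minimal sketch in Python (not part of pdftk-java); the variable names are illustrative only.

```python
# Hex digits as printed by mutool for /CreationDate, without the angle brackets.
hex_value = (
    "FEFF0044003A003100390039003800310032003200330031003900350032"
    "002D003000380027003000300027"
)

raw = bytes.fromhex(hex_value)

# A leading FE FF is the byte-order mark that marks a PDF text string as UTF-16BE.
has_bom = raw[:2] == b"\xfe\xff"

# Everything after the BOM is the UTF-16BE payload.
decoded = raw[2:].decode("utf-16-be")

print(has_bom)
print(decoded)
```

The decoded payload is exactly the date passed to update_info, so the only difference from the expected output is the encoding.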
A hex dump confirms the presence of the UTF-16BE byte-order mark and null bytes:
$ hexdump -C sample_with_date.pdf | grep -A3 "Dat"
000045c0 20 28 50 61 67 65 73 29 0a 2f 4d 6f 64 44 61 74 | (Pages)./ModDat|
000045d0 65 20 28 44 3a 32 30 30 38 30 37 30 31 30 35 32 |e (D:20080701052|
000045e0 34 34 37 5a 30 30 27 30 30 27 29 0a 2f 43 72 65 |447Z00'00')./Cre|
000045f0 61 74 69 6f 6e 44 61 74 65 20 28 fe ff 00 44 00 |ationDate (...D.|
00004600 3a 00 31 00 39 00 39 00 38 00 31 00 32 00 32 00 |:.1.9.9.8.1.2.2.|
00004610 33 00 31 00 39 00 35 00 32 00 2d 00 30 00 38 00 |3.1.9.5.2.-.0.8.|
00004620 27 00 30 00 30 00 27 29 0a 2f 50 72 6f 64 75 63 |'.0.0.')./Produc|
Notice that the original ASCII /ModDate (D:20080701052447Z00'00') is followed by a line feed and then by /CreationDate, whose value begins with the bytes 0xfe 0xff 0x00 0x44 0x00 0x3a 0x00 0x31 ...
Instead, the updated CreationDate should be written as printable ASCII (a PDFDocEncoding subset), consistent with the original ModDate and with the PDF date string format: /CreationDate (D:199812231952-08'00').
PDF specifications define date strings as containing only ASCII characters:
- In PDF spec 1.3, https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.3.pdf, Table 3.21, a date is a string, and a string is specified at the beginning of § 3.2.3 as a series of bytes—unsigned integer values in the range 0 to 255.
- In PDF spec 1.7, https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf, Table 34, a date is an ASCII string.
- In PDF spec 2.0, https://developer.adobe.com/document-services/docs/assets/5b15559b96303194340b99820d3a70fa/PDF_ISO_32000-2.pdf, Table 35, a date is also an ASCII string.
A reader might notice that § 7.9.4 in specs 1.7 and 2.0 says that a date is a “text string”, which, when read as the name of a data type rather than as plain English, may be encoded in UTF-16BE. However, that encoding is unnecessary for date fields, since they contain only printable ASCII characters.
This bug affects interoperability, as the same Info dictionary contains mixed encodings:
/ModDate (D:20080701052447Z00'00') % ASCII
/CreationDate <FEFF0044...> % UTF-16BE
Suggested fix: When writing Info-dictionary date fields, pdftk-java should write printable-ASCII strings (a PDFDocEncoding subset) instead of UTF-16BE whenever the string contains only printable ASCII characters. If an input date string contains a character outside printable ASCII, it is not a valid PDF date. Such a string might still represent a date in another language, e.g., “четверг, 19 февраля 2026 г., 23:02:57 МСК”; that case calls for an error or warning message, or for a conversion attempt.
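The decision rule suggested above could look roughly like the following Python sketch. It is only an illustration of the logic, not pdftk-java's actual code: the function names and the is_date parameter are hypothetical, and escaping of parentheses and backslashes when the bytes are serialized as a PDF literal string is deliberately left out.

```python
def is_printable_ascii(text: str) -> bool:
    """True if every character is in the printable ASCII range 0x20..0x7E."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in text)


def encode_info_value(text: str, is_date: bool = False) -> bytes:
    """Return the raw string bytes for an Info-dictionary value (hypothetical helper)."""
    if is_printable_ascii(text):
        # Printable ASCII is valid as-is; no BOM and no null bytes are needed.
        return text.encode("ascii")
    if is_date:
        # A PDF date string uses only printable ASCII, so this input is not a valid date.
        raise ValueError(f"not a valid PDF date string: {text!r}")
    # Assumption: other text fields fall back to UTF-16BE with a BOM, as today.
    return b"\xfe\xff" + text.encode("utf-16-be")


print(encode_info_value("D:199812231952-08'00'", is_date=True))
```

Whether an invalid date should raise an error, produce a warning, or be converted is a design choice left open here; the sketch raises an error only to keep the example short.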
I use pdftk-java 3.3.3 and openjdk 21.0.10 2026-01-20 on Debian trixie.
Downstream issue report: https://bugs.debian.org/1128417 .
PS. More generally, other fields of the Document Information Dictionary would also benefit from ASCII whenever the input is representable in ASCII. For example, pdftk sample.pdf update_info <(echo -e "InfoBegin\nInfoKey: Title\nInfoValue: sample") output sample_with_title.pdf && grep -a sample sample*.pdf currently does not find sample_with_title.pdf, but it would if the rewritten title “sample” were stored as ASCII. Last but not least, a little space would be saved.
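The grep behaviour and the size difference can be illustrated with a few lines of Python. This is a sketch that only compares the two byte encodings of the title; it assumes the string is stored uncompressed and unencrypted in the file, as in the hex dump above.

```python
title = "sample"

# Bytes as currently written: BOM followed by UTF-16BE code units.
utf16_bytes = b"\xfe\xff" + title.encode("utf-16-be")

# Bytes as proposed: plain ASCII.
ascii_bytes = title.encode("ascii")

# A byte-wise search for "sample" fails on UTF-16BE because of the interleaved null bytes.
found_in_utf16 = b"sample" in utf16_bytes
found_in_ascii = b"sample" in ascii_bytes

print(found_in_utf16, found_in_ascii)
print(len(utf16_bytes), len(ascii_bytes))
```

For this six-character title, the UTF-16BE form takes 14 bytes and the ASCII form takes 6.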