Busting Data out of PDF Forms

For future reference, mostly my own.

I was given a disk with almost 100 PDF files containing form data that I wanted to be able to analyze. Every one of the PDFs had an “owner password” assigned – a password I wasn’t given – that prevented copy-and-paste and various other automated things I might wish to do to the files.

I hasten to add that I have permission to use the PDF files: they were given to me specifically for this purpose, and the password is just an inconvenient bureaucratic hurdle. I’m not trying to “crack” anything I’m not supposed to have access to.

To work around this, I did the following:

  1. Found the password using pdfcrack. The password turned out to be a common dictionary word, so using a word list like those found here allowed me to find the password in a few seconds.
  2. Removed the password from the PDF files using qpdf along with this helpful shell script.
  3. Extracted the form data from the files using pdftk, like this:
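set -e          # stop at the first failure
mkdir -p text   # the output folder must exist before redirecting into it
for file in *.pdf; do
    echo "Dumping $file"
    pdftk "$file" dump_data_fields_utf8 > "text/$file.txt"
done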
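
For completeness, steps 1 and 2 might look roughly like the sketch below. This is not the shell script linked above, just a minimal stand-in: the function names crack_owner_password and unlock_all are made up for illustration, and it assumes pdfcrack and qpdf are both on the PATH.

```shell
# Step 1: dictionary attack on the owner password of one sample file.
# -f names the PDF, -o targets the owner (not user) password,
# -w names the word list to try.
crack_owner_password() {
    pdfcrack -f "$1" -o -w "$2"
}

# Step 2: write password-free copies of every PDF in a source folder
# into a separate destination folder, leaving the originals untouched.
# Arguments: password, source folder, destination folder.
unlock_all() {
    mkdir -p "$3"
    for file in "$2"/*.pdf; do
        [ -e "$file" ] || continue   # no PDFs matched the glob
        qpdf --password="$1" --decrypt "$file" "$3/$(basename "$file")"
    done
}
```

Something like crack_owner_password sample.pdf words.txt followed by unlock_all 'the-password' . unlocked would then do the job, where sample.pdf, words.txt and unlocked are placeholder names. Writing into a separate folder means a failed run can't damage the only copy of a file.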

What I ended up with was a folder filled with text files with the data that had been entered into each of the PDF files. Now I can load that up into a spreadsheet or database for analysis.
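
To get from that folder of text files to something a spreadsheet will open, a sketch along these lines may help. It assumes the usual pdftk dump layout, where each field is a block of "FieldName: ..." and "FieldValue: ..." lines separated by "---", and that every value fits on one line. The function name fields_to_csv is my own invention.

```shell
# Flatten a folder of pdftk field dumps into one CSV on stdout,
# with one row per form field: file, field name, field value.
fields_to_csv() {
    echo "file,field,value"
    for txt in "$1"/*.txt; do
        [ -e "$txt" ] || continue   # no dumps matched the glob
        awk -v file="$(basename "$txt")" '
            # Quote a CSV cell, doubling any embedded double quotes.
            function q(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }
            # Print the field collected so far, then reset for the next one.
            function emit() {
                if (name != "") print q(file) "," q(name) "," q(value)
                name = ""; value = ""
            }
            /^---/         { emit(); next }
            /^FieldName:/  { name = $0;  sub(/^FieldName: ?/, "", name);   next }
            /^FieldValue:/ { value = $0; sub(/^FieldValue: ?/, "", value); next }
            END            { emit() }
        ' "$txt"
    done
}
```

Running fields_to_csv text > forms.csv (forms.csv being a placeholder name) gives one long table that can be pivoted by file or by field once it is loaded. Fields with no FieldValue line come through with an empty value.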