Busting Data out of PDF Forms

Peter Rukavina

For future reference, mostly my own.

I was given a disk with almost 100 PDF files containing form data that I wanted to be able to analyze. Every one of the PDFs had an “owner password” assigned – a password I wasn’t given – that prevented copy-and-paste and various other automated things I might wish to do to the files.

I hasten to add that I have permission to use the PDF files: they were given to me specifically for this purpose, and the presence of the password is an inconvenient bureaucratic hurdle. I’m not trying to “crack” anything I’m not supposed to have access to.

To work around this, I did the following:

  1. Found the password using pdfcrack. The password turned out to be a common dictionary word, so using a word list like those found here allowed me to find the password in a few seconds.
  2. Removed the password from the PDF files using qpdf along with this helpful shell script.
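  3. Extracted the form data from the files using pdftk, like this:
# Stop at the first error, and make sure the output folder exists.
set -e
mkdir -p text
for file in *.pdf; do
    echo "Dumping $file"
    # Write each PDF's form fields to text/<name>.pdf.txt
    pdftk "$file" dump_data_fields_utf8 > "text/$file.txt"
done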
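
For my own future reference, steps 1 and 2 above might look roughly like the sketch below. This is not the shell script I linked to, just a minimal stand-in: the function name unlock_pdfs, the file names sample.pdf and wordlist.txt, and the unlocked folder are all made up for illustration, and it assumes every file shares the same owner password.

```shell
# Hypothetical helper (not the linked script): write password-free copies
# of every PDF in SRC_DIR into DEST_DIR using qpdf.
# Usage: unlock_pdfs PASSWORD SRC_DIR DEST_DIR
unlock_pdfs() {
    password=$1
    src=$2
    dest=$3
    mkdir -p "$dest" || return 1
    for file in "$src"/*.pdf; do
        [ -e "$file" ] || continue    # the glob matched nothing
        echo "Unlocking $file"
        qpdf --password="$password" --decrypt "$file" "$dest/$(basename "$file")" \
            || echo "qpdf reported a problem with $file" >&2
    done
}

# Step 1, run once by hand: try each word in a word list as the owner password.
#   pdfcrack -f sample.pdf -o -w wordlist.txt
# Step 2: unlock everything in the current folder with the password found above.
#   unlock_pdfs 'the-password' . unlocked
```

The originals are left untouched; the unlocked copies land in a separate folder, which is where the pdftk loop would then be run.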

What I ended up with was a folder filled with text files with the data that had been entered into each of the PDF files. Now I can load that up into a spreadsheet or database for analysis.
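
To get from that folder of text files to something a spreadsheet will open, a small awk wrapper along these lines could flatten everything into one CSV. It is a sketch, not what I actually ran: fields_to_csv is a made-up name, and it assumes each field in the pdftk output starts with a --- line and carries single-line "FieldName: " and "FieldValue: " entries, so a value that spans several lines would be cut off at the first line.

```shell
# Hypothetical helper: turn pdftk dump_data_fields_utf8 output into
# "file","field","value" CSV rows, one row per form field.
# Usage: fields_to_csv text/*.txt > fields.csv
fields_to_csv() {
    awk '
        # Wrap a string in double quotes, doubling any quotes inside it.
        function quote(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }
        # Print the field collected so far, if any, then reset.
        function emit() {
            if (name != "") print quote(src) "," quote(name) "," quote(value)
            name = ""; value = ""
        }
        # A new file or a "---" separator ends the previous field.
        FNR == 1 || /^---/ { emit() }
        /^FieldName: /  { name = substr($0, 12); src = FILENAME }
        /^FieldValue: / { value = substr($0, 13) }
        END { emit() }
    ' "$@"
}
```

Fields that were left blank still get a row, with an empty value, which makes it easier to line the files up against each other later.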

