For future reference, mostly my own.
I was given a disk with almost 100 PDF files containing form data that I wanted to be able to analyze. Every one of the PDFs had an “owner password” assigned – a password I wasn’t given – that prevented copy-and-paste and various other automated things I might wish to do to the files.
I hasten to add that I have permission to use the PDF files: they were given to me specifically for this purpose, and the password is just an inconvenient bureaucratic hurdle. I’m not trying to “crack” anything I’m not supposed to have access to.
To work around this, I did the following:
- Found the password using pdfcrack. The password turned out to be a common dictionary word, so using a word list like those found here allowed me to find the password in a few seconds.
- Removed the password from the PDF files using qpdf along with this helpful shell script.
- Extracted the form data from the files using pdftk, like this:
    for file in *.pdf; do
        set -e
        echo "Dumping $file"
        pdftk "$file" dump_data_fields_utf8 > "text/$file.txt"
    done
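For completeness, here is a rough sketch of what the first two steps might look like on the command line. This is not the shell script linked above, just a minimal stand-in: the function names, `wordlist.txt`, `sample.pdf` and the `unlocked` output directory are all placeholders I made up for illustration. It assumes pdfcrack's `-f`, `-o` and `-w` options and qpdf's `--password` and `--decrypt` options.

```shell
# Step 1 (sketch): try every word in a word list against the owner password.
# $1 = one of the protected PDFs, $2 = word list file (both hypothetical names).
crack_owner_password() {
    pdfcrack -f "$1" -o -w "$2"
}

# Step 2 (sketch): write a password-free copy of every PDF in the current
# directory into an output directory, leaving the originals untouched.
# $1 = the password found in step 1, $2 = output directory (hypothetical name).
decrypt_all() {
    password=$1
    outdir=$2
    mkdir -p "$outdir"
    for file in *.pdf; do
        [ -e "$file" ] || continue   # skip the literal "*.pdf" when nothing matches
        echo "Decrypting $file"
        qpdf --password="$password" --decrypt "$file" "$outdir/$file"
    done
}

# Example usage (placeholders, not real values):
#   crack_owner_password sample.pdf wordlist.txt
#   decrypt_all 'password-from-step-1' unlocked
```

Writing the decrypted copies to a separate directory means a typo in the password can't damage the originals; the pdftk loop would then be run from inside that directory.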
What I ended up with was a folder filled with text files with the data that had been entered into each of the PDF files. Now I can load that up into a spreadsheet or database for analysis.
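One possible way to get from those text files to something a spreadsheet can open is to flatten them into a single CSV. The sketch below assumes the usual pdftk dump layout, where each field is a block of `FieldName:` / `FieldValue:` lines separated by `---`, and that values fit on one line. The function name `fields_to_csv` and the output file name are my own placeholders.

```shell
# Flatten pdftk field dumps into CSV rows of file, field, value.
# $1 = directory holding the .txt dumps (e.g. "text").
# Assumption: each value is on a single line; multi-line values are not handled.
fields_to_csv() {
    dir=$1
    printf '"file","field","value"\n'
    for txt in "$dir"/*.txt; do
        [ -e "$txt" ] || continue   # skip the literal pattern when nothing matches
        awk -v src="$(basename "$txt" .pdf.txt)" '
            # Quote a string for CSV, doubling any embedded double quotes.
            function q(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }
            # Print the field collected so far, then reset for the next block.
            function emit() {
                if (name != "") print q(src) "," q(name) "," q(value)
                name = ""; value = ""
            }
            /^---/          { emit(); next }
            /^FieldName: /  { name = substr($0, 12); next }
            /^FieldValue: / { value = substr($0, 13); next }
            END             { emit() }
        ' "$txt"
    done
}

# Example usage (placeholder output name):
#   fields_to_csv text > all_fields.csv
```

This gives one row per form field per PDF, which is easy to pivot in a spreadsheet or import into a database table. Fields with no value come out as an empty string.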