In September of 2011, Homburg Invest Inc. and various related companies filed a motion to obtain protection under the Companies’ Creditors Arrangement Act (CCAA). The firm of Samson Bélair/Deloitte & Touche Inc. was appointed by the court as monitor, and has been publishing a rich collection of documents related to the companies and the process ever since. In Prince Edward Island, Homburg Invest Inc. is well known for its development of the downtown Holman Grand Hotel, a significant development for many reasons, including the loaning of public money to the effort.
The complex web of companies involved in this action is difficult for the layperson to understand. The sheer breadth of documentation available (210 documents as of this writing) should, in theory, make understanding possible, but with the data locked inside PDF files and written in “inside baseball” investment terminology, this is a challenge.
This situation seems tailor-made for a repository driven by Islandora: in theory I should be able to ingest the PDF files, make them searchable in myriad ways, and allow them to be annotated.
As a starting point, I need to harvest the PDF documents from the Deloitte Touche website, which is easily done with the wget command (easily installed on OS X if you don’t have it already; generally pre-installed on most modern Linux distributions).
Grabbing all of the PDF files in a single go is as easy as:
wget -l 1 -r -A.pdf "http://www.deloitte.com/..."
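I use -l 1 (that’s hyphen-el-space-one) to limit the depth of the download to a single level; this prevents wget from scraping unrelated PDFs from other parts of the site. The -r turns on recursive retrieval, which is required even at a depth of 1: without it wget would fetch only the starting page and never follow its links, and the -l and -A options would have nothing to act on. The -A.pdf says “limit yourself to PDF files”; wget still downloads the starting HTML page so it can read the links, then discards it because it doesn’t match.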
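For readability, the same harvest can be sketched with GNU wget’s long option names. This is only an illustration, not the command I actually ran: the fetch_pdfs function and the homburg-pdfs directory are names I’m making up here, and --no-directories, --directory-prefix and --wait are extras I’m assuming are wanted (flatten everything into one directory, choose where it goes, and pause a second between requests).

```shell
# Same harvest as above, spelled out with GNU wget's long option names.
# fetch_pdfs and the "homburg-pdfs" directory are hypothetical names.
fetch_pdfs() {
  dest_dir=$1
  start_url=$2
  wget \
    --recursive \
    --level=1 \
    --accept=.pdf \
    --no-directories \
    --directory-prefix="$dest_dir" \
    --wait=1 \
    "$start_url"
}

# Example call (the URL is elided in this post, so it is left as-is):
# fetch_pdfs homburg-pdfs "http://www.deloitte.com/..."
```

Wrapping the call in a function keeps the options in one place if the harvest needs to be re-run as new documents are published.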
Two minutes and 47 seconds later I have 210 PDF files sitting in a local directory. Next step: get Islandora running and figure out how to ingest the PDF files and add metadata to them.
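Before moving on, it may be worth a quick sanity check that the harvest really produced PDFs and not, say, HTML error pages saved with a .pdf name. A minimal sketch, assuming a POSIX-style shell with find and head available; the three function names are hypothetical helpers of my own, and the check only looks at the “%PDF-” signature at the start of each file rather than validating the whole document.

```shell
# Count files ending in .pdf anywhere under a directory.
count_pdfs() {
  find "$1" -type f -name '*.pdf' | wc -l | tr -d ' '
}

# Succeed only if the file starts with the "%PDF-" signature.
is_pdf() {
  [ "$(head -c 5 "$1")" = "%PDF-" ]
}

# Print the path of every .pdf file that does not start with the signature.
list_bad_pdfs() {
  find "$1" -type f -name '*.pdf' | while IFS= read -r f; do
    is_pdf "$f" || printf '%s\n' "$f"
  done
}
```

Running count_pdfs . in the download directory should report the same number of files as the harvest, and list_bad_pdfs . should print nothing if every file looks like a PDF.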