Regular readers may recall a project I undertook last month to create a montage of covers of The Guardian newspaper from 1912 to mark the newspaper’s 125th anniversary. It only seemed appropriate that, when 2012 ended, I would do the same thing for the 2012 volume of the paper.

So I wrote some code – you can grab it here and try it out for yourself – to scrape cover thumbnails from the Pressdisplay.com site, where they are cached for use in presenting the digital edition of the newspaper, then stitched the images together using ImageMagick. The result looks like this (you can download a 62 MB JPEG image or look at this version in zoom.it if you want to explore in more detail):

The 2012 covers from The Charlottetown Guardian.
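For anyone curious about the stitching step, here is a minimal sketch of how a folder of thumbnails could be handed to ImageMagick's montage command from Python. This is not the script linked above: the function name, the "covers" folder, the 20-column layout and the assumption that filenames sort in date order are all mine, for illustration only.

```python
import subprocess
from pathlib import Path


def build_montage_command(cover_files, output_file, columns=20):
    """Return the argv list for an ImageMagick `montage` call.

    Assumes the cover filenames sort in publication order (hypothetical).
    """
    covers = sorted(str(f) for f in cover_files)
    if not covers:
        raise ValueError("no cover images given")
    return [
        "montage",
        *covers,
        "-tile", f"{columns}x",   # fixed number of columns, as many rows as needed
        "-geometry", "+0+0",      # keep each image's size, no gap between tiles
        str(output_file),
    ]


if __name__ == "__main__":
    # "covers" is a hypothetical folder of already-downloaded thumbnails.
    files = Path("covers").glob("*.jpg")
    subprocess.run(build_montage_command(files, "montage.jpg"), check=True)
```

Building the argument list separately from running it makes it easy to print and inspect the command before committing to a long-running job.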

Just in case you missed a point buried in the comments, I’m doing my Hacker in Residence blogging over at http://hack.ruk.ca/ (a server that, despite the domain, is hosted at UPEI and will shortly be blessed with an institutional name).

While I’m usually loath to fork a blog – part of the appeal, I think, of this space is its wide-ranging coverage of different topics – I decided it better to spare the readership here from the minutiae of discussion of repositories and geopresence archiving and diagrams of my university office.

If, however, you’re into such minutiae, come on over: read the blog, or subscribe to its RSS feed.

So, apparently, Islandora isn’t quite the “point Drupal at some documents and it magically transforms them into a repository” solution I’d been imagining in my dreams. And so my visions of “take a collection of Homburg Invest insolvency PDF files and turn them into a repository” got bogged down, yesterday, in issues like “global deny-apim policies,” exacerbated by the fact that the virtual server I’m using here isn’t a “fresh install,” but rather a clone of an earlier Islandora server with some loose edges left to be tied up. I’m sure I’ll get there with Islandora shortly but, in the meantime, I had some PDF files I wanted to dive into.

Which led me to Evernote, a web app and desktop application that’s as close as you can come to a “repository” without actually calling yourself a repository.  I’ve been using Evernote for several years on my Mac, my iPad, my iPod Touch and my Windows Phone to manage my personal documents: I dump all my bank statements, household bills and other ephemera of daily life into Evernote. The app provides me with handy ways of organizing these notes: not only can I tag notes, and organize them into “notebooks,” but Evernote also does a kind of OCR on them to enable full-text search of any text – even scanned images – in any note.

So, for example, if I’m trying to find a receipt for filling up my rental car with gasoline at the Gulf station at Logan Airport in Boston, I can just search for keyword “gulf” and Evernote shows me all the notes containing that keyword:

An Evernote search through my personal documents.

So, as a before-I-figure-out-Islandora way of searching my collection of 210 documents related to the Homburg Invest insolvency, I decided to try out Evernote as a repository-in-everything-but-name: I simply dragged the PDF files into a new notebook in Evernote, waited a few minutes for the app to sync and OCR them, and then took it out for a ride. A search for keyword “Holman Grand,” for example, shows me all of the PDFs containing one or more references to that hotel project:

Evernote keyword search for Holman Grand.

It not only displays the list of 16 documents containing the keyword, but it also highlights the keyword inside the PDF files themselves:

Evernote keyword highlighting example.

What’s even nicer about Evernote is that with a single click I can make any local notebook a shared notebook, with a public URL, and at that URL all of the same searching and organizing features are available in a web interface, and anyone with Evernote on their local machine can add a local version of the notebook to their personal collection:

Sharing an Evernote notebook.

As a result, you can now search the collection of Homburg Invest insolvency documents in Evernote yourself (the only thing you’ll miss compared to the desktop application is keyword-highlighting inside PDF files).

I’m going to keep plugging away at Islandora, perhaps setting myself up with a Drupal 6 install so I can use the more battle-tested version of Islandora targeted at that version of Drupal; if nothing else, Evernote’s utility will be a good yardstick against which to measure whatever I can come up with there.

In September of 2011, Homburg Invest Inc. and various related companies filed a motion to obtain protection under the Companies’ Creditors Arrangement Act (CCAA). The firm of Samson Bélair/Deloitte & Touche Inc. was appointed by the court as monitor, and has been, ever since, publishing a rich collection of documents related to the companies and the process. In Prince Edward Island, Homburg Invest Inc. is well known for its development of the downtown Holman Grand Hotel, a significant development for many reasons, including the loaning of public money to the effort.

The complex web of companies involved in this action is difficult for the layperson to understand; the sheer breadth of documentation available – 210 documents as of this writing – should, in theory, make understanding possible, but with the data locked inside PDF files, and written in “inside baseball” investment terminology, this is a challenge.

This situation seems like one tailor-made for a repository driven by Islandora: in theory I should be able to ingest the PDF files, allow them to be searched in myriad ways, and allow them to be annotated.

As a starting point, I need to harvest the PDF documents from the Deloitte Touche website, which is easily done with the help of the wget command (easily installed on OS X if you don’t have it already; generally available pre-installed in most modern Linux distributions).

To grab all of the PDF files in a single go is as easy as:

wget -l 1 -r -A.pdf "http://www.deloitte.com/..."

I use -l 1 (that’s hyphen-el-space-one) to limit the depth of the download to a single level; this prevents wget from scraping other unrelated PDFs from other parts of the site. The -r makes the process recursive (perhaps not technically required in this situation because I’m only using a depth of 1) and the -A.pdf says “limit yourself to PDF files”.

Two minutes and 47 seconds later I have 210 PDF files sitting in a local directory. Next step: get Islandora running and figure out how to ingest the PDF files and add meta-data to them.

I spent a few hours in the library this morning, but doing any useful digital work was prevented by ergonomics issues. As someone who, in essence, has been typing professionally for 32 years, I have a stronger-than-normal concern with the design of my workspace, and having a proper keyboard, mouse and display as well as having a good chair and a desk that’s at the right height are all vitally important if I’m going to be productive. The setup of my room 322 office is getting better – I acquired a new keyboard on the weekend, and brought an Apple Mighty Mouse out of my strategic mouse reserve – but it’s still not optimal: I really need to find a way of getting a larger monitor, as staring at the tiny 13” laptop screen isn’t going to work, I need to take my wrist brace with me from the downtown office, and I need a pair of shoes (often overlooked: shoes affect foot placement, which ripples through to a whole host of other ergonomic issues). I’ve got a list, and I hope by my next visit things will be optimized.

The Ethernet jack in room 322 is still dead – I inquired today and was told that it’s hoped someone will look at it today – so I was also a little less wired (or, rather, wirelessed) in the office today, which contributed to a feeling of almost-but-not-quite-ness.

Fortunately the analog world rose to the challenge: a few weeks ago, in a blog post about the Woodside Press in Brooklyn, I came across a reference to the book In The Day’s Work:

When I first got into printing, I was given a book by DB Updike called “In The Day’s Work” which I recommend to anyone in any creative field, although this book is more for the practical printer. Basically it says take pride in your work, keep your shop clean and organized, and don’t let your clients make decisions because you, as the printer, are the expert. I should have read that more closely and taken it to heart.

I knew that it was a book that I had to read, for a whole host of reasons. I found the book in Google Books, but the full text wasn’t available; as it was a Harvard University Press book, I sent an enquiry to the Harvard Library via its excellent Ask a Librarian service, and a librarian quickly replied:

We don’t have access to the Google Books version either. You can use this form to request a copy of Harvard material, but such a request might run afoul of copyright restrictions. In any case, you will be notified as to the viability of your request.

Other than that, WorldCat shows it to be available at a large number of libraries, and if there is one near you and you can get access, you may be able to do your own copying or scanning.

Per their advice, I requested an interlibrary loan of the book from Robertson Library on January 3, and, just over a week later, I received notice that the book had arrived, and I picked it up at the circulation desk this morning. It’s a beautiful book, obviously typeset with care (understandable given the audience and the material); the copy I received came from Memorial University in Newfoundland, and a bookplate indicates that it came there as a gift from Harvard.

Otherwise, I’m giving some thought to Islandora, the Drupal front-end for Fedora Commons that UPEI is deeply steeped in, and, specifically, I’m thinking about mechanisms by which Islandora could become “more Drupal-like.” More on that soon.

Here’s an undated postcard from Charlottetown (from here via here, originally from Doug Murray, Postal Historian), showing “Sunnyside,” along Grafton Street between Queen and University:

Sunnyside Postcard

And here’s a front-page illustration from the September 28, 1904 issue of The Guardian, showing the paper’s home on the occasion of the installation of a new press (a Cox Duplex Angle-Bar Press).

It is ironic that, in a front page story celebrating a new generation of printing technology, the newspaper spelled its own name wrong.

The Guardian is the middle building in the postcard above, later the home of Holman’s, which was later torn down to make the Grand Homburg Hotel. While generally destroying the block and “preserving” the Holman building in only the weakest sense, the Grand Homburg did maintain the general structure of the windows you see in the Guardian illustration:

The Grand Homburg Hotel

We tracked down an old digital camera, the one that Catherine took to Iceland in the fall of 2008, and I found 338 photos on it that I thought had been lost. Among them is this one, taken on a September afternoon, at Þingvellir.

Oliver, aged 7 (almost 8), and I are running through the rain, a sudden unexpected downpour, as a double rainbow forms in the background (I don’t think we’d seen it yet; clearly Catherine had).

Rain, Son, Father, Rainbow, Iceland

In a somewhat irrational move, back in 2004 I migrated from using Flickr to share my photos to using an installation of the open source Gallery project. In theory this should have been a perfectly fine migration, except that, at least at the time, Gallery wasn’t really satisfying in any way other than being not Flickr, and so, after a while, I abandoned it. And, a while after that, the Gallery install didn’t survive the transition to a new machine, and so the content was lost.

This wouldn’t have been a serious issue but for two things: first, the photos involved were, at least emotionally, valuable to me, as they were of a trip I took with my brother Mike to the U.S. southwest and of my exploits as a blogger during the 2004 U.S. election, including photos from the Democratic National Convention; second, there are links to those photos, from posts here over that summer, that have been broken for almost a decade, and I decided that I wanted to fix that.

Fortunately, the Wayback Machine has a complete copy of the site in its archive and so all that remained was getting the images out of that archive and into somewhere — likely Flickr — less susceptible to decay.

And, more fortunate still, there’s a handy tool called Warrick designed to do just that — to resurrect websites from archives like the Wayback Machine so they can live another day.

Warrick is working its way through scraping the site out of the Wayback Machine as I type, but should you wish to resurrect your own dead site from the past, here’s a quick summary of how I got it working.

  1. To ensure I had a relatively modern version of Linux at hand, I started up an Amazon EC2 instance, using the stock Amazon Linux AMI.
  2. Installed some missing parts inside the instance:
sudo yum  -y install
sudo perl -MCPAN -e 'install HTML::TagParser'
sudo perl -MCPAN -e 'install HTML::LinkExtractor'
  3. With these in place, it was simply a matter of:
chmod +x ./warrick.pl
./warrick.pl "http://photos.reinvented.net"

The result is that Warrick starts up, pulls a complete mirror of the archived site from the Wayback Machine and dumps it on the local disk. From there I can pull the full-sized JPEGs out of the Gallery installation, and upload them wherever I like.
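Pulling the images out of the mirror is mostly a matter of walking the directory tree. Here is a minimal sketch, assuming the mirror landed in a local folder (the folder names below are hypothetical). It copies every JPEG it finds, thumbnails included, because telling full-sized images apart depends on how Gallery named its files, and I make no assumption about that here.

```python
import shutil
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def collect_jpegs(mirror_dir, dest_dir):
    """Copy every JPEG found under mirror_dir into dest_dir.

    Returns the list of copied files. Each copy is prefixed with the name of
    its parent folder, which reduces (but does not eliminate) the chance that
    two same-named files overwrite each other.
    """
    mirror, dest = Path(mirror_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() walks the whole tree before any copying starts
    for path in sorted(mirror.rglob("*")):
        if path.is_file() and path.suffix.lower() in JPEG_SUFFIXES:
            target = dest / f"{path.parent.name}_{path.name}"
            shutil.copy2(path, target)  # copy2 keeps file timestamps
            copied.append(target)
    return copied


if __name__ == "__main__":
    # Both folder names are hypothetical placeholders.
    for copied_file in collect_jpegs("photos.reinvented.net", "recovered-photos"):
        print(copied_file)
```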

One of my longtime interests has been “personal telemetry,” or, in the current vogue, “the quantified self.” Since my early involvement with the Plazes project I’ve had a particular interest in “geopresence” – the record of where I’ve been, when. My breadcrumbs, in other words.

Over the years I’ve been dropping digital breadcrumbs in a variety of ways that I hope to aggregate, archive and develop tools for the exploration of:

  • 10,973 Plazes check-ins from 2004 to 2012.
  • 2,176 Foursquare check-ins from 2009 to present.
  • 6,245 Google Latitude records from 2010 to present.
  • Some proportion of the 20,010 tweets I’ve posted to Twitter (tweets where my Twitter client has added geolocation data).
  • Some proportion of the 34,839 photos in my iPhoto library (photos with geolocation in the EXIF data).
  • Some proportion of 60 months worth of Metro Credit Union bank statements (transactions that I can attach to a specific place).
  • Some proportion of 48 months worth of personal MasterCard statements (transactions that I can attach to a specific place).
  • Some proportion of 60 months worth of corporate Visa statements (transactions that I can attach to a specific place).

As a starting point, I’d like to work to convert all of these streams into KML Placemarks, which seems like a nice widely-adopted XML standard that can capture both the time and location of a “geopresence.” Here’s my first Foursquare checkin, for example, as a Placemark:


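  <Placemark>
    <name>Reinvented HQ</name>
    <description>
      Not sure that I completely understand
      Foursquare - seems like Plazes,
      but harder to use
    </description>
    <updated>Mon, 26 Oct 09 18:12:56 +0000</updated>
    <published>Mon, 26 Oct 09 18:12:56 +0000</published>
    <visibility>1</visibility>
    <Point>
        <extrude>1</extrude>
        <altitudeMode>relativeToGround</altitudeMode>
        <coordinates>
            -63.12968790933351,46.236265592340494
        </coordinates>
    </Point>
  </Placemark>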

This was easy to grab because Foursquare has a handy page that will export all your checkins directly into a KML file. Twitter appears poised to release the entire back-catalog of tweets for each user any day now – there were tests of this in the wild in late 2012. When Nokia shut down Plazes they allowed Plazes users to export their complete history as delimited ASCII. Google Latitude provides for export of history as KML (albeit via a little URL hacking to get the date range right). The other streams will require some PDF file parsing and some intelligence to convert credit card and bank statement description lines into georeferenced locations.

My Foursquare location history in Sweden and Denmark.
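Converting a delimited export into Placemarks, as described above, might be sketched like this in Python. Everything about the input is an assumption on my part: the column names (name, when, lon, lat), the tab delimiter and the sample row are hypothetical, not the actual Plazes export layout. For the time I've used KML's standard TimeStamp element with an ISO 8601 instant, which is a different choice from the date elements in the Foursquare export.

```python
import csv
import io
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

# Hypothetical tab-delimited export: one check-in per row.
SAMPLE = (
    "name\twhen\tlon\tlat\n"
    "Reinvented HQ\t2009-10-26T18:12:56Z\t-63.12968790933351\t46.236265592340494\n"
)


def checkin_to_placemark(name, when, lon, lat, description=""):
    """Build one KML Placemark element for a single check-in."""
    placemark = ET.Element("Placemark")
    ET.SubElement(placemark, "name").text = name
    if description:
        ET.SubElement(placemark, "description").text = description
    # The "when" of the geopresence: an instant inside <TimeStamp><when>
    timestamp = ET.SubElement(placemark, "TimeStamp")
    ET.SubElement(timestamp, "when").text = when
    # The "where": KML orders coordinates as longitude,latitude
    point = ET.SubElement(placemark, "Point")
    ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return placemark


def rows_to_kml(rows):
    """Wrap an iterable of dict-like rows in a KML document string."""
    kml = ET.Element("kml", xmlns=KML_NS)
    document = ET.SubElement(kml, "Document")
    for row in rows:
        document.append(
            checkin_to_placemark(
                row["name"], row["when"], row["lon"], row["lat"],
                row.get("description", ""),
            )
        )
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
    print(rows_to_kml(reader))
```

Keeping the per-check-in function separate from the file reading means the same Placemark builder could be reused for the other streams once each has been reduced to a name, a time and a pair of coordinates.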

When it came time to choose a location for my office in Robertson Library I had two options: joining Peter Lux in a relatively cavernous sunlit space formerly occupied by various technicians and programmers, and, around the corner and down a hall, a decidedly uncavernous, non-sunlit 50 square foot research cubicle. I chose the tiny cubicle because, despite the obvious advantages of proximity to Peter, to say nothing of the advantages of sunlight, I wanted to ensure that I could establish a “sense of place” for my base of operations, and that’s something that’s hard to do in an open-plan “bullpen”.

Given that my focus as Hacker in Residence is going to be primarily digital, you might very well ask why I need an office in the library at all. I’ve certainly asked myself this question, as the logistics of slogging myself up to campus, laptop in tow, from my already-rather-comfortable perch in my day-job Reinventorium are, given that my life is otherwise restricted to a small 3-block area downtown, somewhat onerous.

But I decided that it was important to be as much physically embedded as digitally embedded: I want to be able to work among the librarians, technicians, researchers, staff and students, and to be able to listen to and observe what they do; I want to understand more about what value having a physical library brings in a digital age; and I don’t want to limit myself only to the digital realm in my various experiments and enlivenings.

And so I’m set up in Room 322, tiny and, to my mind, perfect.

The office is just down the hall from the stacks, specifically the shelves that hold books from Library of Congress classification LB “Theory and practice of education” to ML “Literature on music.” Which is not a bad intersection at which to carve out a niche.

One of the other great benefits of being physically embedded is daily exposure to the eccentric collection of physical artifacts adorning the walls of the library. There are maps and plaques and new book displays and, around a corner here and a corner there, unusual bits of art like this uncredited work installed in a break in the concrete.

Taking a page from my father’s book – he is a great maker of maps and diagrams – one of my first acts, upon getting settled in my new office, was to borrow a measuring tape and take measurements, and then to use these to make a Sketchup model of the office. You can see a rendering below, or grab the Sketchup file itself if you want to walk around it yourself.

Room 322, Robertson Library

About This Blog

I am Peter Rukavina. I am a writer, letterpress printer, and a curious person.
