Resurrecting a Lost Website

Peter Rukavina

In a somewhat irrational move, back in 2004 I migrated from using Flickr to share my photos to using an installation of the open source Gallery project. In theory this should have been a perfectly fine migration, except that, at least at the time, Gallery wasn’t really satisfying in any way other than being not Flickr, and so, after a while, I abandoned it. And, a while after that, the Gallery install didn’t survive the transition to a new machine, and so the content was lost.

This wouldn’t have been a serious issue but for two things. First, the photos involved were, at least emotionally, valuable to me: they were of a trip I took with my brother Mike to the U.S. southwest and of my exploits as a blogger during the 2004 U.S. election, including photos from the Democratic National Convention. Second, posts here from that summer link to those photos, and those links have been broken for almost a decade. I decided that I wanted to fix that.

Fortunately, the Wayback Machine has a complete copy of the site in its archive and so all that remained was getting the images out of that archive and into somewhere — likely Flickr — less susceptible to decay.

And, more fortunate still, there’s a handy tool called Warrick designed to do just that — to resurrect websites from archives like the Wayback Machine so they can live another day.

Warrick is working its way through scraping the site out of the Wayback Machine as I type, but should you wish to resurrect your own dead site from the past, here’s a quick summary of how I got it working.

  1. To ensure I had a relatively modern version of Linux at hand, I started up an Amazon EC2 instance, using the stock Amazon Linux AMI.
  2. Installed some missing parts inside the instance:
sudo yum  -y install
sudo perl -MCPAN -e 'install HTML::TagParser'
sudo perl -MCPAN -e 'install HTML::LinkExtractor'
  3. With these in place, it was then simply a matter of:
chmod +x ./warrick.pl
./warrick.pl "http://photos.reinvented.net"

The result is that Warrick starts up, pulls a complete mirror of the archived site from the Wayback Machine and dumps it on the local disk. From there I can pull the full-sized JPEGs out of the Gallery installation, and upload them wherever I like.
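
Pulling the full-sized JPEGs out of the mirror could look something like the sketch below. It rests on assumptions that are not confirmed here: that the mirror lands in a local directory, and that Gallery marked thumbnails and resized copies with `.thumb.` and `.sized.` in their filenames. Check the mirrored files first and adjust or drop those exclusions if they look different. The function name `collect_jpegs` is hypothetical.

```shell
#!/bin/sh
# Sketch only: gather the full-sized JPEGs from a mirrored site into one folder.
# Assumptions: the mirror sits in a local directory, and thumbnails and
# resized copies carry ".thumb." or ".sized." in their names.

collect_jpegs() (
    src="${1%/}"    # mirror directory, trailing slash removed
    dest="$2"       # output folder; keep it outside the mirror
    mkdir -p "$dest"

    find "$src" -type f \
        \( -iname '*.jpg' -o -iname '*.jpeg' \) \
        ! -iname '*.thumb.*' ! -iname '*.sized.*' \
        -exec sh -c '
            src=$1; dest=$2; shift 2
            for file do
                # Fold the album path into the filename so that files with
                # the same name in different albums do not collide.
                rel=${file#"$src"/}
                flat=$(printf "%s" "$rel" | tr "/" "_")
                cp -p "$file" "$dest/$flat"
            done
        ' sh "$src" "$dest" {} +
)
```

Something like `collect_jpegs ./photos.reinvented.net ./full-size` (both directory names are placeholders) would then leave a flat folder of images ready to upload. Copying rather than moving keeps the mirror intact in case the exclusions turn out to be wrong.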
