Several of my recent procrastinations came crashing together tonight. Earlier in the week I did some experiments with scraping Charlottetown City Council minutes. Back in November, I documented a method for grabbing PEI civic address data into a local database. And my Interactive Charlottetown Transit Map and RealCharlottetown.com projects have taken me into the world of “Google Maps mashups.”
So what happens when you take all those projects and jumble them together? The Charlottetown City Council Minutes: Addresses Mentioned map. It looks like this:
Here’s how I made it:
- Because of my earlier RSS experiments, I already had a MySQL table containing the web addresses of Charlottetown City Council meeting minute PDF files — basically a list of the links you’ll find here. So I began by writing a little script to grab each of the PDF files and store it locally.
- Next, I used pdftotext (part of Xpdf) to convert each of the PDF files to a plain old ASCII text file.
- I extracted the names of all of the streets in Charlottetown from my database of PEI civic addresses, and then ran a PHP script over each meeting minutes file, looking for occurrences of a number followed by the first word of each of the 496 street names in the city, such as “84 Fitzroy” or “100 Prince”.
- For every match found, I looked up the civic address in my database; if it had a latitude and longitude, I assumed it was a valid civic address and inserted an entry in a new MySQL table recording the location, the address, and the minutes PDF file I found it in.
- Using the RealCharlottetown.com Google Maps-making code as a starting point, I created a “mash up” script in PHP that plots each of the 265 addresses I found on a map.
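The matching and lookup steps above could be sketched roughly as follows. This is Python rather than the PHP the original script was written in, and it is only an illustration under stated assumptions: the function names and the `civic_lookup` dictionary are hypothetical, with the dictionary standing in for the MySQL table of civic addresses.

```python
import re


def build_address_pattern(street_names):
    """Compile a regex for a civic number followed by the first word of a street name."""
    # Use only the first word of each street name, e.g. "Fitzroy" from "Fitzroy Street".
    first_words = {name.split()[0] for name in street_names if name.strip()}
    # Try longer words first so a short name cannot shadow a longer one.
    ordered = sorted(first_words, key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(\d{1,5})\s+(" + alternation + r")\b", re.IGNORECASE)


def find_candidate_addresses(text, pattern):
    """Return (number, street_word) pairs for every match in the minutes text."""
    return [(int(m.group(1)), m.group(2)) for m in pattern.finditer(text)]


def keep_geocoded(candidates, civic_lookup):
    """Keep only candidates that have coordinates in the (hypothetical) lookup.

    civic_lookup maps (number, lowercased street word) to a (lat, lon) tuple;
    anything missing from it is treated as not being a real civic address.
    """
    results = []
    for number, street in candidates:
        coords = civic_lookup.get((number, street.lower()))
        if coords is not None:
            results.append(
                {"number": number, "street": street, "lat": coords[0], "lon": coords[1]}
            )
    return results
```

Requiring a coordinate hit before keeping a match is what filters out false positives: a phrase that merely looks like a number followed by a street word is dropped unless it corresponds to a known civic address.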
The process is, of course, imperfect. My address matching could be better. My process for grabbing the “excerpt” of the minutes could use some work. Sometimes the minutes don’t record a street number, so I don’t pick up a match. And of course the minutes don’t necessarily record every mention of an address at council meetings. But it’s still a pretty neat way of visualizing council business geographically.
Note that the page may take a while to load, especially in Safari. Firefox might complain that “a script on this page is taking too long” (you can just click “Continue”). And I haven’t yet tested this in any version of Internet Explorer.
I’ll cobble together an open source release of the source code for all this ASAP. Comments welcome.