One of the afflictions we untrained, shoot from the hip, make it up as you go along programmers suffer from is a tendency to get stuck in our ways. If it works, you keep doing it. Learning happens when things don’t work, and there’s seldom enough time to go back and retrofit code for new learning.
This affliction hit me in a big way this week.
My little project of the day was to take a database of 700,000-odd Canadian postal codes and turn them into an ESRI shapefile that I could load into MapInfo. Or, in simpler terms, I needed to plot postal codes on a map.
Using PHP/MapScript, I wrote a PHP script to loop through the database of postal codes, grab the required information — postal code, place name, province, latitude and longitude — and add each, in turn, to a map layer.
And it worked.
Except that it worked really, really slowly.
It started off quickly enough — zooming through the first 10% of the postal code file in a couple of minutes. But once it got more than 50,000 points in its belly, things slowed to a crawl, and adding each point to the map layer was taking 5 to 10 seconds. I left the script running over the weekend, and it was only halfway through the “J” postal codes in Quebec, alphabetically, when I returned on Monday morning.
My immediate suspicion was that the more points I was adding to the map layer, the slower it was to add new ones. So I tweaked the part of the script related to adding points to the map layer. But things still ran just as slowly. I stripped out even more fluff, and made the code even more efficient. Still got bogged down. I switched the code to created ten individual layers, one for each province. No improvement.
This morning, with a clearer head, I though “maybe it has nothing to do with the map layer part of the code — maybe it’s something else.” So I stripped out all of the map-related code, leaving me with, essentially, a script that looped through 700,000 records in a database and displayed its progress. Much to my surprise, the code ran just as slowly as when it was creating map points. So the problem was database related.
To the manual.
I’ve written hundreds of thousands of lines of database-related PHP code over the past 5 or 6 years. And the vast majority of the queries have returned less than a dozen records. Which is far less than 700,000.
Right here in the PHP manual it says, about the mysql_result function I’ve always used to grab results:
When working on large result sets, you should consider using one of the functions that fetch an entire row (specified below). As these functions return the contents of multiple cells in one function call, they’re MUCH quicker than mysql_result(). Also, note that specifying a numeric offset for the field argument is much quicker than specifying a fieldname or tablename.fieldname argument.
So I modified the code to use mysql_fetch_row instead.
And now I can digitize the entire country in about a minute.
Which, assuming the originally script would have ever completed, is about a 300,000 times improvement in speed. Presumably that’s why the word MUCH is capitalized in the PHP manual!
Moral of the story: never make assumptions about where problems lie, never assume you know everything. And read the manual.