RTFM. Or how I made MySQL go really, really fast…

One of the afflictions we untrained, shoot from the hip, make it up as you go along programmers suffer from is a tendency to get stuck in our ways. If it works, you keep doing it. Learning happens when things don’t work, and there’s seldom enough time to go back and retrofit code for new learning.

This affliction hit me in a big way this week.

My little project of the day was to take a database of 700,000-odd Canadian postal codes and turn them into an ESRI shapefile that I could load into MapInfo. Or, in simpler terms, I needed to plot postal codes on a map.

Using PHP/MapScript, I wrote a PHP script to loop through the database of postal codes, grab the required information — postal code, place name, province, latitude and longitude — and add each, in turn, to a map layer.

And it worked.

Except that it worked really, really slowly.

It started off quickly enough — zooming through the first 10% of the postal code file in a couple of minutes. But once it got more than 50,000 points in its belly, things slowed to a crawl, and adding each point to the map layer was taking 5 to 10 seconds. I left the script running over the weekend, and it was only halfway through the “J” postal codes in Quebec, alphabetically, when I returned on Monday morning.

My immediate suspicion was that the more points I was adding to the map layer, the slower it was to add new ones. So I tweaked the part of the script related to adding points to the map layer. But things still ran just as slowly. I stripped out even more fluff, and made the code even more efficient. Still got bogged down. I switched the code to created ten individual layers, one for each province. No improvement.

This morning, with a clearer head, I though “maybe it has nothing to do with the map layer part of the code — maybe it’s something else.” So I stripped out all of the map-related code, leaving me with, essentially, a script that looped through 700,000 records in a database and displayed its progress. Much to my surprise, the code ran just as slowly as when it was creating map points. So the problem was database related.

To the manual.

I’ve written hundreds of thousands of lines of database-related PHP code over the past 5 or 6 years. And the vast majority of the queries have returned less than a dozen records. Which is far less than 700,000.

Right here in the PHP manual it says, about the mysql_result function I’ve always used to grab results:

When working on large result sets, you should consider using one of the functions that fetch an entire row (specified below). As these functions return the contents of multiple cells in one function call, they’re MUCH quicker than mysql_result(). Also, note that specifying a numeric offset for the field argument is much quicker than specifying a fieldname or tablename.fieldname argument.

So I modified the code to use mysql_fetch_row instead.

And now I can digitize the entire country in about a minute.

Which, assuming the originally script would have ever completed, is about a 300,000 times improvement in speed. Presumably that’s why the word MUCH is capitalized in the PHP manual!

Moral of the story: never make assumptions about where problems lie, never assume you know everything. And read the manual.

Comments

Daniel Von Fange's picture
Daniel Von Fange on August 24, 2004 - 15:53 Permalink

Hmm. I have a modified version of your last line. Mine says: “Read the manual — when things aren’t working.” ;)

David Richardson's picture
David Richardson on August 24, 2004 - 16:27 Permalink

On a similar note, read Tim Bray’s post today.

I once found myself processing Census data stored in a Lotus Notes database. The initial time to process of ~26 hrs was reduced to 14 minutes by reading large chunks of the data into RAM first.

Justin O'Brien's picture
Justin O'Brien on August 25, 2004 - 03:23 Permalink

Never trust yourself,” I always say. Usually it doesn’t matter but once in a while…

Oh, what the heck, one usually doesn’t learn best by getting it right, but by balling it up but good (Unless you’re learning to fly)!

Steven Perry's picture
Steven Perry on August 27, 2004 - 07:36 Permalink

I tend to always use the mysql_fetch_assoc() function, just cause it’s easier to make sense of later as it uses the fieldnames instead of the offsets.

JB's picture
JB on April 5, 2008 - 20:49 Permalink

Great Post. I am new to MapScript/PHP and I have been trying to do almost the same task that you have accomplished. Could you post a sample of the code that you used?