The Art of the Meltdown Postmortem

Peter Rukavina

There have been two high-profile technical meltdowns in the last while, one at GitLab and another at Instapaper.

In both cases the underlying issues causing the problems were database related, and in both cases there were significant issues with both the backup regime and the emergency response routine.

Fortunately for the rest of us, in both cases the companies involved have responded, once the dust cleared, with detailed postmortems.

GitLab posted Postmortem of database outage of January 31:

On January 31st 2017, we experienced a major service outage for one of our products, the online service GitLab.com. The outage was caused by an accidental removal of data from our primary database server.

And Instapaper posted Instapaper Outage Cause & Recovery:

The critical system that failed was our MySQL database, which we run as a hosted solution on Amazon’s Relational Database Service (RDS). Here we’ll cover what went wrong, how we resolved the issue and what we’re doing to improve reliability moving forward.

This is a laudable trend, and one that’s of tremendous utility to the broader digital community. All digital systems fail. All disks fail. All DNS setups go wrong. Having procedures in place to deal with this is all you can do; and because it’s hard to imagine, in advance, how and why things will go wrong, gaining insight from the real-world failures of others with a similar setup is some of the best education you can get.

I learned actionable things from both GitLab’s issue and Instapaper’s.

For example, I’ve added more redundancy to regular MySQL backups and, following the example of Marco Arment, have added an automatic process that, every day, launches a new EC2 instance, installs MySQL, imports the data dumped from Amazon RDS, and performs tests to check data integrity.
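As a rough illustration of what the "tests to check data integrity" step could look like, here is a minimal Python sketch. It is not the actual script behind the process described above: the function names, the idea of comparing per-table row counts, and the tolerance parameter are all assumptions made for the example. The one thing it leans on from MySQL itself is that mysqldump, with its default comment settings, ends a successful dump with a line starting "-- Dump completed".

```python
from pathlib import Path


def dump_looks_complete(dump_path, tail_bytes=4096):
    """Return True if the end of the dump file contains mysqldump's
    completion marker. Assumes the dump was made with comments enabled
    (the mysqldump default); a truncated dump will lack the marker."""
    path = Path(dump_path)
    size = path.stat().st_size
    if size == 0:
        return False
    with path.open("rb") as fh:
        # Only read the tail, since dumps can be very large.
        fh.seek(max(0, size - tail_bytes))
        tail = fh.read()
    return b"-- Dump completed" in tail


def find_short_tables(expected_counts, restored_counts, tolerance=0.0):
    """Return the sorted names of tables that are missing from the
    restored database, or whose restored row count falls below the
    expected count by more than the given fractional tolerance.

    Both arguments are plain dicts of {table_name: row_count}; how the
    counts are gathered (for example with SELECT COUNT(*)) is left out
    of this sketch."""
    problems = []
    for table, want in expected_counts.items():
        got = restored_counts.get(table)
        if got is None or got < want * (1 - tolerance):
            problems.append(table)
    return sorted(problems)
```

A daily job might call dump_looks_complete() before importing the dump at all, then call find_short_tables() with counts taken from the source and from the freshly restored instance, and raise an alert if either check fails. A small tolerance allows for rows written to the source after the dump was taken.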

And, from Instapaper’s issue, I’ve learned about a hard limit of 2TB for older Amazon RDS instances, and about some of the challenges of migrating from legacy MySQL setups into Amazon’s Aurora.

Being open about failure, especially when the failures are as much human as technical, isn’t easy, and it goes against our natural impulses to try to contain and control information. So GitLab’s and Instapaper’s engineering teams and management deserve our thanks for realizing that being open is ultimately good not only for their business but also for the wider web.
