Postmortem in programming

Andres Felipe Vasquez Chica
Oct 4, 2021

What is a post-mortem?

A post-mortem is where a team reflects on what went wrong with something they did, then documents it and/or amends their process to stop it from happening again.

Did a software release go bad? Let’s break down a timeline of where things started to go wrong, and let’s reflect on how we could have caught it earlier.

Here is the most important point: post-mortems ARE NOT for assigning blame. If we look at the Knight Capital Group example, there should have been no way for one person to forget something and cripple the company.

Where was the quality assurance process where someone checked the technician’s work? Did they test this before going to production? Were there no automatic tests that ran before the deploy to production succeeded?

You should be looking for process failures, not personal failures.

Why should we do a post-mortem?

So we can stop making the same mistake over and over!

We deliver more robust, more stable software with fewer bugs by learning from how we failed in the past.

Most importantly, we can catch bugs we don’t even know about. And if we fix the processes that were prone to cause issues early, then those mistakes won’t even happen.

We want to learn every single lesson we can from the outages and emergencies to ensure they never happen again. Nothing is more valuable than experience.

Let’s look at some post-mortems together

I wanted a little bit of variety of companies and languages, so let’s review some from Google, Microsoft, and Flowdock.

A common post-mortem template will contain some key details like:

  • when it happened
  • who owns the post-mortem (and will do the analysis)
  • some lessons learned
  • a rough timeline of the emergency bug and some actions from the post-mortem
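
The template fields listed above can be captured as a small data structure. What follows is only a minimal sketch in Python, not a standard format: the class and field names (PostMortem, TimelineEntry, ActionItem) and the owner names are hypothetical, and the sample values are borrowed from the Google incident described later in this article.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimelineEntry:
    at: datetime   # when the event happened
    event: str     # what was observed or done


@dataclass
class ActionItem:
    description: str   # a concrete, short-term fix
    owner: str         # who follows up (hypothetical field)
    done: bool = False


@dataclass
class PostMortem:
    title: str
    occurred_at: datetime   # when it happened
    owner: str              # who owns the post-mortem and does the analysis
    lessons_learned: list[str] = field(default_factory=list)
    timeline: list[TimelineEntry] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def duration(self) -> timedelta:
        """Time between the earliest and latest timeline entries."""
        if len(self.timeline) < 2:
            return timedelta(0)
        times = sorted(entry.at for entry in self.timeline)
        return times[-1] - times[0]

    def open_actions(self) -> list[ActionItem]:
        """Action items that still need to be completed."""
        return [item for item in self.action_items if not item.done]


# Example values; the owners are made up for illustration.
pm = PostMortem(
    title="Every search result flagged as harmful",
    occurred_at=datetime(2009, 1, 31, 6, 30),
    owner="on-call engineer (hypothetical)",
    lessons_learned=["A single bad list entry can affect every result."],
    timeline=[
        TimelineEntry(datetime(2009, 1, 31, 6, 30), "Warnings appear on all results"),
        TimelineEntry(datetime(2009, 1, 31, 7, 25), "Warnings stop appearing"),
    ],
    action_items=[
        ActionItem("Run checks whenever the list file changes", "list maintainers (hypothetical)"),
    ],
)

print(pm.duration())
print(len(pm.open_actions()))
```

Keeping the action items as structured entries with an owner makes it easy to see which follow-ups are still open after the meeting ends.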

So let’s dive in.

Google

If you did a Google search between 6:30 a.m. PST and 7:25 a.m. PST on January 31st, 2009, you would have seen the message “This site may harm your computer” accompanying every single result.

What happened?

Google flags a search result with this message if the site is known to install malicious software. The warning exists to protect Google users, and the list of flagged sites is compiled from Google’s automated algorithms, manual data entry, and a non-profit called StopBadware.org.

One of the developers had updated the list and accidentally added “/” as an entry. Because “/” appears in every URL, the entry matched every single site!

We know this one was human error, and because of this, Google implemented some tests and checks whenever that file changes. (And I haven’t seen it happen again since 2009!)
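
To make the failure mode concrete, here is a minimal sketch in Python of how a single “/” entry can match everything, and of the kind of check that could run whenever the list changes. Google’s actual matching logic and checks are not described in this article, so everything here is an assumption: the substring matching, the function names is_flagged and validate_patterns, the MIN_PATTERN_LENGTH threshold, and the sample URLs are all hypothetical.

```python
# Assumption: list entries are matched as plain substrings of the URL.
MIN_PATTERN_LENGTH = 4  # hypothetical threshold for a plausible entry


def is_flagged(url: str, bad_patterns: list[str]) -> bool:
    """Return True if any list entry occurs somewhere in the URL."""
    return any(pattern in url for pattern in bad_patterns)


def validate_patterns(patterns: list[str], known_good_urls: list[str]) -> list[str]:
    """Return a description of every entry that looks unsafe to ship."""
    problems = []
    for pattern in patterns:
        stripped = pattern.strip()
        # Reject entries too short to identify a specific site.
        if len(stripped) < MIN_PATTERN_LENGTH:
            problems.append(f"{pattern!r}: too short to identify a site")
            continue
        # Reject entries that would flag a site we know is safe.
        hits = [url for url in known_good_urls if stripped in url]
        if hits:
            problems.append(f"{pattern!r}: matches known-good URL {hits[0]}")
    return problems


known_good = ["https://www.example.com/", "https://www.example.org/about"]
new_list = ["malware.example.test/download", "/"]

# The stray "/" flags a perfectly safe site.
print(is_flagged("https://www.example.com/", new_list))

# A pre-deploy check catches the bad entry before it reaches production.
for problem in validate_patterns(new_list, known_good):
    print(problem)
```

A check like this turns a one-character typo into a failed validation step instead of an outage, which is exactly the kind of process fix a post-mortem should produce.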

The full post-mortem can be found here.

Things to keep in mind when doing a post-mortem

I read a fantastic article by Adrian Hornsby, a Principal Developer Advocate at Amazon. In it, he discusses what to avoid and what to emphasize in order to write the best post-mortem if you are ever the owner of one.

Here are some things he suggested:

  • Don’t do post-mortems to blame people, teams, or organisations. Instead, focus on the process(es) that failed to allow these mistakes to cause mischief. Never do a post-mortem to punish someone. There’s no value in that, and you won’t find improvements.
  • Don’t assume events that happened were more predictable than they were. Only because they’ve happened are they now obvious. (Hindsight bias)
  • Make sure you go deep enough. Don’t just say “we saw an error in the front-end code.” Really dig into the specific error and the conditions that led to it. How can the process catch this next time? A better QA process? More peer reviews? Better error logging?
  • If your resolution steps to stop it from happening are really vague like “improve documentation” or “train better”, you don’t understand it clearly enough to be more specific. Make these resolution steps actionable and concrete.
  • Try to keep your resolution steps to things that can be done in the short term. We are trying to stop these from happening again as soon as possible. Post-mortems can spawn longer-term process changes, but that’s not what you’re focusing on at the moment. Don’t try to re-architect something fundamental or change the language some huge codebase is written in.
  • Let your post-mortem challenge what your team currently believes to be true. Don’t assume that something is true just because everyone believes it. (Common belief fallacy)
