Blameless Postmortem Culture in Software Engineering
Imagine the following scenario:
One day, a colleague tells you that there is a bug in production and the entire system has suddenly stopped running. He asks for your help in identifying the issue, so you and your teammates start looking for what brought the system down. After some digging, the team traces the failure to a few buggy lines of code, and it turns out that you wrote them.
So who is to blame in this situation? Is it you? It might seem obvious that, since you wrote the buggy code, you alone are to blame. But is that true? Let’s pause for a moment and think through the whole scenario.
Let’s assume that you are to blame because you wrote the code. But wait a minute. Your code goes through several intermediate steps before it goes live, right? Some of your colleagues must have reviewed it, and since they didn’t find the bug, they would be equally to blame. QA tested it and didn’t find any issues either. So if we want to blame anyone or anything, it has to be the entire process: writing the code, reviewing it, testing it, and deploying it to production.
Now let’s get to the point: what is a blameless postmortem? When a critical bug is found in production, the first priority is to identify the bug, not to blame the developer whose code may have caused it. Blaming someone for writing buggy code produces nothing good or productive. Instead, that person may feel demotivated, and other team members may feel insecure, wondering what will happen if they write buggy code in the future. The better practice is therefore to find the actual issue as a team (the postmortem), without blaming anyone, and then take the necessary steps to fix it. As I said earlier, if anyone or anything is to be blamed, it’s the entire process or system involved.
While investigating the issue, it’s good practice to note down the following:
- What actual issue caused the system failure
- What caused that issue
- Why the issue wasn’t caught earlier (for example, during code review or automated/manual testing)
- What should be done to prevent such scenarios in the future
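
One way to make these notes a habit is to give them a fixed shape. The sketch below is only an illustration of the four points above, not a standard template: the class name `PostmortemRecord`, its field names, and the sample incident are all hypothetical. Note that it deliberately has no field for who wrote the code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostmortemRecord:
    """One incident write-up. Names here are illustrative assumptions."""
    issue: str           # what actually caused the system failure
    root_cause: str      # what caused that issue
    detection_gap: str   # why review or testing did not catch it
    preventive_actions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        missing = [
            name
            for name in ("issue", "root_cause", "detection_gap")
            if not getattr(self, name).strip()
        ]
        if not self.preventive_actions:
            missing.append("preventive_actions")
        return missing

    def to_text(self) -> str:
        """Render the record as plain text for sharing with the team."""
        lines = [
            f"Issue: {self.issue}",
            f"Cause: {self.root_cause}",
            f"Why it was not caught: {self.detection_gap}",
            "Preventive actions:",
        ]
        lines += [f"- {action}" for action in self.preventive_actions]
        return "\n".join(lines)


# Hypothetical incident, filled in as the team investigates.
record = PostmortemRecord(
    issue="Checkout service crashed on startup",
    root_cause="A config value was read before it was loaded",
    detection_gap="",
)
print(record.missing_sections())
```

Checking `missing_sections()` before closing the incident is a simple way to make sure the team has answered every question, including the process-level ones.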
A blameless postmortem culture is practiced at large tech companies such as Google. By maintaining this culture, a company becomes aware of the flaws and limitations in its software development process and systems, and can take preventive measures to avoid similar incidents in the future.