“Incident” shouldn’t be a four-letter word
How to make the most of your outages and drive learnings for your team
The meeting was supposed to be short. The incident seemed a straightforward case of human error by the new engineer. Surely all the team had to do was tell the engineer not to do it again and move on, right?
But as the meeting went on, a series of perplexing questions began to be raised:
- “How did the new engineer have access to break this in the first place?”
- “What about our testing standards… How did they not catch this?”
- “Actually, the change shouldn’t have impacted this service!”
- “Luckily we were able to roll back quickly thanks to the instrumentation so-and-so wrote.”
- “Well so-and-so left last year, do other services have the same instrumentation?”
- “Wait! How are these 2 services even related?”
- “I’m not sure, so-and-so wrote this doc. It seems outdated and they’re no longer with this org.”
A familiar scene?
When an incident happens, teams often want to move on from it quickly; they find a root cause and add the expected action items to the ticketing queue. But research into high-performing teams conducting incident analysis [1] shows that approaching an incident from a position of inquiry instead of incredulity reveals more details to help an organization learn. While the desire to move past a failure with a cursory understanding can be strong, there is always much more to the story.
I’ve recounted this same familiar story many times in various talks and cons. Here’s the principal lesson learned: we can all say we have ‘made mistakes’ and they’ve impacted other people. That’s beside the point - especially from an organizational perspective. The point is to ask how it was possible to even make that mistake at all.
When we focus on the “root cause”, “human error”, or “why” of an incident, we lose valuable opportunities to learn how the conditions for it came about in the first place (not to mention all of the operational data that comes with it). In this sense, incidents are major opportunities for your business. They are catalysts for understanding your organization: for understanding the difference between how you think your systems (tech, people, culture) work and how they actually work in practice. And, in today’s world, where many digital services are critical services, learning from incidents is both a competitive advantage and a necessity for the safety of your users.
But as we try to get better and learn from our incidents, we don’t always know what to do or look for. As noted cognitive psychologist and researcher Gary Klein explains in his book on human performance in everyday working conditions [2]:
Performance Improvement = Error reduction + Insight generation
Up to this point, you may have seen folks who want to improve over-index on the “error reduction” part of the equation: they emphasize incident metrics like Mean Time To Respond or number of incidents, and neglect generating insights about what those incidents say about how well the organization copes with surprises. By investing in learning and in generating quality insights from each individual incident, you will be able to provide context around your incident metrics and show a more complete picture of performance improvements. This is why I started Jeli.
Okay, you’re convinced
Now it’s time to focus on learning from your incidents and change your company culture for the better. But where do you begin?
Five years ago, Etsy began to address this high barrier to entry by releasing the Debriefing Facilitation Guide for Blameless Post-mortems. This guide became the de facto reference for tech organizations wanting to complete post-incident activities to generate insights. Since then, and particularly for the past three years, other passionate practitioners in the technology industry and I have been meeting in the Learning From Incidents community to learn and teach each other about organizational behavior, human factors, Resilience Engineering, and beyond, implementing these learnings in our organizations, and then reporting back. This resulted in some of the most enlightening and thought-provoking conversations many of us have ever had.
Thanks to the hard work of this community as well as my experienced, dedicated team at Jeli.io, Howie: The Post-Incident Guide was released on December 8th, 2021. It is a free resource, meant to serve as a roadmap for folks interested in doing this work but who are unsure where to start. Regardless of what stage you are at in your Learning from Incidents evolution, this guide is for you.
Its goal is to explain how to get the most out of your incidents, including concrete strategies to help your company develop an incident analysis program. It will walk you through the stages of an investigation, teach you how to lead a learning review meeting, help you complete an incident report (one that tells the story of “How” rather than “identify-troubleshoot-resolve”), integrate any additional findings and action items before you finalize, and give practical advice for sharing your learnings with your organization. It’s made to deliver a document that is read rather than filed.
Our team of “Jelly beans” calls this investigation process the How We Got Here Process (or, affectionately, “Howie” for short), and it is applicable to companies of any size and at any stage of maturity with their analysis programs. In fact, we built Howie to be customizable regardless of organization size, investigator skill, or severity of incidents. It’s a thorough guide that is easy to follow in full, or you can adopt bits and pieces to work into your own processes.
Use our learnings, discover more of your own
We truly believe that incident analysis can be your organization’s secret weapon that will allow you to gain value from your incidents, but we know getting started can be a daunting task. We’ve been in your shoes and we’ve seen and heard how excruciatingly intimidating it is for many engineers to lead an incident review. This guide is your toolbox, packed with practical, easy-to-adopt strategies for getting you set up to do your first one.
Hopefully this post has convinced you why you should care about learning from your incidents. “Incident” in the software industry shouldn’t be equated with a four-letter word. These unexpected operational surprises that happen to our systems are opportunities to learn about how we work as a team and how our systems behave under different kinds of conditions. Not taking the time to investigate them leaves valuable insights on the table and puts us at risk of employee burn-out and “repeat” system failures.
We gave you a sneak peek of the how, but if you want to learn more, check out Howie: The Post-Incident Guide. You won’t be disappointed, and your investigations, whether they take 30 minutes or 3 hours, will change for the better.