Gergely Orosz newsletter.pragmaticengineer.com

Inside the longest Atlassian outage of all time

Gergely Orosz did an excellent job detailing the ins & outs of Atlassian’s epic outage:

Hundreds of companies have no access to JIRA, Confluence and OpsGenie. What can engineering teams learn from the poor handling of this outage?

The TL;DR on the cause of the outage: a script that was supposed to “mark for deletion” some records also had “permanently delete” functionality, and it was run against the wrong list of IDs, improperly deleting 400 of their customers. Oh, and their backup restore process is really good at restoring all customers at once, but not a subset. Ruh roh!
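
To make that failure mode concrete, here is a minimal sketch of a deletion helper that defaults to the reversible “mark for deletion” path and refuses to run on a list containing unexpected IDs. Everything in it is hypothetical and for illustration only: the `Store` class, the `kind` field, and the `confirm_permanent` flag are assumptions, not anything taken from Atlassian's actual script.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    MARK_FOR_DELETION = "mark"      # reversible soft delete
    PERMANENT_DELETE = "permanent"  # irreversible hard delete


@dataclass
class Store:
    """Hypothetical in-memory stand-in for a record store."""
    # Maps record ID -> {"kind": str, "marked": bool}
    records: dict = field(default_factory=dict)


def delete_records(store, ids, expected_kind,
                   mode=Mode.MARK_FOR_DELETION, confirm_permanent=False):
    """Delete or mark records, aborting before any change if the ID list looks wrong."""
    # Validate the whole list first, so one bad ID stops the entire run.
    wrong = [i for i in ids
             if i not in store.records
             or store.records[i]["kind"] != expected_kind]
    if wrong:
        raise ValueError(
            f"refusing to run: {len(wrong)} ID(s) are not of kind {expected_kind!r}"
        )

    # The irreversible path needs a second, explicit opt-in.
    if mode is Mode.PERMANENT_DELETE and not confirm_permanent:
        raise PermissionError("permanent delete requires confirm_permanent=True")

    for i in ids:
        if mode is Mode.PERMANENT_DELETE:
            store.records.pop(i, None)  # pop tolerates duplicate IDs in the list
        else:
            store.records[i]["marked"] = True
    return len(ids)
```

In this sketch, calling `delete_records(store, ids, expected_kind="app")` with a list that contains a customer ID raises before any record is touched, and the default mode only sets a flag that can be cleared later.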

Lots to learn here, and Gergely puts a fine point on the biggest takeaways. A must-read!

