Gergely Orosz did an excellent job detailing the ins & outs of Atlassian’s epic outage:
Hundreds of companies have no access to JIRA, Confluence and OpsGenie. What can engineering teams learn from the poor handling of this outage?
The TL;DR on the cause of the outage: a script that was supposed to “mark for deletion” some records also had “permanently delete” functionality, and it was run against the wrong list of IDs, improperly deleting the data of 400 of their customers. Oh, and their backup restore process is really good at restoring all customers, but not a subset. Ruh roh!
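To make that failure mode concrete, here is a minimal sketch of the kind of guardrails a deletion script could have. It is not Atlassian's script: every name in it (`Mode`, `Record`, `run_deletion`, the `kind` labels, the in-memory `store`) is hypothetical and only there for illustration. The idea is that the reversible mode is the default, the irreversible mode needs an explicit opt-in, and the whole batch is rejected before anything is touched if any ID is not the kind of record the operator expected.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MARK_FOR_DELETION = "mark"        # reversible soft delete
    PERMANENT_DELETE = "permanent"    # irreversible


@dataclass
class Record:
    record_id: str
    kind: str                         # hypothetical label, e.g. "app" or "site"
    marked_for_deletion: bool = False


def run_deletion(store, ids, expected_kind,
                 mode=Mode.MARK_FOR_DELETION, confirm_permanent=False):
    """Delete `ids` from `store` (a dict of id -> Record).

    Refuses the whole batch if any ID is unknown or of the wrong kind,
    so a wrong input list fails before any record is changed.
    """
    if mode is Mode.PERMANENT_DELETE and not confirm_permanent:
        raise PermissionError("permanent delete requires confirm_permanent=True")

    # Validate every ID up front; nothing is modified until all checks pass.
    unknown = [i for i in ids if i not in store]
    if unknown:
        raise KeyError(f"unknown IDs: {unknown}")
    wrong_kind = [i for i in ids if store[i].kind != expected_kind]
    if wrong_kind:
        raise ValueError(f"expected kind {expected_kind!r}, got others: {wrong_kind}")

    for i in ids:
        if mode is Mode.PERMANENT_DELETE:
            del store[i]
        else:
            store[i].marked_for_deletion = True
    return len(ids)
```

With a check like this, passing a list of the wrong kind of IDs raises an error instead of deleting anything, and a plain call only marks records for deletion. Whether this would have fit Atlassian's actual tooling is an assumption; the sketch only illustrates the two safeguards the story points at.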
Lots to learn here, and Gergely puts a fine point on the biggest takeaways. A must-read!