Ship It! – Episode #21

Learning from incidents

with Chris Evans & Stephen Whitworth, co-founders of incident.io


Things go wrong all the time. We all make mistakes. And that is okay. What is not okay is to think that it won’t happen, or that there will be someone else around when it does. In that moment, it doesn’t matter who wrote that module, package or microservice. But there is a better way to think about this, and there is an approach that makes people actually look forward to incidents.

It all starts with thinking of incidents as opportunities to learn, and then share those learnings with everyone, so that you can all improve. In this episode, Gerhard is joined by Stephen Whitworth and Chris Evans, incident.io co-founders, and former Staff Engineers at Monzo.

They get it, we get it, and now you can get it too.


Sponsors

Fly.io – Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.

PlanetScale – PlanetScale is the only serverless database platform you can start in an instant and scale indefinitely with unlimited connections. Never think about database servers again. Everything you want to control is available through the beautifully designed PlanetScale CLI. Learn more and start your database in seconds at planetscale.com

Honeycomb – Guess less, know more. When production is running slow, it’s hard to know where problems originate: is it your application code, users, or the underlying systems? With Honeycomb you get a fast, unified, and clear understanding of the one thing driving your business: production. Join the swarm and try Honeycomb free today at honeycomb.io/changelog

FireHydrant – The reliability platform for teams of all sizes. With FireHydrant, teams achieve reliability at scale by enabling speed and consistency from a service deployment to an unexpected outage. Try FireHydrant free for 14 days at firehydrant.io

Notes & Links


Gerhard, Chris & Stephen

Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

So Gergely Orosz - and I may have gotten his name wrong; I’ll try it again… Gergely Orosz - he tweeted in April about this new team that’s forming around the problem that they have been passionate about for some time now; so it was like a natural team that just got together. I was intrigued, as I usually am, that there was something there. I signed up, and shortly after I received the nicest email from Stephen. And it read like this… It was short, to the point, friendly… Really nice. “Hey, Gerhard. Thanks for signing up. I’m a long-time fan of Go Time, Changelog. It was nice to see your name pop up. Just wondering what capacity you’re interested in Incident.io. Let me know. Thanks. Stephen.”

That was great. That was April; a few months have passed, a few more emails have been exchanged, a demo was had, which was really good; thank you very much for that. And Ship It launched - that happened as well in all this time… And I always wanted, at the back of my mind, to have you part of Ship It and part of the Changelog setup. And that happened. Episodes 10 and 20 have more details on how that happened and why it happened. And now, in episode 21, it’s finally happening. It’s a special moment where Chris and Stephen are joining us in person. Welcome.

Thanks for having us.

Hey. Good to be here.

So I’ll go straight to the point… Why Incident.io is important to others - why is it important to others? How does it help others?

[04:01] So what we’re building at Incident.io is the very best way for whole organizations to get involved in incident response. And I guess the context for why we think that’s important is – the world has massively moved on in the last few years; probably more than that. But essentially, organizations where they used to see things like technology [unintelligible 00:04:20.10] that’s no longer the case. These days, technology is deeply intertwined into organizations. Customers have high expectations of companies too, so they want every single service to be online all the time; downtime is just sort of like not acceptable. Along with that, customers have choice as well. So where in the past they might go “Well, my service is a bit rubbish, but I’m sort of stuck with it”, they can just leave. And it’s also [unintelligible 00:04:49.05] where everybody is in the same space. So where ten years ago everyone would be sat in an office, and something goes wrong and everyone sort of like piles into a room - that’s not the case anymore. People have moved into Slack.

So when you look at all those sorts of things rolled up together, the demands when things do go wrong are really high on the people dealing with incidents… Whether that’s engineers who have to fix things, or whether it’s customer support people who have to get information to customers incredibly quickly and sort of have fingers on the pulse there… And fundamentally, it doesn’t feel to us like the tooling in this space has really kept pace with how people are operating. And so typically, what that means is people do one of two things. So they either will go “We have a bunch of tools that sort of help us in this area, and then we’ll write down on paper how we pull them together, and we manage to sort of marshal something into something that looks like a good incident response.”

Or at the other end of the spectrum you might get some leading companies who then try and write their own tooling to encapsulate that process a little bit. And fundamentally, we feel that shouldn’t be the case. People shouldn’t be building this sort of thing themselves; you wouldn’t go out if you’re starting your company today and say “I’m gonna build a paging piece of software, because I need someone to be able to call me when an alert fires.” So we think there’s sort of a parallel here with incident response, and that’s really where I think the motivation for Incident.io came from. Essentially, there’s a problem to be solved, there’s a problem that pretty much every company has, and they’re solving it poorly, and we think we can do a much better job.

I really like your tagline, which is right on the homepage… “Playing the leading role, not all the roles.” That is a very interesting one. Can you expand a little bit? And we can compare what I understand and what you’ve meant by it.

Yeah, absolutely. So when stuff goes wrong in technology organizations, and it goes wrong fairly frequently - you get paged by PagerDuty or Opsgenie, and then you sort of get dropped into this white space where you need to define a process. And what often happens there is that, I guess, you’re floundering. There’s a lot of stuff to do. You might need to go and tell the executive that’s responsible for the area, but you also might need to SSH into the machine and reboot something, or you’re simultaneously trying to investigate the logs and see how bad it is. And in reality, these are probably a few different roles. But the lack of a structured, automated way to pull apart your incident usually means that chaos ensues and you kind of take all of these roles on yourself. And what we’re trying to do is say, “No, you get to encode your process, the way that you’d like to respond to incidents, into the tool”, and as a result we can give those different responsibilities to different people… including taking a lot of the process management onto our tool, so no other human has to do it, and you can really focus on the problem and not the process of actually working through the workflow. You get to focus on logs, or communication, or whatever it is a human is best at doing, as opposed to trying to follow a workflow under high stress, which - we just find that never really works that well.

[07:58] Yeah. I think that’s really powerful, and I’m wondering, from that perspective, what does the ideal incident workflow look like to you? Because a lot of these principles and a lot of these flows that you’re capturing are based on a lot of experience that you share, the founders. So you’ve seen many of these… But what does the ideal incident flow look like to you?

I think that’s a really pertinent question, and I think the answer is somewhat “It depends.” Our view is that there’s a set of core defaults that we think every company should follow. So we want to kind of encapsulate those in the product. But equally, every single company is different, so there’s things that different companies need to be able to imprint into the process, to say “For us, it’s really important when this thing happens, that we engage this team and pull them in.” And those sorts of workflows and automations are different wherever you go.

But if we look at the core of what good incident response looks like, it looks like keeping context all in one place, it looks like having very clear roles to be able to define who should be doing what, it looks like having a structured way to be able to coordinate your response… So everyone should know exactly who’s picking up what actions and when, so you’re not tripping over each other… And it looks like really good communication as well. So that’s like communication internally, within those people that are dealing with the thing that’s broken, it looks like good communication to other folks within your organization… So the exec that’s at home, that needs to stay in the loop, so that if he/she is called upon, in the heat of the moment they have the right information at their fingertips… But then also communication out to your customers. They’re often the last to know.

We see this a lot, where you jump on Twitter and you’re having an issue with something, and you sort of tweet whoever that is, and they come back and go “No, everything is fine” and their status page says the same, and 30 minutes later finally the information will come out. And all those kinds of things are just painful. So yeah, I think good response is built on all of those foundations, with the ability to tweak the bits that are most important to you.

I really like that answer. The reason why I like it is because you mentioned the guiding principles which are essential to good incident management… Less the flow, because it depends, and I know people don’t like hearing that, that it really depends. So as long as we agree on the principles, we know how to shape them to our context; that is really powerful. But I think you were going to say something, Stephen.

Yeah. We think about this a lot internally, and we like to think about it as sort of a scale, with JIRA on one end, as a relatively unopinionated piece of software that you can stitch together into an incredibly powerful thing, [unintelligible 00:10:46.17] know how to do it… And on the other end a tool called Linear, which is the issue-tracking tool of our choice, which is opinionated, fast… If it doesn’t work for you, it’s not going to flex to the way that you want to work. But if it does, it’s amazing. And we’ve tried to place Incident.io consciously towards the Linear end of the spectrum at the moment, which is we think there’s a few, like Chris mentioned, a few core principles to doing incident response really well… And we’re unlikely to flex on those. We’re unlikely to say “Incidents shouldn’t have leads, or they shouldn’t have actions”, or any of these sorts of things… But we realize that there are, like you say, things above the principles that change, such as policies, regulators that need to be contacted in certain situations… And we’re trying to build the core of the product as a very principled, opinionated piece of software, with the right kind of extension points that you can hook in.

Think of it much like a program that you’d build. You’d build your core abstractions, and then when you want to have end consumers, you give them a much smaller, more focused API surface that they can really just go and interact with the product in the right way.

[11:56] I really like that. And to come back and connect to “playing the leading role, not all the roles” - what it meant to me is that you have experience in how these things happen; you have years of experience dealing with incidents at banks… And that’s important. When it’s about people’s money, when there is an incident, it hurts; people can’t pay with cards, and it’s really important that that actually works. And if there is a problem - and there will be problems - how quickly can you solve it? What can you learn from it? So playing the leading role in incident management is really important in today’s world, which is very complex. The systems are only getting more complex, so how does our experience keep up with the complexity? How do our learnings keep up with the complexity, and how do we share them? It comes back to these principles… How do you teach someone to incident-manage? It’s hard, because it depends; and yet, there is a way, and there is a way to instill these core principles and say “This is what’s important.” But what does it mean to you, for example?

One of the things I really liked - and I liked many things, but this is one that really stands out - is that when there is an incident, you can choose to be notified every 30 minutes to give an update. It’s such a simple thing, but so important for keeping people in the loop constantly, and for you yourself to be reminded by the tool, “Hey, it’s time to update.” And you may skip it, you don’t have to do it, but it’s a good thing to have. So it’s stuff like that which was really powerful.

So I think that we get it, I think that I get it. When I say “we”, Changelog.com… Does that qualify, our logo, for your homepage? What do you think?

A hundred percent.

I think we can make that happen.

Yeah. [laughs]

Thank you. [laughs] Good. I like that.

Is this whole podcast recording just an elaborate way to get your logo on our homepage? Is that what it is?

That’s exactly what it is. That’s the only reason why we’re doing this, Chris. You got it!

It’s done now, we can wrap it up. Cheers, Gerhard.

Can you describe for us the context in which Incident.io started? The idea, the team… How did it all begin?

If we wind the clock back a few years now, I’d actually just joined Monzo. I was running their platform team at the time. And as part of that, I was just [unintelligible 00:15:15.07] with picking up responsibility over the on-call function at Monzo. So this is like the engineers that get called when something goes wrong, when the bank’s not working.

And when I picked it up, basically there were a bunch of relatively unhappy engineers who, every time they got paged, were jumping into this one shared Slack channel, trying to navigate a pretty complex application. Banks are very, very complicated… And as a result of all those things, they were really struggling to get more engineers onto their on-call rotation.

[15:46] So I ended up building the most basic solution to try and make that process a little bit easier. The things that we were trying to solve were to allow an engineer who has been paged into an incident to sort of take a little bit off their plate by creating a Slack channel automatically and pinging someone in customer support and saying “Hey, the engineers are dealing with it. If you need to communicate with them, use this thread.” And it was sort of built around a Lambda function, super-simple, very primitive. But it worked really, really well. It sort of just took a tiny bit of effort off of people’s plate, and it sort of did wonders towards people wanting to jump into on-call, people being able to jump into channels and see the entire context of the incident, from end to end.

And then from that point onwards, it just became something that Monzo just continued to build on. And so over the time that I was there, it then became more of an application, and it sort of grew and grew, and then we eventually started speaking about it publicly, and then it led to us open sourcing it.

So I think all of that sort of culminated in Monzo having this tool that the entire organization started using, not just engineers. It was – people in customer operations were declaring incidents through this tooling, and people in money, when stuff went wrong there, were doing it… So yeah, it just became something that we were like, “This is great.” And I think those were the sort of early seeds of this better way to deal with incidents… And I guess, fast-forward a little bit, Stephen, Pete and myself were sort of all technical leaders at Monzo, in a lot of incidents… And I guess it kind of felt like there was space for someone to build something and actually share that with the world. And as I said, Monzo had open sourced what they called Monzo Response, and it was sort of good, and it worked well for Monzo, but when you look at what the software did, it’s similar to what a lot of other companies have done in that space when they’ve had to build something – they’d built something that’s just about good enough and just about fits the needs… And it has rough edges, because it’s sort of no one’s job to build that tooling and own that tooling.

So yeah, that was really what led to us coming up with the idea, it became something we worked on evenings and weekends… Monzo were great at supporting us doing that as well… And yeah, it sort of developed from there and just snowballed into this product that we have today.

Yeah, and I think the background for it starts, for me at least, when I co-founded a company back in 2015 called Ravelin… So we built credit card fraud detection software. We would be in the synchronous payment flow for apps like Deliveroo and Just Eat. So whenever there was an incident, it was automatically relatively high-impacting. And I remember, as someone that was on-call during that time, thinking about the lack of automation that I, as the on-caller, had to essentially deal with - creating channels, and telling the right folks, and going into customer channels and letting them know… And I feel at that time this thing sort of registered in my brain as “Is there any way to put in a credit card and have this problem solved for me?” At that time there certainly wasn’t, and in 2021 it’s still relatively debatable whether that’s been solved as well… So that was another part of the genesis.

Yeah. The one thing which I’ve seen in Incident specifically, and that attracted me to it, is the simplicity, which to me speaks of the iterations that had to happen for the idea to get to the point which it did. So seeing Incident in the first phases - I’m not sure that it’s opened up yet; like, you can sign up, register and request access, but you can’t just put a credit card in yet and then start using it… I don’t know where that is. The point being that using it as a beta user, it felt way more advanced than I would expect a beta product to be. What that meant is the experience - that’s what I’m always thinking about, what is the flow of this product? - and it felt very polished. It felt simple. It felt like “Okay, things are missing.” I mean, you haven’t even launched it properly… But it felt ready. MVP – it’s like more than an MVP. And I liked that.

So what you’ve just told me explains why… It explains that you have been solving this problem in different capacities, in different contexts, and now you’re bringing it to the masses. It gets really simple. And based on what I’ve seen - again, I don’t wanna spoil it for others, but I really liked it. It was great. Simple, to the point… Lots of opportunity, and I think that’s what you want with a new product, where it can go, not necessarily having all the bells and whistles… Because actually that’s what in my opinion makes products bad, when they do too many things. You don’t want that. So focus on the simplicity. And that story that you’ve just shared explains it really nicely, so thank you for that.

[20:26] Yeah. That simplicity isn’t an accident either. That’s very much an active product choice on our part. So something that we want to be true always is that you can install Incident.io into your Slack workspace and you can basically get going and start creating incidents with very little onboarding. And at the core of it, what that means is you need to know one slash command or one message shortcut to create an incident, and then at that point it’s just like Slack, but a little bit better.

So you’re just in a channel; it’s not like a new product experience everyone’s got to learn. You’re in a channel, and what we try to encourage is a learning-by-doing type of approach to using the product. So rather than someone having to figure out everything all in one go, you’ll see someone create an action inside of a Slack channel and be like, “Hah. That’s really cool.” And we’ll give people pointers and nudges as to how they can do that. And this osmosis approach is very deliberate and sort of leads to this kind of organic growth and adoption across organizations. And again, that’s come through experience, of that being a way that worked really well at Monzo. Nobody told someone in customer service that they should start declaring incidents in [unintelligible 00:21:30.09] But it sort of happened, because people saw the process, they were pulled into Incident when they were there as a bystander or an additional piece of support, and they were like “This is great. This is the right way to solve this problem.”

I think you sort of start there, you start with like a Slack with benefits, and the product then layers things on top of that. So when you get to the point where you go “Do you know what? Our organization has grown, we have some complexity we need to navigate in incidents, so if I set a sev1 type incident, I want to create a ticket in a JIRA thing, so that someone who depends on that as a process - we can do that for you. We can automate that.” But you don’t need it from day one, and you can sort of layer up and build up this approach to get a very powerful product eventually, but with none of that sort of steep onboarding curve.

I think fundamentally we have an advantage, because we are building a product that we wanted to use when things were going wrong… I’ve seen a lot of people’s startups where they’re kind of searching for a problem and a pain point, and I think that that is a decent way to find it, but I think we’re just at an advantage from a product perspective of knowing that we have 12-18 months’ worth of stuff that we know we haven’t done yet, but we know that we really, really would want when stuff was going wrong. So as a result, I guess that gives us a bit of a benefit when we’re trying to build things, because we’re not having to search out and find the pain points.

Obviously, our customers are telling us what doesn’t exist and what they want us to add, but I think we have a decent nose for what’s painful as well.

I’m really glad that you’ve mentioned that, Stephen… And now what I’m wondering is how does Incident.io use Incident.io? What does that look like?

That’s like Inception, isn’t it? Turtles all the way down.

So we use it in a few different ways. Incidents is a kind of fuzzy concept to people. For some people, an incident is like the building is burning down, for example. It’s a terrible thing, and it happens once every six months.

Hopefully not… That’s terrible. A building is burning down every six months… No, thank you. [laughs] What kind of a building is that? That’s what I wanna know…

That was a stupid thing to say…

That was too funny… [laughs] Go on.

[23:42] So fundamentally, we have a different view. An incident is really just some kind of interruption that takes you away from what you were currently working on, because it demands a level of urgency for you to respond. So that might look like a particularly severe bug, it might look like a small outage, but it also might look like a really complicated deployment that you’re just about to do. As a result, we use it for a bunch of different use cases. So I’d say the first is more traditional service outages, 500s on the API sort of thing. We’re in the kind of unique position that if we’re having issues with our own product, that may inhibit our ability to use it, but most of the time everything just works totally fine on that. We also use it to figure out particularly complicated bugs, where we’re seeing errors in Sentry… And we’re not quite sure why, but we’re trying to lay out this sort of – think of it, I guess, like a notebook. A way of thinking about and reasoning about the problem.

So we have functionality in Incident.io where, if you pin something in Slack, or emoji-react with a specific emoji, it will get added to your incident timeline. So what you’re doing is you’re sort of diving into things, and when you find a particular point that is high-signal or very useful for understanding what’s going wrong, we pin that as well. And that means that we have this record of what we’re trying to dig into… Which isn’t necessarily just an incident, but is a really, really useful way to use that increased collaboration, better communication side of the product.
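As a rough illustration of how that pin-to-timeline behaviour could be wired up (a sketch, not incident.io’s actual code), a Go service might listen for Slack’s reaction_added event callback and promote the reacted-to message onto a timeline. Slack’s signature verification and URL-verification challenge are omitted, and the emoji name and addToTimeline helper are hypothetical:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Shape of the Slack Events API callback we care about. Field names follow
// Slack's documented reaction_added payload; everything else is illustrative.
type eventCallback struct {
	Type  string `json:"type"`
	Event struct {
		Type     string `json:"type"`     // e.g. "reaction_added"
		Reaction string `json:"reaction"` // emoji name, e.g. "pushpin"
		User     string `json:"user"`
		Item     struct {
			Channel string `json:"channel"`
			TS      string `json:"ts"` // timestamp of the reacted-to message
		} `json:"item"`
	} `json:"event"`
}

// addToTimeline is a stand-in for whatever persists a timeline entry.
func addToTimeline(channel, messageTS, user string) {
	log.Printf("timeline: channel=%s ts=%s pinned-by=%s", channel, messageTS, user)
}

func handleSlackEvent(w http.ResponseWriter, r *http.Request) {
	var cb eventCallback
	if err := json.NewDecoder(r.Body).Decode(&cb); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	// Only a specific emoji promotes a message onto the incident timeline.
	if cb.Event.Type == "reaction_added" && cb.Event.Reaction == "pushpin" {
		addToTimeline(cb.Event.Item.Channel, cb.Event.Item.TS, cb.Event.User)
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/slack/events", handleSlackEvent)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```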

So a few different ways… I think, Chris, you will also give a nice answer, that doesn’t include a building burning down every six months… [laughter]

Yeah, I think you’ve hit the nail on the head. I think using Incident.io incidents for low, low severity things has many benefits. It has the benefit of you just leaving a really, really good trail, so someone else can come along and, first of all, see what you’ve done, see if you’ve reached a solution, and understand your thought process and learn a lot…

There was an engineer at Monzo who used to do this repeatedly, where he would dive into some of the gnarliest bugs, that would scare most people away… And it was just a fascinating read, being able to go “I have this channel. I can look at that, I can look at a timeline”, and you can sort of scan through and catch up on those things… It also acts as like a really nice, structured way to hand over work. So if you are picking up some lower-severity bug or issue in production, but you have to go somewhere, you can be like “Cool. I’ve left all the context in this channel. Pick it up and run with it kind of thing.” So I think all of those things kind of lean towards that… It’s just useful and helpful. There’s very few downsides. I think that’s the main thing - there is such a low cost to starting an incident. You’re talking one slash command, and you’ve then got everything at your fingertips. And lo and behold, if the worst does happen and you’re investigating and you go “Oh, this is really bad”, you are suddenly now already in the place where you need to deal with your incident, with a heap of context that people can then pick up and run with… And surrounded by all of the support and tooling that we’ve got in place there. So if you need to escalate to engineers, they’re a button away. If you need to communicate with your customers via your status page, the same sort of thing.

And that’s the approach we use at Incident.io. As Stephen says, we are just using it for any kind of structured, but interrupt-driven approach to dealing with things.

How many incidents have you had in your instance of Incident.io, do you know? Or do you wanna check?

I can tell you…

I can tell you that the Changelog.com Incident.io is at number four. So the next one would be the fifth one… In a few months. That’s been really, really good. And the thing which I would like to add is that the mentality shift which happened, when it comes to viewing incidents as something positive, something to learn from - like, literally, learning from failure - I loved that shift. Because it’s not a bad thing when it happens. I mean, okay, it is from some perspectives… But not from the perspective of the people that have to handle it. It’s something positive, something to share, it’s something to learn, it’s something to solve. It’s intriguing. I know this may sound controversial, but I’m actually looking forward to the next incident… And that’s a very weird thing to say, but it’s true, because I know what to expect. The flow is fairly easy. I know that value has been produced, in that it will be captured and others can reference what happened, why it happened, and so forth. So the whole negative side of something going wrong is being mitigated by this nice, simple tool. And I like it.

[28:18] Nice. Well, to answer your question from earlier, 91 incidents is what we have declared.

That’s a good one.

How many sev1’s?

We had eight major severity incidents for us.

Over how many months?

A year and a bit. No, maybe a year, something like that.

A year, okay. So one sev1 every month and a half. Okay, that’s interesting. So did this have to do with your production setup, with anything like that? Or what is a sev1 incident, I suppose is what I’m asking.

It’s a good question. So we have sort of like guideline text within the product which sort of helps to sort of steer you to set the right value… [unintelligible 00:28:55.21] We’ve actually had none that we’ve marked as like critical, the top-top severity. These are major, which is what we’d consider sort of seriously impacting, in some form…

And to give you a sense of what some of these are like - Slack having an outage; four of these are “Slack is returning 500s”, or whatever, and we’re at the mercy of them, building on top of their platform… But we’d still consider it an incident, because we own that relationship with our customers, and it’s something that we’d wanna proactively reach out and let them know what’s going on… But yeah, I think in terms of roughly – very handwavy terms, the way we would rate incidents would be… Critical would be the entire app is completely down; you can’t access the dashboard, you can’t access anything through Slack, and it’s there for some prolonged period of time. That’s like the worst possible case of incident. Major would be some key component, some key feature or product flow within the product is not working, and it’s something we need to urgently, urgently all swarm on… And then Minor, which is our only other severity at the moment, is sort of everything else. So that is the big, big bucket of everything from “This is a super-minor, non-impacting bug that I wanna deal with in the open”, through to something sort of causing a minor problem for one customer, sort of thing.

I want to come back and touch on what you were saying earlier, Gerhard, which was around how your behavior with respect to incidents has changed from using our tool… That’s the goal. We are selling technology at one level, but with our most successful customers what we’re actually achieving is this sort of organizational change and acceptance of “Incidents aren’t as scary as we thought they were. They are a way for us to assemble a team of people together, and for us to approach that with this sort of shared mental model of how we’re thinking about this problem.” And as a result – I think Loom is a really good example here. We started off in their platform team, being adopted by a lovely person called [unintelligible 00:30:51.22], who I’ll shout out here… And now, a few months later, we’re used by 80% of the organization. And that is really a reflection of the fact that it’s not just about the engineering team anymore, it’s about customer support, it’s about sales, it’s about executives… Incidents are fundamentally social, and you need to build a product that acknowledges that and leans into it, and that is really where we’re trying to head. We’re not trying to build the best tool for SREs. SRE is important, they need tools, but we think that essentially the rest of the organization has been left out of these tools for too long, and we really want to build stuff that brings in the rest of them. So I’m very excited to hear about your approach and your experience with us.

I’m wondering, what does the Incident.io production setup look like? You know what ours looks like, we’re very public about it… But what does your production setup look like?

It’s intentionally very simple. We run a Go app, which is just a single binary, on Heroku; so that runs all of our own infrastructure. We use Postgres as a backing store, GitHub stores all of our code, we run tests and deploy using CircleCI… I’m trying to think – and a little bit of BigQuery and Stackdriver tracing and monitoring as well. So intentionally, trying to maintain as few moving parts as possible, and get very rich cloud providers to do that for us wherever we can.

Yeah. I think it’s [unintelligible 00:34:50.29] I’ve come from a world where I was responsible for everything, from the lowest-level moving parts in your storage system, through to deployment tooling, and all these kinds of things, and it’s genuinely a wonderful experience being in a “serverless” environment, where we haven’t got a single server that we have to run and manage, which is lovely… Essentially, we get to focus all of our time on writing the code, which time and time again has let us ship features incredibly quickly… So - not uncommon for someone to raise a feature request, certainly in the early days, when we were in this very fast shipping and iterating kind of mode… Raise it in the morning, at 10 AM, and by lunchtime their feature is in production. We’re still doing that today; we’re also working on some longer-term, strategic things along the way.

That sounds amazing. That sounds like the dream place to be in when it comes to iterating, when it comes to shipping features out there, seeing how they work… So you mentioned a single Go binary - is that right? So no microservices, a monolithic Go binary… Is that what it is?

[35:57] Yeah. So it’s broken down by services internally. So a service would be responsible for maintaining actions, or listing and updating custom fields against an incident. So we sort of factored everything out internally, but the fact that everything is just in one binary makes testing, deployment, communication a whole lot easier. And this isn’t to say in the future this might not change, but there’s just something very refreshing about running a Go app on Heroku, connecting to Postgres, and just really not having to worry about a huge amount else.
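To make the “services inside one binary” idea concrete, here is a minimal sketch with hypothetical package and type names rather than incident.io’s real ones. Each internal service is just a separate type compiled into the same process and mounted on the same HTTP listener:

```go
package main

import (
	"context"
	"log"
	"net/http"
)

// Each internal "service" is a struct with its own responsibilities,
// compiled into the same binary and sharing one database, one deploy.
type ActionsService struct{ /* db handle, etc. */ }

func (s *ActionsService) Create(ctx context.Context, incidentID, description string) error {
	log.Printf("action created for incident %s: %s", incidentID, description)
	return nil
}

type CustomFieldsService struct{ /* db handle, etc. */ }

func (s *CustomFieldsService) List(ctx context.Context, incidentID string) ([]string, error) {
	return []string{"Affected team", "Customer impact"}, nil
}

func main() {
	actions := &ActionsService{}
	fields := &CustomFieldsService{}

	mux := http.NewServeMux()
	mux.HandleFunc("/actions", func(w http.ResponseWriter, r *http.Request) {
		_ = actions.Create(r.Context(), "INC-123", "Roll back the deploy")
	})
	mux.HandleFunc("/custom-fields", func(w http.ResponseWriter, r *http.Request) {
		_, _ = fields.List(r.Context(), "INC-123")
	})

	// One process, one listener: testing, deployment and internal
	// communication stay simple.
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```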

Worth highlighting as well that there are multiple replicas of that single [unintelligible 00:36:32.27]

Of course, of course. Yeah, that was an important one. What about the assets? Do you bake them into the single Go binary as well? Is that how you deploy assets?

This is [unintelligible 00:36:45.11]

Yeah, for like the website. Like, Incident.io, when it loads up, all the assets - the CSS, the JavaScript, the images… Where do they live?

They’re served through the Go binary as well. So we have Netlify for our website, and that handles everything there. But everything from the actual application itself, including the frontend and backend, is served all from the same Go binary.

Okay. So the website part is deployed separately. That’s like your Netlify deployment. But the API, which is the thing that Slack interacts with - that is your Go binary.

Absolutely.
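One common way to serve frontend assets straight from a Go binary is the standard library’s embed package (Go 1.16+). Whether incident.io does it exactly this way is an assumption, and the dist directory name is illustrative:

```go
package main

import (
	"embed"
	"io/fs"
	"log"
	"net/http"
)

// Compile the built frontend (CSS, JS, images) straight into the binary.
// The directory name is illustrative; any build output folder works.
//go:embed dist/*
var assets embed.FS

func main() {
	// Strip the "dist" prefix so /assets/app.css maps to dist/app.css.
	staticFiles, err := fs.Sub(assets, "dist")
	if err != nil {
		log.Fatal(err)
	}

	mux := http.NewServeMux()
	mux.Handle("/assets/", http.StripPrefix("/assets/", http.FileServer(http.FS(staticFiles))))
	mux.HandleFunc("/api/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"ok":true}`)) // API handlers live in the same binary
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```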

Okay. That was a really interesting thing. I couldn’t figure out, “How do I get the images, the screenshots that I do for incidents, on the incident page?” And I figured out that if you [unintelligible 00:37:27.22] things in Slack, you’re actually serving them from Slack, is that right?

Not quite. There’s some hidden complexity inside of Slack around images and being able to serve those. So there are two types of ways that images will show up within Slack. One of those is like an unfurl; so if you have a public image URL, for example, you post in Slack, that will unfurl in Slack. And if you were to pin that, we could show on the timeline just by sort of using that original source URL.

There is a second type of image that will display, which is an upload that you’ve done. So if I have an image of my laptop and I decide to upload that into my Slack workspace, that goes into Slack. Slack stores it on their servers, rather than unfurling from somewhere external… And it presents it out to you. And the URL that they present it out to you on is an authenticated URL, so you have to manage some of that complexity if you were to serve it through Slack.

So what we do actually is we anonymize images, we upload them to Google Cloud Storage, and then when you come to render your timeline, what we will do is we will enrich that [unintelligible 00:38:27.20] timeline item with a signed, short-lived URL for that image, to serve it out, basically.

That’s interesting.

So a little bit of complexity to get that seemingly simple feature working.
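A minimal sketch of that last step, minting a short-lived signed URL for an object in Google Cloud Storage with the cloud.google.com/go/storage client. The bucket name, object path, service account and key file are all illustrative, and how incident.io actually signs its URLs may differ:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"cloud.google.com/go/storage"
)

// signedImageURL mints a short-lived, read-only URL for an image that was
// previously copied out of Slack into a GCS bucket. All names are illustrative.
func signedImageURL(object string) (string, error) {
	key, err := os.ReadFile("service-account.pem") // signing key for the service account
	if err != nil {
		return "", fmt.Errorf("reading signing key: %w", err)
	}
	return storage.SignedURL("incident-timeline-images", object, &storage.SignedURLOptions{
		GoogleAccessID: "timeline@example-project.iam.gserviceaccount.com",
		PrivateKey:     key,
		Method:         "GET",
		Expires:        time.Now().Add(15 * time.Minute), // short-lived on purpose
	})
}

func main() {
	url, err := signedImageURL("inc-123/screenshot.png")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(url) // enriched into the timeline item when the page renders
}
```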

Because I was wondering, where do you store those images? You have to put them somewhere, if you can’t get them from Slack… Which kind of makes sense. You have to store them somewhere, and Google storage seems to be the place where you do that from. Interesting. I like that.

So the simplicity - I can see how this keeps coming back. It seems to be a theme, keeping things simple, so that you can iterate faster. I think there’s something there… That is obviously an understatement - I’m being a bit ironic, because yes, that’s exactly how it works. If you keep things simple, on purpose, things will be fast, things will be straightforward. That’s exactly it. So I like that even in your infrastructure setup, that’s how it works.

Do you use any feature flags when you have new features? How do you ship new features before they’re finished?

We do use feature flags. We don’t have a particularly sophisticated setup there yet. So we’re not using an Optimizely or LaunchDarkly or whatever the products are that do that. But we do have mechanisms internally to be able to say “This is just for us”, so we will quite often test things ourselves in production to be able to do that.
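An internal “just for us” flag can be as small as a map lookup. This is a sketch of that idea, with made-up feature and organisation names, not a description of incident.io’s actual mechanism:

```go
package main

import "fmt"

// A deliberately simple flag store: feature name to the set of organisations
// allowed to see it while it is being tested in production.
var featureFlags = map[string]map[string]bool{
	"workflows-beta": {"incident-io": true}, // "just for us" while we dogfood
}

func enabled(feature, org string) bool {
	return featureFlags[feature][org]
}

func main() {
	fmt.Println(enabled("workflows-beta", "incident-io")) // true
	fmt.Println(enabled("workflows-beta", "changelog"))   // false until rolled out
}
```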

[39:43] I expect, as we grow, we will start growing the maturity around that, so that we can start building things for specific customers, and toggling it just for them, to help us build it in the open and get their feedback as we go. We’ve had a few companies actually that have been essentially design partners on features, and it’s just incredibly useful to have someone with a real-world use case and a real need for a thing, and sort of building it with them and shaping it, rather than the – I mean, clearly, no one’s gonna be doing a sort of waterfall “Give me all your requirements and I’ll build you your thing.” But even just in a world where you’re building it week on week, and you have to send them updates, that’s clearly a lot less good than “Here’s this thing that’s about 30% done, and it doesn’t do all of these other things, but you can play with that 30% in your live environment and it will work and you can give us feedback from that.”

We’re also very open about what we’re working on… We have a public product roadmap which people can visit on Notion. We have a Slack community full of wonderful people that we also tell what we’re going to build next… And coming back to the infrastructure side of things, this is all very intentional, because as an early-stage company, we are essentially trying to search for the product that solves the problems that our customers have. We can only do that, or we can do that most effectively if we can build things really quickly and see if what we think is true is actually true. And we can only do that if people can see what’s coming up next as well, so that they can help us prioritize and say “Actually, I’d really love to be able to automate things in my incidents, rather than have an API so I can automate them myself.” And being able to do that sort of prioritization, both with a customer directly, but also with all of our customers, and be able to ship that stuff really quickly is really useful, and again, is just why we build stuff in as simple a way as we can get away with, essentially.

Now that you’ve mentioned this, Stephen, it reminded me of a feature that I was looking for and couldn’t find, and that’s runbooks. And I was wondering, where do you sit with runbooks? Do you see them as part of Incident.io? How do you think about them?

Yeah, it’s a great question. Fundamentally, we’re trying to build the sort of rails that you will run your incident process on. So the automation. And runbooks are a great way of saying, “Hey, this is a particular type of incident, and in this case you want to go do A, B and C.” What we’ve found up until this point is that, I guess from a product perspective, we’re not sure where this should live. In previous companies, these have lived in GitHub repositories, in other places like Confluence… Some products offer executable runbooks, so you can actually just go in and SSH into a node, and in the document you actually have a live shell… And it’s really just a – we haven’t figured out the right approach for it yet, which is why we haven’t built it. We’re going to get to it in a few months’ time.

The first thing that we’re going to build in order to make that more powerful is workflows. Workflows are a way to – think of it a bit like Zapier or IFTTT for incidents. So you can say, in a particular case - let’s say in a platform incident - “I want to go and page the on-call engineer, I want to send an email to this particular address, and I want to go and create five of these actions in Incident.io.” That kind of looks a bit like a runbook, and we’re not sure - is a runbook a set of actions? Is it a document? We’re not totally sure yet… But what we are sure about is you’re going to want different runbooks based on different things, and we need to give you that layer of being able to say “This incident is different to that incident, and in this case, do something different.” And then once we have that, we can essentially build better runbooks on top of that. Sorry, that was a bit complicated…
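To make the “condition plus actions” shape of a workflow concrete, here is a sketch in Go. The incident fields, trigger and actions are invented for illustration; it only shows the Zapier/IFTTT-style idea of matching an incident and then running a list of steps:

```go
package main

import "fmt"

// Incident is the minimal shape a workflow needs to match against.
type Incident struct {
	ID       string
	Severity string
	Type     string // e.g. "platform"
}

// Workflow pairs a trigger condition with a list of actions.
type Workflow struct {
	Name      string
	Condition func(Incident) bool
	Actions   []func(Incident)
}

func main() {
	platformEscalation := Workflow{
		Name:      "Platform incident escalation",
		Condition: func(i Incident) bool { return i.Type == "platform" && i.Severity == "major" },
		Actions: []func(Incident){
			func(i Incident) { fmt.Println("page the on-call engineer for", i.ID) },
			func(i Incident) { fmt.Println("email platform-leads@example.com about", i.ID) },
			func(i Incident) { fmt.Println("create follow-up actions on", i.ID) },
		},
	}

	inc := Incident{ID: "INC-42", Severity: "major", Type: "platform"}
	if platformEscalation.Condition(inc) {
		for _, run := range platformEscalation.Actions {
			run(inc)
		}
	}
}
```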

[43:41] No, that was very good. That was very good. I’m thinking in my head how this links to my experience and what specifically I’m missing in Incident.io from the Changelog incidents which I ran. One of them - and actually, even more - I’ve caught myself wanting to write down, like “This is one of the steps that you will need to take. And by the way, this step links to this other step.” And before you know it, you have like a series of steps that you may want to follow. And some may be optional, because I don’t know if the same thing will happen next time… But I know that this is where I need to look, and this is important, and this maybe is relevant, but I don’t know, because it was relevant now… So a way to capture almost like the steps that were followed to solve an incident, to understand an incident, whatever the case may be… And what we have even today - we have some make targets… You can laugh. That’s funny. Like, why would you have make targets for –

For following processes, like a series of steps, right? So we do like make how-to-rotate-secrets. And then it gives you a series of steps, you press Yes to the next one, next one, next one, and then eventually you have rotated the secret. For example, how to upgrade Elixir. You run the make target, and it shows you step by step what you need to do; and there’s this file, and there’s that file, and a couple of files. Now, could they be automated? Yes. Should they be automated? We don’t know, because it depends how often we use them. So it’s almost like there is a lot of knowledge that can be captured in these incidents, and by seeing which incidents keep coming up – and again, an incident is not something bad. It’s something that needs to improve; so there’s that positive mindset. Like, credentials have leaked; I need to rotate them. It’s an incident. So what are the steps I need to follow to rotate credentials? That’s one way that I’m looking at it. So that is my perspective, and that’s how I’m approaching this.

I think that’s very legitimate. What we’re trying to build at Incident.io is essentially a structured store of information that takes data from Slack, from Zoom, from escalations through PagerDuty, from errors in Sentry, and sort of pulls it down into a set of rows and columns, whereas previously it was scattered throughout all of these tools… And then once we have this structured data that says “Okay, Chris was in this incident, and then Stephen was paged in, and it affected this particular product”, that is now queryable, structured information that you can go and do interesting things with, like recommendations. Does this look similar to something else that has happened? There’s lots and lots of stuff there. We haven’t really dipped our toe into it yet, but above us sits a whole layer of monitoring; the Datadogs, the Grafanas of the world. We’re not currently ingesting any of that information, like deployments, or any monitoring information, but you can imagine that our set of structured information becomes a lot richer when we integrate back upwards into those tools… But also, we don’t want to be this silo of information that hides it away in our SaaS tool. We would also like to build APIs, integrations, exports to BigQuery… Just ways of getting your data back out into your own tools, such that you can really just set off of this structured set of information and build what you want off the end of it.

So yeah, I think there’s a lot of stuff here. We’ve barely just scratched the surface of what will be useful once we’ve got all this stuff in Postgres, essentially.
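As a sketch of what those “rows and columns” could look like once events are normalised, here are some illustrative Go structs and the kind of question they make easy to ask. None of this reflects incident.io’s actual schema:

```go
package main

import "time"

// Incident and TimelineEvent are illustrative records; once data from Slack,
// PagerDuty, Sentry and so on lands in shapes like these, it becomes queryable.
type Incident struct {
	ID        string
	Severity  string
	StartedAt time.Time
	Lead      string // e.g. "Chris"
}

type TimelineEvent struct {
	IncidentID string
	OccurredAt time.Time
	Source     string // "slack", "pagerduty", "sentry", ...
	Kind       string // "message_pinned", "escalation", "error_spike", ...
	Summary    string
}

// similarIncidents is the kind of question structured data makes cheap to ask:
// "have we seen an incident that looks like this one before?"
func similarIncidents(all []Incident, candidate Incident) []Incident {
	var out []Incident
	for _, inc := range all {
		if inc.ID != candidate.ID && inc.Severity == candidate.Severity {
			out = append(out, inc)
		}
	}
	return out
}

func main() {
	_ = similarIncidents(nil, Incident{ID: "INC-42", Severity: "major"})
}
```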

This is really attractive to me from the perspective of - building something simple requires you to understand the problem really well. That takes time. Building something complex is fairly easy. “Sling some stuff. Does it work? Well, it works for some; it’s okay. Let’s just move on. More features, more features.” And before we know it, no one actually wants to use the product, because it’s too complicated. So I’ve seen so many products fail in that way. So the attractive part is this relentless focus on simplicity, keeping it simple, understanding it well. What makes sense. Like, “Okay, Gerhard told us this. What are other customers telling us? What makes sense for them, and what is this common thread which delivers on 80% of what people are asking for? And the remaining 20% is too complicated, maybe not worth doing. But let’s focus on the 80%, which is the majority.” So I like that approach. That makes a lot of sense to me.

Have you heard of the Mark Twain quote…? It says “I didn’t have time to write you a short letter, so I wrote you a long one instead.” That is extremely applicable to product development.

[48:11] Definitely.

It takes time to build something simple.

Speaking of letters, I’m thinking about your blog posts. Some of them are really, really good. I can tell you which my favorite one is, but I’m wondering which is your favorite, Chris. It doesn’t have to be yours, by the way. It can be Stephen’s, or Pete’s…

Oh, it’s gonna be mine, obviously. What are you talking about…? [laughter]

I’ll tell you one I actually really enjoyed both researching and writing. That was the one that was around learning from incidents in Formula 1. It’s sort of less an opinion piece on how people should do incidents or anything else, but more a spotlight on an incident that I think was run impeccably well. This was one where a minor Formula 1 crash happened when a driver was making his way from the pits around to the grid before the race had started, and caused some damage… And they then fixed the car, from the starting grid, faster than they’d ever done it before, and with none of the garage things they needed around them. It was just incredible… This was all captured on video as well.

I think when you look at that, there’s just so much that essentially anyone who’s dealing with incidents can learn from that. And I think that’s a really important thread, actually, for us at Incident.io. We’re breaking new ground in many ways, but incidents have been around for – stuff has gone wrong for a very long time, so there’s a lot of interesting learnings that we can take from other industries, whether it’s Formula 1, or incident command on fire response type thing… There’s so much to learn, and I think that blog post - yeah, I really enjoyed writing it. But the video, if you haven’t seen it, go and take a look. It’s a really fascinating watch.

Okay. I was gonna say that was my favorite, but now I can’t, because it’s your favorite… And I like cars. I like Formula 1, but especially cars, and it really resonated with me. I’m a visual person, I like videos, so I liked it… But there’s another one which is a second close. But before I reveal mine, Stephen, which one is yours?

Rather predictably, I’m gonna pick one of my own blog posts, following Chris’ trend… So I wrote a blog post called “Incidents are for everyone.” Fundamentally, this is, I guess, a calling card; like, the thing that we are building Incident.io with the belief of… It’s that current tooling is very, very focused on engineers. So think sort of the PagerDuties and the Opsgenies of the world - these are engineering tools; they present JSON to people. They are very good, but they are not particularly comprehensible to someone working in, say, customer support, or in the sales team, or in the executive team. This is not a slight on their intelligence or anything like that, but it’s just not a product that they’re used to. And fundamentally, we think that incidents just do involve way more teams than engineering.

[51:00] So if you think about an incident at a bank, for example. If payments are failing, that might be because some Kubernetes pod is having issues. But actually, that’s a [unintelligible 00:51:09.17] reportable incident. It needs to have an incident manager there. Executives need to be there to make a call. Customer support has lots of people that are waiting to chat with them. And all of these people need to be involved and present in the incident as well. And really, that is why we’re building Incident.io. We’re trying to build a tool that caters to the needs of these folks that have been, I guess, too long left out of incidents. We want to build something that feels native to them, and that allows them to get [unintelligible 00:51:37.26]

This keeps coming up, and I cannot help but notice it, and even mention it in this case… Bringing people together - really powerful. I mean, that’s what it sounds like to me. We need to bring these people together, because each and every one of them has something of value… But they’re not talking in the right ways. Or if they’re talking, there’s too much information overload. So how can we simplify, condense and compress those really valuable pieces of information in a way that people can understand, follow, relate to, go back to, learn from… Super, super-important. And I like this - bringing people together.

I have to ask you, Gerhard, [unintelligible 00:52:17.05] your favorite one.

“Why more incidents is not a bad thing.” Actually, “is no bad thing.” July 1st, 2021. This is something which I mentioned earlier, in that I’m looking forward to incidents… Which is really weird. But Incident.io makes me do that… Which again, is that mindshift which I talked about. So if you are set to learn from failure - which, in my mind, is the title that we’ll give this episode, but you let me know if you have a better one in mind… If you’re learning, if you’re continuously learning, if you can make it a positive experience, then you’ll be looking to do more of that. And this is applicable to almost everything. If it’s fun, you wanna have more of it. Don’t have too much fun, because that can be a bad thing… [laughter] But can you combine being responsible, being adult, sharing information, having fun - and if you can combine all those things, bring people together? What can be better than that? I don’t know. I’m yet to discover it. I’m not saying there isn’t something better than this… But this sounds like a really good proposition to me. So I quite like that.

Yeah. We think this blog post is essentially an acceptance of reality. Stuff breaks all the time, in little and large ways… You can try and ignore that, or you can solve it in Slack DMs, or you can accept it, use it as a signal to inform what you should do next… And like you say, try and have fun whilst you solve it as a team, together.

I was going to say which is your key takeaway, but we’ve had quite a few takeaways in this last part; they’re all very valuable. The blog posts are really good. They’re not too long, not too short, they’re just the right amount. Go check them out… And keep looking forward to incidents. It’s not a bad thing.

Thank you very much for joining me. This was great fun. I’m looking forward to the next one.

Our pleasure.

Thanks so much. We really enjoyed it.

Thank you.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
