Ship It! – Episode #38

Go for the bananas

with Gunnar Holwerda & Tom Pansino


Gunnar Holwerda (Engineering Manager) and Tom Pansino (DevOps Team Lead) share with us a few stories about how the teams at opensesame.com manage AWS operational complexity. The first link in the episode show notes is the slides that Tom & Gunnar prepared for this conversation. Check them out as you hear us speak about the Inverse Conway Manoeuvre, and why you should always go for the bananas.

If you like this episode, and have a similar story to share, please reach out to us. We all love real-world stories that we can learn from, and perhaps contribute to.


Sponsors

Raygun – Never miss another mission-critical issue again — Raygun Alerting is now available for Crash Reporting and Real User Monitoring, to make sure you are quickly notified of the errors, crashes, and front-end performance issues that matter most to you and your business. Set thresholds for your alerts based on an increase in error count, a spike in load time, or new issues introduced in the latest deployment. Start your free 14-day trial at Raygun.com

Retool – Retool is a low-code platform built specifically for developers that makes it fast and easy to build internal tools. Instead of building internal tools from scratch, the world’s best teams, from startups to Fortune 500s, are using Retool to power their internal apps. Learn more and try it for free at retool.com/changelog

Rewatch – Rewatch gives product and engineering teams async superpowers and helps them move faster with greater clarity. Imagine all of your team’s videos, all in one place. Record, organize, and share the videos that your team needs to ship great work. Get started for free with a 14-day trial at rewatch.com.

Datadog – SaaS monitoring and security platform enabling full-stack observability for developers, IT operations, security and business teams in the cloud age. Their unified platform, along with 500+ vendor-backed integrations, allows you to correlate metrics, traces, logs and security signals across your applications, infrastructure and third-party services in a single pane of glass. Learn more at datadog.com/changelog


Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

So some of our listeners like Andrew Gunter and Lars Wikman - they’ve been asking for stories from the trenches, about the good and the bad parts of various tools that people have used. And while episodes #30 and #28 do come to mind, it’s been quite a while since we did anything like this, which is why when you, Gunnar, made a proposal to talk about this, it got me very excited. So we have Gunnar and Tom joining us today to share one or a few real-world stories from the trenches. Welcome to Ship It.

Hello. Thank you for having us.

I am Tom.

Hey, Tom. Hey, Gunnar.

Hello, I’m Gunnar. Yeah, I’m the original submitter of this. And I think what Tom has created for us at OpenSesame is super-awesome.

Why did you reach out? Is it because Tom is awesome? Is that what made you reach out, Gunnar?

Yeah. I mean, it was pretty much – I think Tom was super awesome. People have got to know more about Tom. That’s part of it. But really, it was - as we scaled our teams and needed a way to kind of change how we did our AWS accounts and things of that sort, the way that they went about doing this and the system they built, I thought was just super-cool. And then this is my first go at it, so I hadn’t seen anything like this. I thought it was pretty unique. So the ability to automate account setup and team setup and all of that with all of our various accounts I thought was very cool and something that I hadn’t seen or heard about anywhere else. So I thought it was pretty unique and thought it’d be an interesting thing for people to hear about.

[04:08] So what is this thing that you’ve set up, Tom? Can you tell us more about it?

Yeah. So our company – just a little bit of maybe background as to why is this important or why does this matter. So our company, OpenSesame, is about a mid-sized startup company - a more startup type company, I should say, startup-like. We have about 150 employees, I think, and about 50 of those are engineers. And like most startups, when we started out, we were one team, and we were one team working on an initial product. And then as the company grew, eventually, the company split the engineering division into two different teams. And then eventually, we decided to split into seven teams, doing something that’s called the Inverse or the Reverse Conway Manoeuvre. This is something that, for people who have done it before, it might sound familiar, but it’s basically the idea that as the system scales out and grows up, eventually it becomes too complex for one person or for any one person to be able to explain or comprehend the entire system.

So Melvin Conway - and I’ll include a link to some of his writing in the show notes - he basically talked about how, in his observation, organizationally - I think in the 1960s - systems and the people who work on those systems tend to pull each other into alignment. If you think about it, it’s kind of like, as you’re working on the code or working on some part of the system, you talk to your immediate teammates about changes and factors there, but you tend to talk less frequently with people outside of your team about the same changes on those parts of the system.

And so in his theory, it was possible to leverage that force - he calls it the isomorphic force, or the homomorphic force, I’ve seen it called - where the people in the system will pull each other into alignment. So if you want to have a system architecture that is a certain way, one of the best ways to do that is to form teams that mirror that architecture. Like in our case, we’re a commerce platform, we are selling e-learning courses to customers - you can have a team that’s maybe focused around the commerce side, or that’s handling payments, and a team that’s focused around the catalog side, that’s serving up what is in our catalog, what is in our offerings. That’s what Gunnar’s team does.

And so as the system scales out and you have more and more teams, you also run into challenges around trying to govern those teams, or have compliance, or do basic setup of those teams and access control; all these things that will probably be really familiar to people who like security and those kinds of concepts. And so that’s why we brought in the concept of the Bootstrap, which is this project that we designed.

Other places that I have worked had some similar things, but I’ve built on top of that with some colleagues and managed to create something where we can effectively do lightweight governance and control of the various parts of the system that need to have that for auditing and compliance purposes and accountability to our stakeholders, but also in a productivity way too, where we’re able to deliver some resources and things to people that are maybe otherwise complicated to work with.

So just to summarize all of this, was this an initiative, a way to organize yourselves better, so that 50 people are more efficient, it’s more visible what you’re working on, it makes more sense, you’re not duplicating effort? What was the primary driver for organizing yourselves better?

Yeah, I would say that’s absolutely what it is. What was the primary driver?

I mean, I think, ultimately– so I’ve been at OpenSesame for eight years now, and we originally started with a team of six or seven. This is the standard startup growth. We had a team of six or seven, we grew to 10 or 11, and you started to run into 10 or 11 people in one sprint, one backlog, getting kind of hectic, and you’re working across different aspects. There’d be days where we’d be working in Fastly, in the VCL, and there’d be days where we’d be writing Angular front-end code. And so we split into two teams at that point, to have a little bit more focus. And then when we got up to 25 people or 20 people, we saw the same need, to be able to focus on specific areas that were important to the business.

[08:11] So Tom mentioned the commerce side. We have a large legacy platform that we’re looking to improve and move away from and refactor out of, and trying to have a big group in one big team or two big teams focus on that was going to be ultimately pretty difficult, so we picked how we wanted to structure that and how we wanted to support that… And so we kind of split out into multiple different teams: one kind of focusing on our API integrations with our customers; my team, the catalog, to own the whole commerce system in terms of showing the courses, searching the courses and some of our other platform features; and then another team to own the payment side and some of the unique ways that we handle, on the backend, how we pay our publishers and things of that sort, because it’s a marketplace. And then having a DevOps team to support us in helping us have the right platform, and then a core team to help set some patterns, and they also own some legacy architecture.

So it was really to enable us to have distinct focuses - and we’ll maybe get into Team Topologies, which is a book that we all kind of read as a leadership team, and some of the engineering managers have as well. It talks a lot about the cognitive load of teams, and you want to keep that as low as possible, so that people can focus on their stream-aligned work and what they’re trying to accomplish. So that was our way of whittling that down - trying to keep that cognitive load as low as possible, so that we aren’t working from Fastly to Angular all the time.

So did you find that splitting the engineers in terms of areas of competency was something that helped, in that people that were maybe doing more opsy work naturally formed a team, and so did the ones dealing with Angular? Is that what ended up happening, or was it a bit more complex than just that?

I would say that’s not the intention. It’s more complex than that. The goal really of, I think, DevOps as a philosophy is sort of the antithesis of what you described. The goal is not to put all of the networking specialists on one team, all of the frontend specialists on another team, and the backend on another. I mean, that does happen. But the Team Topologies book, which is a great book - and we’ll include a link to that as well - talks about, particularly, the different types of teams, and it mentions that there are four fundamental types of teams. I’ll mention two of those. One is the platform team, which I think everybody is somewhat familiar with who’s worked in software. It’s something that you can build on top of, that makes your life a little easier. And the other is a value stream team, which is what most people probably work in. It’s the most common type. So you have a particular amount – or a particular revenue stream for the business, or a particular value add that you deliver to the business, and you form a team around that particular component. So Gunnar being the Catalog team, or my team being the DevOps team - it’s like a platform for that team.

So I think that the idea for most companies these days, if they’re following the DevOps philosophy, is you try to make teams self-sufficient, to the extent that they can be, with the exception of maybe the platform. So you’re self-sufficient on the platform that’s being built as much as possible. The point of creating the platform boundary, or creating these components that are bounded contexts, is to reduce that cognitive load. So as a catalog person, I don’t have to think about what payment processing looks like. I don’t have to worry about PCI compliance, payment card industry stuff. I don’t have to worry about dealing maybe with customer-facing information and how to secure that properly within my system. I just have to worry about how to serve up the list of courses we have in the catalog to a customer, so that they can make an informed purchase. And I have found that really helpful, both personally, as an engineer, and just conceptually, as a person trying to design within the system.

[12:01] We use the term, in our company, domains. Not like a DNS domain, but like a product domain. And so each of those teams that we’ve mentioned has one or more domains that are associated with it, that they’re shepherding, that they are owning, or stewarding, is the term we use… To care for that domain, and to give it the investment and the care and the development and the operations and security work that it needs to be successful. And so a lot of creating these bounded contexts is both for the cognitive load, it’s for making access control easier on our end… My team, the DevOps team at OpenSesame does a lot of the access control design, so that we can have third party contractors and restrict their access to be very limited in scope, and then have different levels of what is a developer versus an operator, are some of the terms that we use. So yeah, it’s neat.

But yeah, I mean, ultimately, there are a lot of long conversations that we had as a leadership group with some engineers about what are the domains, what are the features, how are those grouped, things of that sort, and then trying to find larger teams that would own those. So we created this ownership model of this is the domain, these are the features or the areas that they own. And that’s something that’s very fluid. We learned right away when we did that, it wasn’t quite right. There were still a lot of questions. And as we’ve moved forward over the past couple years, it’s gotten better, but there’s still changes, as people get different – or we realize boundaries aren’t where we actually thought they were, and things of that sort. So that was ultimately how we went about getting those domains.

Listening to both of you makes me realize that you’re definitely on the right side of the spectrum. I mean, it is a spectrum after all. It’s not like you’re right or you’re absolutely wrong. You’re somewhere in between. But I see you being on the right side of the spectrum. And the reason why I say that is because - you’re right, DevOps, as it’s practiced at many companies, is wrong. You have the networking experts, the DB experts, and even the platform. The platform is almost like a function that people need to have a good interface into and understand how to consume. But ultimately, it’s just an API. And if you think about things that way, it really helps break down the complexity into nicely bounded contexts that everyone can interact with. And things are simple, in that you don’t need to be an AWS expert to figure out why things don’t work in a certain way. And I think that’s where you come in, Tom, right? That’s where the platform team is meant to abstract some of that complexity, hide it behind a nice, simple API - in my mind, that’s what I’m imagining - that is documented, that is well understood, that is well spread through the company, and is consistent, long-term.

Yes, you’re definitely correct in that. And I think a really good example of that, that Gunnar wanted me to mention, is we deliver that as part of our package. So when you sign up a team with us and you say, “I want to start a new team” or “I want to start a new domain”, you come to my team and make a request in our Slack channel, you give us the name of what you want to have created, and in that bounded context we’re delivering a set of resources in all of our cloud providers. So you say you want Fubar as your new domain, or something like that - we’re going to go and create Amazon accounts; we’ll have a development account, a staging account to do load testing and other types of testing, and a production account. And so each of those accounts follows in line with Amazon best practices, and they have white papers written on why you should use accounts as your bounded context box to put things in… And it helps with what we refer to as the blast radius. I mean, if I have somebody who maybe makes a mistake, deletes a few too many things, it doesn’t tank the entire system. It’ll only hurt that one account. And separating out development and production is something we all know is a good idea as well.
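To make that concrete, here is a rough Terraform sketch of the kind of per-environment account creation Tom is describing. The resource layout, emails, and organizational unit variable are illustrative assumptions, not OpenSesame’s actual code.

```hcl
# Illustrative only: one AWS account per environment for a new domain, created
# under an AWS Organization. Names, emails, and variables are hypothetical.
variable "domain_name" {
  type    = string
  default = "fubar"
}

variable "workloads_ou_id" {
  description = "Hypothetical organizational unit that holds workload accounts"
  type        = string
}

resource "aws_organizations_account" "environment" {
  for_each = toset(["dev", "stage", "prod"])

  name  = "${var.domain_name}-${each.key}"
  email = "aws+${var.domain_name}-${each.key}@example.com" # each account needs a unique root email

  # Keeping each environment in its own account limits the blast radius:
  # a mistake in dev cannot delete anything that lives in the prod account.
  parent_id         = var.workloads_ou_id
  close_on_deletion = false
}
```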

But then in addition to those accounts, we also go a little bit further. And so as a piece of expertise, my team has some networking expertise in it, and creating networking is difficult in the cloud, many times… And so what we do is we deliver a VPC, as a package, as part of our account structure as well, so a networking stack. It has a VPC in it. It has one or more subnets in it. Each of those subnets has a NAT gateway, so you can have private servers, private application servers behind the NAT gateway for extra security; it has an internet gateway for connectivity. We teach people about how to use security groups. They’re set up with multiple availability zones for redundancy, so that people don’t have to think about that, and it’s just like, “Yeah, you have a server that’s in the cloud, that’s running? Great. Bring it to our VPC, plug it in with these subnets in the Amazon UI and you’re off and running”, and you don’t have to think about networking anymore, other than if it breaks, which we have health checks for.
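A minimal sketch of what a packaged networking stack like that could look like in Terraform, assuming hypothetical CIDR ranges and availability zones; route tables and security groups are omitted for brevity, and none of this is the actual Bootstrap module.

```hcl
# Illustrative networking package delivered into each account; names and CIDRs are made up.
variable "domain_name" {
  type = string
}

locals {
  azs = ["us-west-2a", "us-west-2b"] # multiple availability zones for redundancy
}

resource "aws_vpc" "this" {
  cidr_block = "10.42.0.0/16" # assigned centrally so domains do not collide on IP space
  tags       = { Name = "${var.domain_name}-vpc" }
}

resource "aws_internet_gateway" "this" {
  vpc_id = aws_vpc.this.id
}

# Public subnets hold the NAT gateways; private subnets hold application servers.
resource "aws_subnet" "public" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.this.id
  availability_zone = local.azs[count.index]
  cidr_block        = cidrsubnet(aws_vpc.this.cidr_block, 8, count.index)
}

resource "aws_subnet" "private" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.this.id
  availability_zone = local.azs[count.index]
  cidr_block        = cidrsubnet(aws_vpc.this.cidr_block, 8, count.index + 10)
}

resource "aws_eip" "nat" {
  count  = length(local.azs)
  domain = "vpc"
}

# One NAT gateway per AZ, so private servers keep outbound access even if an AZ fails.
resource "aws_nat_gateway" "this" {
  count         = length(local.azs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}
```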

[16:37] So on our end, we try to run it as a service, where there’s nothing that the customer has to really spend a lot of time thinking about. That’s kind of the goal on our end; as you said, it’s almost a team API, not just a code API, and that’s where I think the Team Topologies book refers to it as this team API - what do I need to do, and what do you need to do, to be successful.

Yeah. When we first split out into multiple teams, I think – it’s not like, “Oh, well, you build on top of the platform, so you can end up waiting.” Every team needs to continue to move forward. And so as every team’s got their own domain, they’ve got their own package of resources from Tom’s team, and everyone started to build. And over time, we started giving feedback that this networking aspect of it is complicated for people, or it’s going to be different for every team. And it’s an area where it’s complex, but it also needs to be somewhat coordinated, to avoid IP collisions, for security reasons, things of that sort.

And so that was some feedback that we were able to give to Tom’s team, and they were able to come out with the VPC module for us all to adopt. Certain teams had already structured their own VPCs and were able to migrate over, and some hadn’t quite gotten to that point yet - they weren’t building in their new AWS accounts yet - and so they were able to just adopt it from the get-go, which was super nice and simple. It was a way that we could get that benefit and provide the feedback on what was actually a struggle for certain teams, as we don’t have AWS– not all of our engineers on the team are AWS experts. We have a couple here or there.

But then I do remember when it broke - and that’s why we have the health checks now - when it was initially rolled out; I think someone deleted something, or a team couldn’t access resources anymore, and Tom’s team was quick to fix it and push that fix out to everyone. I think that’s what’s also super-cool about this: because it’s all in Terraform, it just gets shipped out to all the teams as they wrap it up, and then they Terraform a Lambda into everyone’s account that is kind of phoning home - I call it the ET Lambda - to make sure that the network access between different areas is working properly. And so it was kind of a cool story. That was when I was like, “Whoa, this is pretty neat, to be able to do this and solve all these problems for everyone.”
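As a rough illustration of that “phone home” health check, shipping a scheduled Lambda into every account via Terraform might look something like this; the function name, runtime, schedule, and packaging are assumptions, not the real thing.

```hcl
# Illustrative only: a scheduled Lambda, deployed into each account, that checks
# network reachability and reports the result back ("phones home").
resource "aws_iam_role" "health_check" {
  name = "bootstrap-health-check" # hypothetical name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_lambda_function" "network_health_check" {
  function_name = "bootstrap-network-health-check"
  runtime       = "python3.9"
  handler       = "check.handler"
  filename      = "${path.module}/lambda/check.zip" # hypothetical packaged handler
  role          = aws_iam_role.health_check.arn
}

# Run the check on a schedule so a broken route or NAT gateway is noticed quickly.
resource "aws_cloudwatch_event_rule" "health_check_schedule" {
  name                = "bootstrap-health-check-schedule"
  schedule_expression = "rate(5 minutes)"
}

resource "aws_cloudwatch_event_target" "health_check" {
  rule = aws_cloudwatch_event_rule.health_check_schedule.name
  arn  = aws_lambda_function.network_health_check.arn
}

resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.network_health_check.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.health_check_schedule.arn
}
```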

So I’m hearing a lot of complexity that was captured in the right– I won’t call them boxes, I would call them maybe areas, and that you push it and pull it and tease it apart, so that you don’t have to worry about all the things all the time. To me, that sounds like a very healthy group of people knowing how to divide the complexity, which is great to hear.

[20:07] But the other thing which I’m wondering - I mean, actually, understanding this, it makes me wonder, how does it actually work? So when a team comes in your Slack channel and asks for a new set of accounts to be spun up, or just an account to be spun up, what happens behind the scenes? You go and you click some buttons, you add some configuration, you run a CI… What does it look like? Can you run us through that, Tom?

Yup. That’s it. You’re hired. You got it.

That’s all the parts right there.

That simple? I don’t think so.

Yeah. It’s that easy. Yeah. Genuinely, it is that easy. There is essentially a configuration file that – we have a repository that represents the whole of the existing Bootstrap project. We’re probably going to try to publish a demo version here for folks to look at, open source style, just as soon as we figure out how to do that.

Okay. I’m looking forward to that. Okay.

It’s a lot of pieces, but none of it is particularly proprietary. So we just need to clear that all with our legal team, and stuff. But we’re using Terraform, which is of course HashiCorp’s infrastructure as code language, the idea being that you want to have the state of your infrastructure captured just as well as your development code is. And our goal is to make it so that basically everything that we push in this layer of this account structure, this platform layer, is delivered via Terraform.

So we’ve got our repo– inside of our repository, we have a configuration file, which has an array of objects that represent each of our accounts, or all of our domains. It’s really the domains, it’s not the accounts. So it lists out the name of each domain and the types of environments we want to create for each - so a development or a dev, a stage, or a prod, all of those things. And then when you run it, the Terraform is structured with various modules. If you’ve worked with Terraform, this is going to sound really familiar, but there are various modules and things; one of them, I think, is an account module, basically, for Amazon. And so it’s going to provision a new Amazon account, it’s going to go and do things like set the support plan to be the correct tier of support that we ask all of our production accounts to have, which is I think the four-hour response time, or whatever it is, or the one hour; I forget exactly what we use. And so then in addition, it’s also going to New Relic and it’s setting up accounts in New Relic and provisioning API keys there.
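A sketch of what that kind of configuration file and account module wiring might look like in Terraform; the object shape, module path, and which steps the module performs are assumptions for illustration, not the actual Bootstrap code.

```hcl
# Hypothetical shape of the Bootstrap configuration: one object per domain,
# listing which environments (accounts) it should get.
locals {
  domains = [
    { name = "catalog",  environments = ["dev", "stage", "prod"] },
    { name = "payments", environments = ["dev", "stage", "prod"] },
  ]

  # Flatten to one entry per domain/environment pair, e.g. "catalog-prod".
  accounts = {
    for pair in flatten([
      for d in local.domains : [
        for env in d.environments : { domain = d.name, environment = env }
      ]
    ]) : "${pair.domain}-${pair.environment}" => pair
  }
}

# One account module call per pair; the module would create the AWS account,
# set the support tier, provision the New Relic account and API keys, and so on.
module "account" {
  source   = "./modules/account" # hypothetical module path
  for_each = local.accounts

  domain      = each.value.domain
  environment = each.value.environment
}
```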

We do a lot of work with– and many people who have worked at larger companies are probably familiar with a concept of like a service account user, or a machine user. We call them faceless users. It’s a user who has no face, is how I was taught; sitting in a chair. And so the point is to say, “I want to connect my GitHub actions back to Amazon to let it do deployments.” So we provision a user and a password for that, and we put that password in GitHub, and in a secrets manager, so that people can start using it in their CI flows. So yeah, it basically runs through a GitHub Action, CI/CD, and does a deploy, and then all of these resources, several thousand of them, spring into existence and start doing what the teams need them to do.
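A sketch of how a “faceless” user and its credentials could be wired up in Terraform; the variable names, secret names, and the use of the GitHub provider are assumptions for illustration, not the actual setup.

```hcl
# Illustrative "faceless" (machine) user per domain, with its credentials pushed
# to GitHub Actions and AWS Secrets Manager so CI can deploy. Names are hypothetical.
variable "domain_name" {
  type = string
}

variable "github_repository" {
  description = "The domain team's repository that runs the CI deploys (hypothetical)"
  type        = string
}

resource "aws_iam_user" "deployer" {
  name = "${var.domain_name}-ci-deployer"
}

resource "aws_iam_access_key" "deployer" {
  user = aws_iam_user.deployer.name
}

# Put the credentials where the team's GitHub Actions workflows can read them.
resource "github_actions_secret" "access_key_id" {
  repository      = var.github_repository
  secret_name     = "AWS_ACCESS_KEY_ID"
  plaintext_value = aws_iam_access_key.deployer.id
}

resource "github_actions_secret" "secret_access_key" {
  repository      = var.github_repository
  secret_name     = "AWS_SECRET_ACCESS_KEY"
  plaintext_value = aws_iam_access_key.deployer.secret
}

# Also keep a copy in Secrets Manager for other tooling.
resource "aws_secretsmanager_secret" "deployer" {
  name = "${var.domain_name}/ci-deployer"
}

resource "aws_secretsmanager_secret_version" "deployer" {
  secret_id = aws_secretsmanager_secret.deployer.id
  secret_string = jsonencode({
    access_key_id     = aws_iam_access_key.deployer.id
    secret_access_key = aws_iam_access_key.deployer.secret
  })
}
```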

So GitHub Actions runs the Terraform? Is that right? The terraform apply.

That’s right.

How do you know if it’s not going to delete anything? How do you know when it converges what will happen?

Yeah, that’s a great question. It’s actually one that we’re doing some active work on right now in my team to better understand how– I’m going to extract your question a little bit higher and say, how do you do quality assurance on something like Terraform, where you don’t actually know what’s going to be delivered by the Terraform until after you run it, in many cases? If you’ve ever seen the Terraform plan, it’s that block that says, “Known after apply”, meaning “I don’t know what I’m about to do, but I’ll let you know when I’ve done it.”

[23:44] In our team, we use a couple of different things to try to combat that. The biggest right now, which is kind of manual, is we run a Terraform plan in the cloud. So we run it through GitHub Actions. I’ve seen some of your sample code that you’ve published for your listeners… So we’re using a workflow dispatch hook, basically, in GitHub Actions, which is an on-demand, run this thing, and put the results in my PR. And so we’re using that to give us a Terraform plan. And one of our engineers has nicely made an aggregator that takes in and dumps all the plan details into one comment, so that we can look at them easily and say, “Okay, we have 130-some different plans that we just ran; what are the changes that are actually contained in that?” And it can be tens of thousands of lines. So we try and look at them. We look for the patterns in that and say, “Okay, we are pretty confident this is going to apply correctly.”

And then in addition, we actually have a staging environment that we use. It has a set of representative accounts in it. It’s not one-to-one with the production environment, because it would be really expensive to create resources for all of that, but we do have a representative set of all of the features that we have. So I mentioned, for example, the faceless users. We have one of each type of faceless user that we test out and make sure that they deploy correctly. And then in our deployment pipeline we merge, and then the deploy runs, and the deploy will deploy to the staging environment first. And if it crashes there, it will halt the deploy, and then we take a look at it; we get notified in our Slack channel. And then same thing in production - if the production deploy fails for some reason, we get notified; we can go in and troubleshoot it, that kind of thing.

Has it ever happened that something worked in staging, and then when you went to production, it failed? Has it happened in the past?

On this project, I’m not sure. But I would say probably.

Okay. Because I’d be very curious to know what you did when things failed. That is a very interesting question to me when it comes to the resiliency perspective of a system. How resilient is it? How did you handle that? What happened? What did you learn when things failed? It’s one of my favorite ways to learn, which is why I delete things left, right and center, just to see what will happen, including taking the website down in production. “Does this thing even work? Why not? Will the CDN server–”

Do some chaos engineering.

Exactly. Yeah. But in this case, for fun. Not as a monkey, for fun; like, “What will happen?” Exploratory.

I’m trying to remember, but I remember hopping on a call with you… This might rattle your brain. We hopped on a call - I think you were with Kira, when she was a DevOps intern with you all - and the deploy in production failed, but it succeeded in staging. You ended up debugging it. You essentially pulled down the state on your local machine and did some surgery, some Terraform surgery, to figure out what was going on. Do you remember this instance and what that was? Because I feel like that’s what Gerhard was asking.

I vaguely remember this, and I don’t remember the logistics of it, but that is exactly the thing I was trying to recall. Let’s see… What do you do when the deploy passes in staging, but it doesn’t pass in production? So I’m going to speak generally for a second, while I try and recall more specifics… In my experience, when the staging deploy passes but the production doesn’t, it’s usually a sign of something that’s a stage-prod parity issue. There’s usually something that’s not equivalent to production in the staging environment. And it’s one of the reasons that we have the concept of staging within the Bootstrap, and not just development and production.

I’ve had numerous times working as an application developer where we wrote things – I used to work with Django a lot in Python, and Django has an ORM built into it (object-relational mapper). And those are notoriously non-performant for certain situations. So then you need to write some hand-tuned SQL to make it work for getting your objects out of your database.

And so what I’ve had happen in those situations is it worked great in development when we were working on a fixed set of data. And when we tried to run it then at scale, it would completely tank the production system. And so staging in that particular application team, where I worked at a previous company, was designed to be a scale test. So we had a fork of production traffic that we would sanitize, and then we would run all of our migrations and all of our code against it, and our end-to-end tests, to make sure that all of our API endpoints were still responsive at scales of tens of thousands of records, rather than just 10 records.

Yeah, that’s right.

[28:06] So usually when I run into those kinds of situations, we ask ourselves, “What was different about staging that was not present in production? And how do we update our staging system?” In Terraform’s case, when Terraform breaks, if something is actually busted in production, it really depends on how it broke in order to fix it. If there is a server that’s created that shouldn’t be there, many times you have to go and manually remove it or clean it up. And so we try to avoid having really large cleanups by doing all of that scale at the stage level. And thankfully, in the time that I’ve been using the Bootstrap, we’ve never had a massive outage that couldn’t be corrected with just a small Terraform update of some kind, or a reapply. That is the benefit of Terraform, right? You can just rerun it and it will generally fix most things. But yeah, I’m trying to think of a more specific example, Gerhard, for you.

How long have you been running the Bootstrap project for?

About 18 months, I want to say.

Okay. And are all the domains, at this point, migrated to Bootstrap? Apart from the legacy system that Gunnar mentioned. I imagine that must be still running on the previous setup, whatever that was.

Yes and no. Actually, they all use the Bootstrap, in the sense that the Bootstrap is the thing that gives them life. Most of them didn’t exist before this, because it was a pain to set up lots of new accounts, and the Bootstrap made it easy. I tend to be one of those people that is, “Let’s try the process, but then let’s get the automation in place that makes the process worth doing at scale”, a lot of the time. That’s, I think, why people hire expensive DevOps engineers, is to automate a lot of this, and I used to be a sysadmin, so I like scripting lots of things.

Yeah, that makes sense.

Yeah. Even the legacy– I mean, the only thing that’s not… In some cases – this is actually, I think, a really good topic… How do you take an existing domain or set of resources, let’s say, and port them over to use a piece of the platform that you’ve just enhanced in some way? Maybe another small, good example of this is in our New Relic setup. When we started, we had one account in New Relic that everybody lived in, and it had a single set of API keys that everybody was using across all the domains. And then we said, “Well, we want to separate accounts for that as well, and have dev, staging, and prod for each of the domains in there.” So how do you migrate people out? And the answer to that, I think, from the Team Topologies book perspective, is you do some work that they call enabling work. Google calls these people site reliability engineers, and the way that they run their teams is you have an enabling team get together with your value team and work on a project for a period, and then they move on to the next project. So in baseball, they’re like switch hitters. Or in American football, it would be special ops, or special teams. You bring them in when you need a specific thing done - a specific play they run - and then they walk off the field and they go back to the bench.

We do a lot of that work as well as a platform team, just because our company isn’t large enough to where we have separate platform and enabling teams. But we’ll go and we’ll work with Gunnar’s team, or whoever’s team needs to be migrated over to something, and we will figure out where they’re currently sitting, what resources they’re using, and how to move that over to our Terraform stack, whether it’s networking, or changing the API keys around, or giving advice on how to best implement something as far as a workload that needs to be run - whether to use Lambda, or Fargate, or Kubernetes.

We do office hours twice a week, so people come and they can ask us questions. I really, really like that pattern; it’s been really successful at every company I’ve seen it implemented at. So for platform teams especially, if you’re struggling with a way to connect with your customers, your internal customers, and tell them about your platform or expose them to parts of it in that team API, office hours can be a great way to do some hands-on training and learning about the thing that you’re building for them, and get them to use it.

How did the office hours work for you, Gunnar?

I mean, they’ve been awesome. It’s two times a week, right Tom? And our engineers sign up regularly. There are four slots, 15-minute slots - it’s for an hour - and our engineers go there every day. They’ll go in and register a slot like, “Hey, I’d like to talk about this”, or “Hey, I’d like to get some advice on this.” And so it’s been a great way for a lot of engineers to get other insight into what’s going on, and maybe why certain things are going on or certain things are being built, and talk about different architectures.

[32:28] It’s also a place where we will have a team go and talk to DevOps about like, “Hey, this is how we’re thinking about building this certain thing. Does this check out for you? Does this make sense?” There’s a lot more AWS expertise of different tools or patterns and things of that sort, that not everyone on my team or another team may be aware of. So it’s a good spot to just talk.

I think that’s what’s important about what Tom is saying - so much of enablement work is communication. That’s like 90% of the work, rather than the actual coding or the configuration that you’re doing. It’s trying to either 1) get people on board, that “Hey, this is coming down the line. This may mean some work on your team, because we’re going to need your help, or you’re going to need to migrate onto this”, and also “Why is this important? Why are we doing this? Does it save us money? Is it quicker for us? Does it save a bunch of time for certain people?” So a lot of that communication is a big part of the work, and just making sure people are aware, understand, and things of that sort. Because even if you build a new, shiny thing, or you configure a new thing, and people don’t know how to use it or why, it’s going to sit in the corner and build up a bunch of cobwebs. And so doing that communication is super-helpful. And we’ve had a lot of scenarios where, on my team, we built certain things or picked up certain stuff and have gone to office hours with Tom’s team and come back with either different ideas, tweaks, things of that sort, which allows us to work more as a whole team instead of just in our individual silos. We can kind of spread that knowledge.

I think that’s another great point about office hours, is that a lot of times - getting back to the Inverse Conway Manoeuvre, the point of that exercise, of taking Conway’s law and saying, “We’re going to lean into the structure of the teams representing the structure of the system we want.” When you go from one team to N different teams, now you create silos of knowledge. And if maybe one person is really good at using Fargate on Amazon, but they don’t understand Lambda, and the task they’re doing really would be better suited for something like Lambda, they won’t know that. And so Office Hours is a way that we combat some of that siloing by having people come to us and say, “Yeah, this is what I’m thinking of doing with this”, and we say “Well, did you know that there’s this thing called Lambda that you could try, or this thing called Kubernetes that you might want to look into?” I don’t think I’ve ever recommended somebody look into Kubernetes in Office Hours, though. It’s a little bit maybe out of scope of the amount of work they would want to manage.

But yeah, I think it’s a great way to connect with people, like Gunnar said, about the ways in which they’re trying to ship their product from an infrastructure perspective, using the people who are experts at infrastructure at the company. Because you’re not going to be able to fund a dedicated networking specialist, DBA, or DevOps engineer for your team, potentially… Having dedicated resources per team is really expensive. So they tend to be somewhat centralized still, even today, but you want to use them and leverage them as if they were part of your team. So we do a ton of relationship building in my team, and we use that goodwill that we build up to accomplish some of the company objectives as well.

With this pattern of having these Office Hours and having a team that’s a group of experts in an area, I think you can run into a scenario where this group is almost treated like the architects - where, because they’re the experts, when the team goes to them with “Hey, this is my idea”, and Tom’s team recommends, “Hey, have you thought about doing this with Lambda?”, or a bunch of new ideas, they go, “Oh, wow. Okay. I didn’t think of this, but that might take us six weeks to do, and we only have a couple weeks here.” And so that becomes this push and pull. And I talked with Tom a lot about this.

[36:03] The analogy I use is the teams are like a food truck that sells hamburgers, and they know hamburgers, they give it to their customers. That’s really great. But they start hearing from their customers that they want a healthier option, so they go, “Okay. Well, I’m going to do veggie burgers.” It’s still a hamburger. They can add a veggie patty, it’s a little bit healthier. “Now I’m going to go do that.” And so they go to their supplier, the DevOps team, Tom’s team, and they go, “Hey, I’m looking to get some veggie burgers, because I’m looking for a healthier option.” And what sometimes can happen is the DevOps team can go, “Oh, I don’t know if you want a veggie burger. That’s really not that much healthier. Maybe you should do a salad.” The truck owner then can go, “Oh, crap. Well, I don’t really know how to make a salad. My truck’s only set up to make burgers. I don’t know if my customers would want salads.” This is the expert. They know what’s healthier or things of that sort, so they kind of freeze and don’t know what to do. And so it’s always important with this that the teams, the hamburger trucks are enabled to do what’s best for them. It’s advice, and not saying, “This is the right way to do it”, or something.

Right.

“You know what’s best for you. This is a suggestion. Have you thought about this? I haven’t thought about salads, but I think veggie burgers is the right way.” “Cool. You can go do this, but let’s set up a time to talk about salad-making some time, and figure out what that is.” So that’s something that I have seen happen a few times, with especially younger engineers that are earlier in their career, who kind of go to the team, see information from an expert, and then go, “Oh, crap, I’m doing it wrong.” It’s that imposter syndrome, when really you’re doing what’s right. You’re proposing what is familiar to you. And sometimes it’s okay to do that, without having to kind of rewrite everything or embark on a much larger project.

Yeah. I have so many questions now…

Yeah. Yeah.

I don’t think we have enough time just for the stuff that I want to ask, never mind the other things. So I’m wondering, how much of this do you write down? How much of this do you capture in a way that can be accessed after the conversations have been had, so that other team members can go back and maybe try to understand your reasoning behind it, or so that you can go back and see how you can improve certain approaches? It seemed like a good idea - why did you think it was a good idea? Because right now you don’t think it’s a good idea. So what led you to believe that in the past? And I’m wondering, how much of this do you write down, do you share, in a way that can be searched, that can be accessed at another point in the future, and that helps spread information in an asynchronous way? Because synchronous is really, really tough, with phone calls and meetings.

Yeah, we’re a remote-first company, so we try to work very asynchronously as well. Your question is specifically around how do I take the knowledge about what approaches worked with teams, or is it around the technical, like what technical things worked with teams?

It’s everything, really. It’s how do you communicate that information as an expert? You learn something new - how do you communicate it with everyone? Because Office Hours - those are limited. The time you have for that meeting is limited. How do you communicate efficiently for a remote-first team?

I will mention a couple of things that I can think of as practices that help with us. We do write a lot of things down, but we all also spend a lot of time together as a team. So in my team, aside from Office Hours, we have many conversations, because being a platform team, you tend to have - like Gunnar was mentioning - influence about the roadmap and the design of products across many different teams, and also you have lots of connections with those teams that you can leverage to help accomplish company objectives. And so we do use Confluence to document many things.

For technical things in my team or different approaches that we have, we will create what we call a decision document, which is maybe a pretty standard kind of thing of what you’d expect. It captures the details of what was the problem, what was the potential solution space, what are the constraints of that problem, and then what was the decision that was made around that. We use that for a lot of different technical things, but occasionally also if there’s a particular approach that’s worked well.

My team also really loves retrospectives, and so we don’t just do our biweekly retrospective like most agile teams do - after many of the big conversations that we have with customers or stakeholders, we will also have a mini retrospective. Our practice is to have a Zoom room that is always available to the team. So we use the standup that we have every day at 9:00 AM - we have our Zoom room there, and we just leave that room on throughout the day. Because we’re remote-first, we don’t have that idea of, “Hey, can I find people at their desks and talk to them about what just happened in that conversation with a customer?” Instead, we use the Zoom room. So if you’re not doing focused, heads-down type work, if you’re just sitting at your desk doing some normal work, we will all be in the Zoom together, and someone will say, “Hey, can I borrow people for five minutes to chat about what just happened?” We call it micro retrospectives. It started out with micro-grooming, or micro-refinements, which was looking at tickets one at a time, and doing a little bit of that process.

So all of our ceremonies have become sort of micro-ceremonies, as time goes by, and we reflect on what went well, what didn’t go well about that conversation, and how can we change our approach. As Gunnar mentioned, there have been times with some of the teams where we maybe gave them too many options or too many things to think about all at once, and we’ll talk about that as a team and coach each other on how can we be better at being attuned to what the customer is asking, and make sure that we don’t under-deliver, but also don’t over-deliver what they need to hear about the situation.

I like that you asked that question, Gerhard, because it makes me think of like, how could we do Office Hours better? Because there are a bunch of artifacts from Office Hours in those decisions. And generally, the way that we’ve handled that communication has been the catalog team comes to you, asks a question around something, and then the delivery team comes to you the next week or the next month when they run into it, with the same question, you go, “Hey…”

And we have four of the same conversation.

Right. Yeah, four of them. But it’s also being aware, from this standpoint, to send them to each other to talk - “Hey, work together and see what you should all accomplish; because since the catalog team last talked with you a month ago or something, now they’ve learned things through that that they can share with delivery.” And so trying to create those connections between the teams when you see that they’re doing similar things as well. I don’t know, that’s a takeaway I have to maybe think about a little bit more - how can we document that and make it easier.

And we use a lot of those conversations when we find multiple teams are working on the same project, to change our roadmap and attune it. The VPC one was a great example where we knew multiple teams were trying to create new services that needed networking stacks, and we had it on our roadmap to build a standardized VPC to offer people someday as a feature of the platform. And we reached out to those teams and said, “If we pulled that work in, would that save you time?”, and they said, “Yes.” So we did it in about a week and a half, and we just really quickly put together this networking stack that was kind of based on some past resources we’d used, and some open source Terraform modules that are out there, which I’ll also give a link to. They’re really great, and stuff. But yeah, it worked out really well to be able to have those conversations and have that information about what is each team working on, and how can we deliver value based on what they’re currently trying to execute.

I’m wondering, all this, when it’s put together - the process, the conversations, the expertise, the delivery of the solutions, where you just go in and say “I want a healthy burger”, and I say, “Have you thought about the salad?”, and they give you that… I really like that analogy, by the way. I’m wondering, how does this reflect in how a developer approaches having an idea, putting it in code, doing a git push - what happens afterwards? How does his or her job become easier because of all this work that you do? How does that improve?

I think the goal is to make it so that when that happens, there are fewer things that the customer has to worry about going wrong. In particular, thinking about the VPC - they should be able to trust that their networking stack is not going to fall over now if there is downtime in one of the availability zones; it should be able to withstand that, and they shouldn’t get an alert in the production environment because of that and be woken up in the middle of the night for an incident. They shouldn’t have maybe as many security scan findings that need to be remediated the next time there’s a penetration test done at the company across all systems.

When it comes to git pushing, are they aware of any of those things, or do they just git push and things are configured in a way that code appears in production? I’m still unclear on that path from git push - what happens, and how much of that does this enablement or the platform team handle for the customers, as you call them - the end users, the developers, the value team members?

Yeah, value stream team members. Yeah. That’s not something we’ve spent a whole lot of time working on yet, but it is something that’s on our roadmap to look into more: creating more maybe reusable or standardized pipelines for people, that they can kind of order off the shelf, as it were, to make some of their changes better.

I’ll tell you something that we’ve been working on really recently is containerizing much of the local development environment. As a whole, it’s been really difficult for engineers to spin up new laptops and to get a laptop to install all of the necessary software on it. Like, I’ve got to install Docker, I’ve got to install VMware, I have to install all these different tools… So what if, instead, we had the IT department provision some of those tools for you on your fresh laptop build, and then all you had to do was clone a repo and do docker-compose up or docker run and bring up a container that, further, had all of your Node tooling in it? Maybe it has npm pre-installed, it has Husky in it for doing all of your linting, and things like that. What if we could do that?

[48:24] That’s been something where we’ve been trying to publish some more example code and more example repos at the company. But as far as actual tooling around making reusable workflows and actions in GitHub, like GitHub Actions specifically, or analogously in CircleCI, making orbs that people could use - we haven’t done a whole lot with that yet. But that is I think a really cool practice that I’ve seen other places that I would like to do.

And you have the pipeline that gets the code into production… What does production look like? Because surely, it can’t be just the VPC. There must be a lot more to it for things to run. What does that look like for you today?

So it depends on the team. And this is where I think it’s important that– as Gunnar mentioned, we’re specialists in certain areas of Amazon, but we are not architects. We’re not there to tell you that you have to do it a specific way. We’re there to tell you what’s possible. The way I think I’ve written it down before is to tell you what’s permitted, what’s possible, what’s nonsense, or something like that. But our goal is to make it so that instead of being prescriptive in general, we try to make recommendations and then leave the decision in the hands of the people who know the domain best, because they’re ultimately the only people who can design the solution that’s necessary. They’re the only ones who fully know the constraints of their problem space.

Ultimately, as my team deploys on top of this platform with our AWS accounts, it’s given us the structure of the dev/stage/prod flow with our AWS accounts, which we didn’t have in our legacy environment, which was like two giant AWS accounts. And so for things that we’re developing maybe in Lambda, or using an AWS service, we’re building that directly in the dev environment and deploying it there. But for things where we’re actually deploying a Docker container, we may run that locally.

Every team now has some version of a GitHub Actions workflow that is: get the PR up, deploy it to dev, run some sort of smoke test, deploy it to stage, automatically kick off a smoke test from a GitHub Action, and then deploy to prod. And some teams have automated that fully - if they feel confident in their smoke tests and they pass, they’ll automatically promote it, and run an additional smoke test and verify things are good. And my team has a lot of legacy components that don’t have that safety net built in, so we do things a little bit more manually, so that we can check things out and verify stuff on the site before we deploy straight up to prod, and things of that sort. But it’s given us that workflow that we all can follow and point our Terraform at– each repo has, generally, Terraform that configures the resources themselves for that dev, stage, or prod account. And those are GitHub Actions that we’re able to share throughout all of our repos. And hopefully, with GitHub Actions reusable workflows, we can share them a bit easier than copy-pasting all the workflows everywhere, which is what you had to do for a while.

Okay. So CI/CD means GitHub Actions for you, mostly.

Mostly.

Terraform is there to provision the infrastructure and manage it afterwards. Okay, that’s interesting. So when a code–

Why do you say that’s interesting? Sorry, just to turn the question around for a second.

Okay. I was waiting for that. So the world that I imagine is one where basically you can see what is running in production at any point in time, and you know how production was provisioned. This is a very important one, and a very controversial one - you push straight into production. There’s no staging. Dev is local. The idea is to make the smallest possible change and get it into production as soon as possible. Getting something out there doesn’t mean that users get to use it; it just means that a small slice of the feature that you’re working on is there. If it’s a fix, you want to know within minutes whether it works or not.

So I try to optimize and encourage others to optimize for time to production. And if that is a few minutes, that’s great. The more stages you have, the more steps you have, there’s the environments that you have to manage, that you have to be aware of, that you have to upgrade, that you have to keep in sync. It’s a never-ending problem. And in some cases, you need to have that. It’s not an option to not have them. But if you can, not having them speeds up the learnings, speeds up the experimentation, speeds up figuring out what works and what doesn’t, because ultimately, it’s our end users that we have to serve, and there’s a lot of proxies until we get to the end users; the actual person putting a credit card and paying for whatever they’re paying for, whether it’s a physical good, whether it’s a service… So it’s that exchange that we have to think about. Delighting our users, whatever that may mean - it’s not always like they pay money; maybe they give you attention. And we know big companies that made a fortune doing that, captivating people’s attention.

What I’m trying to say is that the quicker you can get it in front of the users, the better off you are. And the fewer the steps, the fewer the checkpoints, the fewer the sign-offs, or whatever needs to happen. So what are the systems that we need to design so that it’s as easy as possible to get that value out to the end user? And I’m thinking minutes. And if you have dev, staging, and prod environments, there’s no way it can be minutes. It’s impossible.

I don’t know about that. I find that, depending on how you structure your pipelines, it can be minutes. So a great example of that: if you’re trying to deploy the entire stack with one push, then yeah, it’s not going to be minutes. If you have a really robust platform that you’re building on top of, then it tends to reduce some of the complexity.

So for example, we had – I think our Bootstrap deploy as of a couple of weeks ago was like 45 minutes, because we were trying to redeploy thousands of resources in one shot, and that was obviously unacceptably slow for us to be able to make change within a day. If each PR – you know, if you were to go with the maximum time, if each PR takes 45 minutes to deploy, then you’re constrained within an eight-hour workday to being able to run like 12 of those, basically. That’s not great, and that means you have an upper bound on how much change you can ship in a day. And so we went through, in an autonomous fashion, and parallelized how we do our deploys. We’re actually now using build matrices within GitHub to do the deployment itself, so we run each account as its own individual deploy job, which greatly speeds things up and makes it so that our deploys happen in about 90 seconds apiece, as opposed to 45 minutes end-to-end.

That is a huge win, huge win. Okay.

Yeah, it was really great.

Wow, amazing. Okay.

So I think especially with Terraform and things like this, having layers where we say “This is platform; this is then another layer that’s maybe team specific, that’s like, these are resources that change infrequently. Maybe we put the database there”, put your networking stack there if it’s not part of your platform, put your CDN configuration, your S3 buckets, put those stateful resources… We call that the static layer; the layer that changes very infrequently.

And then we have the dynamic layer, which is where - if I’m a developer and I’m pushing up a feature fix, I want to get a really quick deploy on the platform, using the cloud environment to see how it performs. And we write some of our pipelines so that we can have multiple copies of the application stack - multiple features being worked on at the same time. And the majority of our teams have something like that. It’s analogous to what you’re talking about with your dev, using local as a dev. I’ve worked in an environment like that, which is really great. One of the challenges of that - this is one of the tradeoffs - is that you don’t get to see how it performs in the cloud environment.

Now, with Docker, that’s pretty much a given these days. If it performs locally on the laptop, it’s probably going to work well in the cloud. But there are certain features that maybe Docker is not able to replicate, like SQS or SageMaker, and how those things are going to perform.

And so sometimes having a sandbox area in the cloud environment can be really helpful. We don’t mandate that teams use all of our dev and stage and prod pattern. The only thing we mandate is that you have a production account, and that you have some way of doing things outside of production to be able to test with as well. But that’s a really good example of how to approach a problem; if you trust your observability, then why not push straight to production? I think that’s really compelling, and I’m going to get some mileage out of that, talking with some of my team members. Thanks.

[56:19] I don’t want to interject, but I have a comment about the people side of what you all were just talking about too, that I think is important. So I don’t know if you have questions you want to go off on, Tom, before I interject with that?

No, no, this is very important. The people side - it’s like the lynchpin that holds everything together. Go for it, Gunnar.

One of my favorite things about your show, Gerhard, is that – I can’t remember the exact phrase, but you always talk about the people that make it happen, because the people are what matters. And Tom and I have had a lot of conversations - we meet weekly - about how you can only ship as fast as the people are comfortable with. We’ve moved from a legacy platform that had a very different way of deploying - it was very manual, very much manually verified; it was done every two weeks, when you’d merge a large batch of changes… Over eight years you bake that kind of comfort and human process into the system, and then you try to unwind that. You enable the system to move at a 30-second or 90-second deploy, but the company is used to very minute checks, because the system was fragile and bugs would happen all the time that you couldn’t catch, because the testing wasn’t very good, and there were issues here and there… You end up with so much concern, and so much process of checking different things, that it’s hard to unwind, because unwinding it makes people uncomfortable. And so a lot of what we’re trying to do is help people feel comfortable with that change and that speed, and bake that into how we actually go about doing this in a faster way… Because you have to make people comfortable with it, and show and prove that you can deploy at that speed with resilience. It’s hard to flip a switch overnight and go, “Hey, we’re going to deploy every 30 seconds straight to prod right now”, because people, product, everyone in the company who’s seen the thing will go, “Whoa, that’s a little scary.” So you have to kind of work at it slowly and build up that trust to move there.

There are ways of working that Tom’s describing that certain teams can’t use, because Tom’s platform has been built without – it’s not seen by the company, or things of that sort. But my teams deliver value straight to the customer, and we still have legacy systems with a lot of that manual work that things integrate with, and testing around that, so there’s just a lot of unpacking how we’ve done that. We’ve made improvements there, but that’s a huge part of this process of speeding up deploy workflows and pipelines - getting the people on board. And I feel like my role is usually to ask the question, “Well, why can’t we do that? We used to deploy every other Monday at 8:00 PM. Why don’t we deploy at 3:00 PM, during the middle of the day?” It’s like, “Whoa, I don’t–” There’s a lot of just, “I don’t know. I don’t know.” It’s like, “Well, let’s try it and see if it works. And if it doesn’t, if it blows up, then we’ll move back and we’ll figure out what works.” And then you just kind of slowly keep peeling the onion to try to get to the point where you can move as fast as you want to.

That’s to your point, Gerhard, of having a metric, measuring it, and saying, “Is this working or not? Can I do deploys straight to production or not?” I think that if you have that number – I specifically had this conversation in the early days with our core team; they came and said, “How can we do the dev-stage-prod pattern? How can we do some of our testing with it?”, I think was the conversation. I said, “Well, one way to do this is to merge everything to your main branch and then run it. And if you trust your health checks, then there should be nothing to worry about. It’s going to go out, and if it blows up, it will halt, and you’ll roll it back and take the fix, right?” And they went off as a team and talked about it, and they said, “Yeah, we want to go down that path”, and now that’s how they run all of their code deployments. They have really good testing, and I think there’s a great feedback loop there, of making sure that “If it breaks, we add more tests. Or we change something about the way we did it, because we didn’t have enough safety that it wasn’t going to break.”
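A minimal sketch of that “trust your health checks” flow might look like the following - the health endpoint, deploy wrapper and rollback script are all hypothetical stand-ins for whatever a team actually uses:

```yaml
# Hypothetical merge-to-main deploy: ship it, verify a health check,
# and roll back automatically if the check never goes green.
name: deploy-on-merge

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy the new version
        run: ./scripts/deploy.sh production        # hypothetical deploy wrapper

      - name: Health check
        run: |
          # Poll a (hypothetical) health endpoint for up to ~5 minutes.
          for i in $(seq 1 30); do
            if curl -fsS https://app.example.com/healthz; then exit 0; fi
            sleep 10
          done
          exit 1

      - name: Roll back on failure
        if: failure()
        run: ./scripts/rollback.sh production      # hypothetical: redeploy the last known-good version
```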

[01:00:02.07] That makes me so happy to hear, because that’s exactly it. The way you’ve put it, Tom, and the way you’ve put it, Gunnar - that’s exactly it. It’s getting people comfortable, getting people confident, getting people to feel like what they do has an immediate impact. They don’t have to sit on that for too long. What they do matters, and it can be seen day in, day out, multiple times per day. You don’t have to wait two weeks to figure out whether what you think actually works. And how do you get that feedback from your users if you can’t show them what you have in mind and ask them, “Is this what you meant? Is this what works for you?” And they say, “No. Close, but no.”

Back to your point - if it takes you 45 minutes to go through just a small portion of that loop, there’s only so many learnings you can have in a day, in a month, in a year, and that’s what it all comes down to. And it’s not like you’re doing it right or wrong, it’s whereabouts you are on that spectrum. It’s always a spectrum, right? And you are going in the right direction. It’s the small improvements that you have to make day in, day out, because all of them take work.

Compound interest is an amazing thing. Keep doing that. And eventually, you dream it, and it’ll be out there. Maybe not quite like that, but that’s the dream, right? You imagine it, and they will tell you instantly whether it’s a yes or no. And iterate so quickly that everybody gets what they want just like that, and it works for everyone.

But there are so many learnings that can be had from this… And even we, Changelog - it’s a fairly simple app, a monolithic app. Our deploys used to take 10 minutes. Now they take three minutes. Eventually they may take a minute, and then go straight into prod. So how do we get there? It’s that journey that each of us is on. What do we learn on the journey, and do we then share those learnings? So I think we’re ticking a lot of boxes right now. That’s how I feel.

One of the things you remind me of is that I’ve had this realization over the course of my career that we create a lot of problems for ourselves through process that we add, and things like that, and it’s important to always ask, “Why am I doing this this way?”, and for everybody to understand why we’re doing it this way. I don’t know if anyone has ever talked about the gorilla story. I can’t believe I’m going to mention this.

No, but tell us the gorilla story. So this is not the chaos gorilla, right? This is something else.

I was a band kid growing up, and so I had a band director when I was in college who used to use the phrase “Go for the bananas”. And the story is that there– and I think this is something you can go find on the internet, but there was a famous study of gorillas with bananas on a pedestal in the middle of the enclosure where the gorillas were. And every time any one of the gorillas would go try and eat the bananas, all the gorillas would get sprayed with water. Gorillas hate rain. They hate getting sprayed with water. It’s very unpleasant for them. So over time, they stop going for the bananas. Then they introduce new gorillas into the enclosure. The new gorillas don’t know about the bananas and the water, and so they go after the bananas and everyone gets sprayed with water. And so then they learn too that you don’t go for the bananas.

And then eventually they take the old gorillas away and they add even more new gorillas, and they realize that the new gorillas, when they try to go for the bananas, the second-generation gorillas would stop them from going for the bananas. They would restrain them. They would beat them up. They would make them stop, because you don’t go for the bananas. That is not a thing. The bananas are evil, you don’t touch them.

And so from this I think we can discern, similarly with humans, that if you beat some people up and say, “That’s not how you do it”, or “You must do it this way”, or “This is the way everyone else does it, so we have to do it that way”… If you do that, and especially if you do that from an authoritarian perspective in an organization and say, “You cannot use these tools, because we won’t allow you to do that… Because everyone else is using Lambda, so therefore you must use Lambda”, sometimes standardization and other things like that can actually really hamper the organization, because they prevent you from finding new and innovative ways of fulfilling the objectives of the organization. And so it’s important to always have information about why we do certain things.

My team recently – we have new team members, and we weren’t consistently running our Terraform plans. And I said it’s important that we run those Terraform plans, because they are one of our primary sources of quality assurance for our software; we don’t have other testing types that we’ve heavily invested in right now with our Bootstrap. And once we did that, we realized we had better deploys and more consistent feature results, because everyone was looking at those plans. But we had to establish why we do it this way in the first place, so that everyone understood that it was important, or what was important. I think that’s the thing that’s really key.
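For anyone who wants to make that habit automatic rather than relying on memory, one common pattern - sketched here with an assumed repo layout, not necessarily how the Bootstrap pipeline is actually wired - is to run the plan on every pull request, so the output gets reviewed alongside the code:

```yaml
# Hypothetical PR check: produce a Terraform plan for reviewers,
# since the plan is acting as a primary quality gate here.
name: terraform-plan

on:
  pull_request:

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform plan -input=false -no-color
```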

[01:04:19.17] I think you were asking earlier about how we document some of these things… It’s important to define some of that stuff. One way that we do this, which I’ll offer to listeners, is that we have a centralized repository of documentation that we keep for the company, that we call policies, but they’re really just documents that describe best practices. So for example, we have a document on networking best practice, and what the ways are to approach networking that are both secure and competent, and all those kinds of things. We have a policy around documentation, and it lists the appropriate places in the company to put documentation so that others can find it. You’re not restrained by that, but these are encouragements of where to find things and look for stuff. We have documents around particular parts of our process, around best practice for incident management… You know, make a Slack channel, make a Zoom call so everyone can join, those kinds of things. I don’t think we specifically say Zoom and Slack; we say “Have a way of communicating with people.” And I think those are the important things - why do we do it this way, why do we record that stuff?

Sure. So do you go for the bananas, or don’t you go for the bananas? Yes or no?

You do go for the bananas.

Always.

Yeah. And that’s what my band director would say - you always go for the bananas, and you don’t just get restrained by institutional knowledge. And my father would say, “I think the seven or so worst words in an organization are ‘That’s the way we’ve always done it.’”

Oh, yes.

You should never just say, “That’s the way we’ve always done it”, and leave it at that. You should always be questioning that, and seek to understand why.

So we have Tom’s key takeaway. I was going to ask him for one, like what’s important for our listeners, but that sounded to me like it was it. Do you have one, Gunnar?

As someone who – I’m a manager in the company; I put process in place and things of that sort. And I’ve had the same sort of takeaway as Tom, and instead of gorillas and bananas - because I have a sports background - I’d use a football analogy: if you’re a team, and you call a passing play, and you throw it, and you throw an interception… You don’t go, “Alright, we’re never running that play again. It ended poorly”, right? And I think that’s sometimes what process can be. It’s like, “Okay, we did this thing. We should never do this ever again”, or “We always need to do this before this.” And it’s really easy to put process in place, but it’s really hard to unwind it.

What happens when you do throw an interception is you go, “Well, we didn’t practice this well enough”, or “We needed to better read what the defense was doing”, and things of that sort. It’s more about that retrospective - elaborating on what you’re supposed to be doing and continuing to push forward - and not, if you fail, just writing it off and never touching it again.

And so that’s been the thing for me; that’s something I always ask, and I ask my team about – I have random sayings… You know, when we have an issue or something in production, there’s that desire, that pressure from the company, where it’s like, “Well, how are you going to make sure this never happens again?” It’s easy to say, “Okay, we’re going to do this and we’ll never push this button again”, et cetera. But with the team, it’s like, “Did we really just throw an interception here? Is our process good, and do we just need to think about what actually went wrong here and practice this a bit more?” and things of that sort. So that’s been my takeaway as well from this. It’s pretty similar to Tom’s.

Well, I would like to thank you both very much. This has been an amazing conversation. It went in a direction I was not expecting, and I’m glad it did. I think we got so much out of it. I appreciate your time. Thank you very much for sharing your story with us, and I’m looking forward to all the reading and all the interesting links that I’m sure you have, and all the resources, because there’s so much more to this. This is just the beginning, in my mind. So thank you, Gunnar, and thank you, Tom.

Thank you, Gerhard.

Yeah. Thank you.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
