Ship It! – Episode #14

Cloud-native chaos engineering

with Uma Mukkara and Karthik Satchitanand from ChaosNative


In today’s episode, Gerhard is joined by Uma, CEO and co-founder of ChaosNative, as well as Karthik, CTO and also a ChaosNative co-founder. They talk Chaos Engineering and Litmus.

Chaos Engineering is not just for super SREs. It is not meant to prevent outages. And, it is not just about hardware. Chaos Engineering is about testing how reliable your systems are. It’s meant to show you how things fail, including when other dependent systems fail - think cascading failures. This is a good way to discover inconvenient truths about that beautiful code that you wrote. Everything fails, and great insights are to be found when it does.





If the network is not reliable, and that was a thing that it is, they will be in for a surprise. Unless you’ve had some network outages, or some packet loss, or even like at your home, you always think it will work, there won’t be any problems. Well, it doesn’t work like that in reality. Disks fail all the time. The more infrastructure you have, the more you realize just how often disks fail.

CDNs fail all the time. In episode #10 we talked about how Fastly failed for a few minutes and half the internet went down. That was an interesting one. So how do you know that the beautifully-crafted code that you ship continuously, it’s test-driven, it’s beautiful, it gets out there - how do you know that it’ll continue serving its purpose when failures happen? Because they will happen. I think, at least in my mind, that’s where chaos engineering comes in. But what is your perspective, Uma?

That’s right, the chaos engineering - it was being viewed in a little bit different way earlier. It was purely for stopping the outages; SREs were tasked with the - you know, “I tried everything, but now can you do something? You are a super-SRE.” “Okay, let me do chaos engineering.” But I believe it’s changing.

[04:15] Chaos engineering is a little bit more than a tough piece that is meant for super-SREs. Now chaos engineering is more of a good and easy and a must-have tool for DevOps, as long as you’re trying to improve something on reliability. You’re right, it’s about reliability; nothing is reliable, not just networks. Almost anything that you deal with, different software is unreliable. It is reliable to some extent, but not a hundred percent.

So I believe chaos engineering is still an evolving subject, and it has evolved in the last few years from being purely a fascination for an SRE, as an expert subject for an SRE, into more of a must-have, good-to-have tool for all sorts of roles, ranging from SRE, all the way to developer. That’s my observation at least.

Okay, chaos engineering is important. It’s an evolving topic, it’s a field that changes quite a bit… It’s chaotic (pun intended). So why is it important, Karthik? Why is chaos engineering important for shipping code and writing code? What is the link there?

I think you mentioned Fastly going down for a few minutes, and that took half the internet down with it… And I’m sure it has cost a lot. Downtimes are extremely costly. You would dearly want to avoid them. And there is enough motivation for you to test how reliable your systems are.

Like Uma mentioned, it’s not something that you only do in production, though, to be fair, that is where the benefits of chaos engineering have been most realized over the last decade or so… But it is important that you go ahead and test your systems, because there is so much changing there in your deployment environment all the time.

In today’s microservices world, the application that you’re deploying in your deployment environment - it could be Kubernetes, or it could be somewhere on the cloud - there are so many other moving parts that you depend on to give you that wholesome experience for the user. Things that help developers, support, and SREs, and things that are viewed by the end user. There are so many components in the deployment environment which cater to different audiences and help run the entire system… It’s possible any of those components may go down, resulting in varying degrees of degradation in user experience. It could inhibit your support team from serving customers better, or the customers might be directly impacted, not being able to use your service. That is something we would like to avoid.

Chaos engineering is a lot about learning your systems as well. Many times we assume certain things about the infrastructure while developing code, which turn out to be untrue when you’re actually deploying it… And you really want to know what’s going to happen when things fail on the infra side. So yes, I think that is really why chaos engineering is important.

So taking that, how do you chaos-engineer a CDN? That’s just one that you have in your system… How do you apply chaos engineering principles to test the resiliency of your CDN? Can you do that even?

[08:07] I think ultimately you would host your CDN on infrastructure that you’re either putting on your own data centers, or on the cloud. So anything that’s powering software is ultimately built on some platform… And you could go ahead and start off by checking what happens when something fails on the platform. It could be a disk, it could be a network, it could be some resource exhaustion that you’re seeing on the platform that is hosting the CDN. And I’m sure when you’re building the data network, you’re still ensuring that data is spread across different machines, different regions or areas, and you’re somehow building some amount of resiliency into how the data is served [unintelligible 00:09:00.20] end users as part of a CDN.

So you can check the extent of high availability that you have built in by targeting some very simple infrastructure faults. I would say that would be a good starting point.

Okay. What do you think, Uma? Anything to add?

I mean, the CDN is a complex topic. Which part of a CDN are you talking about? Delivery, your networks need to be reliable, your supporting infrastructure needs to be reliable, and the software that runs the CDN needs to be reliable.

The idea of applying chaos engineering to your CDN is to improve something that’s already mostly reliable. Today, a CDN is reliable. We all work on the internet. But services like Fastly do go down once a year, or even less often.

Yeah, the first time in five years for us.

[unintelligible 00:10:02.10] “Hey, something has happened, even though I applied chaos engineering to it.” In reality, it’s not that simple, in my opinion. Site reliability itself is engineering. Chaos is engineering. So engineering comes with understanding what’s going on, and there’s no unique way of saying that this is exactly how I’m going to fix it. It’s going to depend on what the problem is in that given situation.

So I would say you can apply chaos engineering not only to a CDN but to any other system, by really looking at the way the services are architected or deployed. Look at the services and see, “Is there something that I can see as low-hanging fruit that’s either doubtful about reliability, or constantly causing me trouble? Let me go and attack that, debug that.” Then one way to debug that is “Can I actually introduce a fault on the scene?” So you need more ways of reproducing the faults, and then you go to your SREs. SREs generally go and try to fix stuff, create quick recovery points, or try to avoid that dependency on that failure… But really, you need to go back to developers to really fix the root cause of it.

So if you ask me to summarize the whole chaos engineering for a CDN, it needs to be at different levels: infrastructure, and infrastructure again is storage and network. If I recollect some of the scenarios that I heard of, it’s always about slow storage that caused a bigger issue all of a sudden, when the storage slowness had never happened before. Or networks usually are very tolerant in terms of faults, but still, double/triple faults can happen.

[12:18] So one is about verifying how reliable your infrastructure dependency is. Try to introduce some slowness intentionally and keep verifying that your CDN continues to work. That’s one level. The other level is take a look at your services and how reliable you are, and then if networks go slow, or storage goes slow, do you have software that is reliable enough to switch on to something else, or do something that’s more proactive to continue serving the data.

So as I said, it’s engineering, and that’s why we need good tools for site reliability engineers. That’s chaos engineering.

That makes perfect sense to me. So if I had to summarize what chaos engineering is in one short sentence, to me that would be the injection of artificial faults - they’re not real, they’re artificial; they’re made, we make them - to see how the system as a whole reacts to those artificial faults. Would you refine that, add something more to that? What is chaos engineering to you in one short sentence?

I’ll probably take a crack at it, and I think Karthik can give probably a better answer. I usually separate chaos and engineering as two different words in my mind. People always think chaos engineering is chaos. To me it’s easy to introduce chaos. Of course, you have now better tools. It’s faster to introduce chaos, but I would give more preference or more importance to the engineering side of chaos engineering. It is always about what should happen when you introduce a fault. A very simple fault, a very simple service, if it fails, how you react to it is always well tested. Your devs, your SREs, user acceptance tests… We are living in the modern day; all those systems are now very modern.

But failures do happen because something unexpected, untested has happened, and now we are looking at chaos engineering as a way to unearth those faults in a willful manner. So what is chaos engineering? In that sense it’s when a fault happens, what should you look for? How do you actually search for a fault? So that’s the steady-state hypothesis. I go and look at what is my steady state; you can look at just one service, or look at many services together. And if you define the steady-state hypothesis that is closer to your business or a business loss, then you will come to chaos.

The tools and the strategy design should go towards thinking more on the engineering side of “How can I avoid a certain loss?” or “How can I unearth a complex scenario or a complex faulty scenario?” and then I can split that scenario into multiple [unintelligible 00:15:37.14] And then that becomes easy, actually. So it’s engineering, that’s the way I look at it.

I’m really looking forward to Karthik’s question, but before that, I would like to ask you, Uma, how do you look at a system? How do you look at the steady state of a system? What do you use?

[15:57] I would generally define the system in the minds of people who are the [unintelligible 00:16:03.27] and what keeps SREs and the management of the SREs up at night. So it’s something that is closer to business criticality, the service. So that’s what the system to me is. It is not really about the technical stack; technical stack comes later, and that’s where we introduce chaos. But the system really is about service and service catalog, hierarchy of services, dependency of services. This is what the system is.

So I would go ahead and define that map, and identify the criticality points, and then start thinking about what happens if I manually introduce a fault: what all will shake up, what can loosen up or what can fail, and who will wake up first before the customers start screaming too much. So that’s what the system is in my view, the one you’re going to apply chaos engineering on.

So what I’m hearing is that not only do you need to know all the services that make up a system, but also what it means for end users to be happy when it comes to using that system. So you define all the services that make the system, and also what healthy means for every single component in the system. And that is your steady state. Steady state is defining what happiness means for your end users, and capturing that somehow - I imagine dashboards, metrics, logs… No?

Yes. Again, it depends on how evolved or structured the system is. It’s really about good dashboards, if you have them, and if you’re using a good service-level objective scheme, then you have a system that you are looking at. But if you’re only measuring how often the faults are happening, and if you are really depending on how happy your customers are as a general metric of how reliable your systems are, then you are in for a surprise. Yeah, you should have a good schematic of the service-level objectives.
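
A minimal sketch of what checking a service-level objective could look like, assuming availability is measured as the fraction of good requests over a window. The names `SloReport` and `evaluate_slo` and the sample numbers are illustrative assumptions, not something discussed in the episode.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float      # fraction of good requests, 0.0..1.0
    budget_remaining: float  # fraction of the error budget still unspent
    met: bool                # True when availability >= objective


def evaluate_slo(good: int, total: int, objective: float) -> SloReport:
    """Compare measured availability against a service-level objective."""
    if total <= 0:
        raise ValueError("total must be positive")
    availability = good / total
    allowed_bad = (1.0 - objective) * total  # error budget, in requests
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% objective leaves no budget: any failure exhausts it.
        remaining = 1.0 if actual_bad == 0 else 0.0
    else:
        remaining = max(0.0, 1.0 - actual_bad / allowed_bad)
    return SloReport(availability, remaining, availability >= objective)


# Hypothetical window: 100,000 requests, 50 failures, 99.9% objective.
report = evaluate_slo(good=99_950, total=100_000, objective=0.999)
print(report)
```

With an objective in place, a chaos experiment has something concrete to be judged against, rather than a general sense of whether customers seem happy.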

That is a great answer, very complete. A lot more comprehensive than I was expecting, but it was very, very good… Which comes back to Karthik. The question was - I know we talked a lot, so let’s restate the question… The question was “What is chaos engineering in one short sentence?”

I think we are living in the times of the pandemic, so let’s call it “Injecting harm to build your immunity.” Just that instead of injecting harm into human beings, we’re doing it on systems. So I would define chaos engineering as that. Uma made a good point about steady-state hypothesis - I think when Netflix and Salesforce and Amazon, all these folks put together the principles of chaos a long time back, they made it the central piece of the discipline of chaos engineering, along with recommendations to try different kinds of faults, and run chaos continuously… Because you never know when the system behaves in what way, because of what change induced into it.

So yes, I think chaos engineering is a lot about scientifically trying to understand or mapping user happiness to metrics and logs and events; steady state can be very diverse, and in today’s age, that diversity has just increased. You could be talking about metrics, you could be talking about the availability of some downstream service, or it could be something on your clusters. So we are talking about resources in Kubernetes, it could be the state of a resource… And there are custom resources that extend the traditional Kubernetes capabilities to a lot of domain-specific intelligence, so being able to validate that info is also part of steady state…

[20:17] So I think yes, chaos engineering is about willful fault injection, like you mentioned, Gerhard. Artificially inducing faults in order to verify how the system is behaving, and have good means of identifying the [unintelligible 00:20:30.16] steady state, and checking whether it is within tolerable limits or not.
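
The loop described here (measure the steady state, inject a fault, check whether the system stays within tolerable limits) can be sketched roughly as follows. This is a toy illustration under stated assumptions: `run_experiment` and `ToyService` are hypothetical names, the "fault" is a flag on an in-memory object, and the tolerance is a relative deviation from the baseline.

```python
from typing import Callable


def run_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
) -> dict:
    """Run one chaos experiment around a steady-state hypothesis.

    Hypothesis: the measured value stays within `tolerance` (relative)
    of its baseline while the fault is active.
    """
    baseline = measure()
    inject_fault()
    try:
        during = measure()
    finally:
        rollback()  # always undo the fault, even if measuring fails
    # A zero baseline cannot be compared relatively; treat it as a violation.
    deviation = abs(during - baseline) / baseline if baseline else float("inf")
    return {
        "baseline": baseline,
        "during_fault": during,
        "deviation": deviation,
        "hypothesis_holds": deviation <= tolerance,
    }


class ToyService:
    """Stand-in system: success rate drops while a fault is active."""

    def __init__(self) -> None:
        self.faulty = False

    def success_rate(self) -> float:
        return 0.95 if self.faulty else 0.999

    def break_disk(self) -> None:
        self.faulty = True

    def repair(self) -> None:
        self.faulty = False


svc = ToyService()
result = run_experiment(svc.success_rate, svc.break_disk, svc.repair, tolerance=0.01)
print(result)
```

In this toy run the success rate deviates by roughly 5% against a 1% tolerance, so the hypothesis does not hold, which is the signal to go back to the drawing board.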

Then it’s all about doing it continuously, then going back to the drawing board, fixing your application, business logic, or maybe your deployment practices, coming back and [unintelligible 00:20:47.09] proceeding with the next possible outage that you can think of.

Break: [20:54]

This doesn’t happen often, but I was talking to one of our listeners, Patrick F. in Slack, and he has a question - more like a suggestion - which I think is a very good one to bring up in this interview, in this conversation, in this episode. Patrick is saying that he would love to hear about practicing inefficiencies or applying non-best practices in small doses. I know it’s not exactly the chaos engineering that we discussed, but I can see an overlap between doing the wrong thing on purpose and chaos engineering. What do you think, Karthik?

I think it makes sense, and I think this is especially true when you’re trying to find out how good your security systems are. There’s an entire new category, or a subcategory within chaos engineering, for security chaos engineering, in which people are trying to find out how reliable their systems are in terms of security by introducing some vulnerabilities deliberately.

I can relate a lot to Patrick when he says running things in the non-best practice way. You can run privileged containers, mount [unintelligible 00:23:05.27] and basically try and see how your system behaves; is it being called out? Do you have the right policies that restrict you from doing so? These are things that you would want to find out, and not just for security. I think that’s probably one thing that comes to mind straight away… But even for other scenarios maybe. We talked about running single replicas of applications.

Sometimes you would want to see what is the recovery time of your app. Let’s say you were not running multiple replicas of an application; you were just going with a single replica, and there was a failure. You might want to figure out how best or how quickly you’re able to recover. Maybe reschedule and bring up once again, register [unintelligible 00:23:52.02] and then start serving data once again. How quickly does this happen?
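
One way to put a number on "how quickly does this happen" is to poll a health check after the failure and time how long it takes to pass. The sketch below assumes a caller-supplied `is_healthy` function; `measure_recovery` and `FakeClock` are hypothetical names, and the fake clock only exists so the example runs instantly instead of sleeping.

```python
import time
from typing import Callable, Optional


def measure_recovery(
    is_healthy: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until healthy; return seconds elapsed, or None on timeout."""
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)


class FakeClock:
    """Simulated time source so the demo does not actually wait."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Simulated single replica that becomes healthy on the fourth check.
fake = FakeClock()
checks = iter([False, False, False, True])
recovery_s = measure_recovery(lambda: next(checks), 10.0, 1.0, fake.time, fake.sleep)
print(recovery_s)
```

Against a real service, `is_healthy` would be whatever readiness signal you already trust, and the default clock and sleep would be used.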

[24:00] Sometimes you might want to run in modes that are not classified as the best practices. You would still learn a lot about your system by running that way. So that’s something that should be done, but most probably on staging environments or development clusters, because you would not want to attempt this in production… Because these are things you would still learn anyways while you’re running it even in a non-prod environment.

Anything to add, Uma, to that?

Yeah, it’s actually a very interesting question… You were saying Patrick is asking “Should we implement non-best practices or inefficient practices?” I’m saying the same thing when I say chaos is a best practice. It’s a must-have. That really means that you in turn use non-best practices in production [unintelligible 00:24:53.02]

So your best practice is do everything right. Chaos engineering says “Break something. Don’t assume that everything will happen.” So the best practice is to have chaos engineering. That means the best practice is not to follow always the best practices that you are asked to follow. And the result of breaking things on purpose or willful fault injection - you will improve your best practices. That means you did follow some non-best practice, and that unearthed something, so you tuned your best practices.

So I would say he is 100% right, and he’s just put it differently. We are putting chaos engineering as a more polished word, but it’s an absolute thing. No one can tell everything will work well.

I always keep going back to how many learnings I personally used to take from fire drills, or even Red Team Thinking. That was a very powerful one. But taking a step back and summarizing this - you tend to learn more from failures than from successes. So when you fail, there’s a lot of learnings there. When you succeed - sure, but maybe it doesn’t feel as significant. Maybe also because of the loss bias, I think. When you lose something, it feels worse than when you win something. I think it’s rooted in that loss feels bigger, like “Oh, what?! My database was deleted?! Oh, no!” Versus “The migration just worked. Sure, it’s okay. No big deal.” I think that’s the way to think about it.

Okay, so that was a good one… Hopefully, Patrick got what – well, not what he was expecting, but got something good out of this. Now, I would like us to go into a specific use case, and I keep bringing this one up… The application. We are in a unique position to be able to experiment and learn new things in the context of the app that runs all our shows, all our podcasts. That’s pretty unique as far as I know… So it is a monolithic, three-tier application. There’s a frontend, a backend and a database. It’s single instance, for various reasons. Episode ten has all the details. And I’m wondering, if we were to start using chaos engineering practices, which from what I’m hearing, they’re mostly targeted towards microservices; I think that’s where they shine… But what chaos engineering practices could we use for our application, just to see how resilient it is?

I think chaos engineering is as applicable and important for monolithic applications as it is for microservices. Sure, I think its adoption has increased because of all this paradigm shift to microservices, and the fact that you have more possible failure points; the surface area for failures is much larger with microservices… But that’s not to say that it cannot be applied in principle to monolithic applications.

[28:01] In spite of being a monolith, there is some amount of dependencies that you would still have… Let’s say infrastructural dependencies. We talked about databases being used as part of the stack; it’s very much possible that the disks become slow, your writes become very slow, it’s possible that you have space getting filled up, you don’t have space anymore to write things. How are you going to behave as an application that’s probably very read-intensive? You are having some problems, but do you still have enough in place to keep the users happy until you are able to recover your systems manually?

So this is something that you would still check, even if you were running a monolithic application. And that’s true for a lot of other infrastructure components as well. When you do chaos engineering, there are two ways of deriving the scenarios to get started with chaos. One approach is a completely explorative approach; you take a look at the system, you identify “These are the things that could go wrong”, and then you start going out and doing those controlled failures and observing your system and how it behaves.

The other way of deriving scenarios is to look for data, historic data of what has gone wrong before, and what is the most problematic area. How many times did I have to grow my volume? How many times did I have to increase the CPU cores on my system? When there was a lot of interest, a lot of reads, a lot of traffic, what was the component that I needed to be most careful about, which displayed - not erroneous characteristics, but characteristics that you would not identify as optimal behavior. And then you go ahead and derive the scenario from there and go ahead and do it.
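
The data-driven approach can be as simple as counting what has gone wrong before and starting with the most frequent problem. A rough sketch, assuming incidents are already recorded as (component, kind) pairs; the `rank_scenarios` name and the sample history are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_scenarios(incidents: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count past incidents per component and kind, most frequent first."""
    counts = Counter(f"{component}:{kind}" for component, kind in incidents)
    return counts.most_common()


# Hypothetical incident history for a three-tier application.
history = [
    ("database", "disk-full"),
    ("database", "slow-writes"),
    ("app", "cpu-exhaustion"),
    ("database", "disk-full"),
    ("database", "disk-full"),
    ("app", "cpu-exhaustion"),
]
ranking = rank_scenarios(history)
for scenario, count in ranking:
    print(f"{scenario}: {count}")
```

Here the top entry suggests a disk-fill experiment against the database as the first scenario to try.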

So that pattern is common for both monolithic, as well as microservice applications… But the general concept of chaos engineering still applies here, too. It’s just that the failures here might be more tied to the infrastructure, rather than something that you would think of in case of a microservices world, where the dependencies and co-services that you are running along with your main business app offer as much fodder, or as much possibility of failures, as the hosting infrastructure does, I would say.

So what tools could we use to do all those things? Is there a tool that you would recommend that we pick up and try simulating these scenarios, or faults, whatever you wanna call them?

Yeah, you were asking the two creators of the LitmusChaos project what they would use… Of course, we both recommend –

Maybe not LitmusChaos…? [laughter] It can happen… Unlikely, but…

If you want to run into real chaos in chaos engineering, don’t use Litmus, but if you want to stay organized in chaos engineering, you might choose Litmus.

Yeah, the idea of LitmusChaos is to make sure that we provide a platform, not just an experiment. As I mentioned earlier, chaos engineering is real engineering. You go through managing the experiments, you’re managing the steady-state hypothesis logic, and you keep changing it. You’re not happy with what you did the last time. So how do you manage it? In your system there are multiple versions of it.

[31:42] We needed a platform in our prior work life, that’s when we looked for some good chaos engineering tools and started writing Litmus, and it became more widely adopted. I would say you can start with Litmus, and Litmus is just a chaos engineering platform… But for you at Changelog I would also recommend best practices as – first of all, you need to play the role of a person outside the system; try to discover, don’t assume too much about how your system works. Start with [unintelligible 00:32:15.27] apply the logic of “Something will break when I do something crazy.” That’s what is [unintelligible 00:32:24.09] and then that brings some good unknowns hopefully the first day, and then it shakes up your co-workers, and your management, and then you start putting a better, holistic approach.

Then I would also say as a prerequisite you need to have good metrics, or a dashboard, even before you apply chaos engineering. Do you have a good monitoring system? Because when you actually do apply, it breaks, but then you need to be able to take care of observing what has gone wrong and “What do I do now?”

So it all goes hand in hand, and discovery, reliability metrics, an observability system - all those things need to be in place, and then start with probably the backend, in infrastructure. And even though it’s monolithic, you can still apply some service-level chaos such as push too much traffic into one of the services that you use less, but that can cause stress on overall systems… And then there is a lot that you can do proactively in your pre-production environment. Try to start there and learn, and then go from there either right, or left, into production. You may find something that you can improve on your pipeline, so you can go on and introduce these failures into your pipeline. That might be a good place for the overall efficiency of your DevOps.

So when it comes to starting with the Litmus platform, I imagine we would need to have an account on this platform? It’s not something that we would run, is that right?

Litmus is a Kubernetes application. It’s not SaaS. So it’s a Kubernetes application, completely open source. It’s a CNCF project. You take and install Litmus on Kubernetes. It’s [unintelligible 00:34:29.07] you can log in and you connect wherever you want to run chaos. From there you connect to chaos center, and you can then pick up a chaos experiment or a fault, and direct that fault towards your target or to the agent.

You can run it on your existing Kubernetes, or spin up a small Kubernetes cluster to run Litmus. It is quite thin, but it is a Kubernetes distributed application. You can scale it up. If hundreds of your QA SREs are using a single instance of Litmus, it can scale up easily.

Do you install it as a Helm chart? Is there like an operator that comes with its own CRDs? How does it get installed on Kubernetes?

Yes, you’re right about that. You do have a Helm chart that helps to install the control plane of Litmus. As part of the setup process of the control plane you would go ahead and set up the account. The account is most probably about the users, who’s going to do the chaos…

[35:53] The next part is about the agent infrastructure. This is the environment you’re going to actually do the experiments in. This can be the same place where you have the control plane installed. Uma mentioned that Litmus runs as a Kubernetes app… Or you could have other clusters in your fleet, where you want to do chaos, so you would be registering that into the portal. And that is where the operators and CRDs get installed, as part of the agent setup, and you can then go ahead and construct scenarios, or workflows, as we call them, to the Litmus center, the chaos center, and then they get executed inside a cluster where the agent takes responsibility of playing the manifests, the custom resources, and then reconciling them, and then actually doing the fault injection and steady-state validation process.

So I’ve seen somewhere - I don’t remember where - Argo CD being somehow related to this as well… What is that relationship between Litmus and Argo CD?

We use Argo workflows as part of the chaos scenario construction. We chose Argo workflows for its flexibility to order or sequence faults in different ways [unintelligible 00:37:10.13] We’ve instrumented the Argo workflows with some Litmus intelligence. The containers that carry out the steps within a workflow understand it as API. So they are [unintelligible 00:37:23.01]

The Argo CD part - I’m sure you might have heard of it more around the GitOps support that Litmus offers.

When we built Litmus, one of the things we wanted to do was somehow weave the chaos engineering aspects into the standard GitOps flow that people are beginning to use… People are trying to use GitOps to ensure the applications and infrastructure are maintained from a single source of truth, that is Git, and to ensure that what is on their deployment environments matches what is in their source. And there are controllers, also called GitOps operators, which ensure that your applications are upgraded whenever they change in the source etc.

Oftentimes we see that people who’ve upgraded applications in their environment [unintelligible 00:38:17.22] or they have deployed new infrastructure want to verify its sanity. And one means of verifying sanity is by performing some chaos experiments, along with a specific expectation of what’s going to happen. And they already have a hypothesis in mind that they burn into the experiment definition. The experiment has the ability to specify validation intent within it.

People want to do those sanity checks whenever they’ve upgraded their infrastructure or upgraded their applications, and it was done in a manual way, so we wanted to automate that and provide these users with a means to run chaos experiments automatically when something is changed via the GitOps operators. That’s when we brought about the event tracker functionality within Litmus. It runs as a separate microservice in your cluster.

So whenever Argo CD upgrades your application on the cluster, you have the option of triggering a predefined or a presubscribed chaos workflow against it. That happens via a call to the chaos center from the event tracker service running in your cluster.
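
The flow described here (an upgrade happens, and any chaos workflow subscribed to that application gets triggered) can be modeled in a few lines. To be clear, this is a conceptual sketch and not the Litmus event tracker or its API: `UpgradeEvent`, `on_upgrade`, the subscription map, and the workflow names are all hypothetical, and the `trigger` callback stands in for the call to the chaos center.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class UpgradeEvent:
    app: str
    old_version: str
    new_version: str


def on_upgrade(
    event: UpgradeEvent,
    subscriptions: Dict[str, List[str]],
    trigger: Callable[[str, UpgradeEvent], None],
) -> List[str]:
    """Trigger every workflow subscribed to the upgraded app; return their names."""
    if event.old_version == event.new_version:
        return []  # nothing changed, so no sanity check is needed
    workflows = subscriptions.get(event.app, [])
    for name in workflows:
        trigger(name, event)
    return list(workflows)


# Hypothetical subscriptions: which sanity workflows run after which app changes.
subs = {"web": ["pod-delete-sanity", "disk-fill-sanity"]}
triggered = []
ran = on_upgrade(
    UpgradeEvent("web", "1.4.0", "1.5.0"),
    subs,
    lambda name, ev: triggered.append((name, ev.new_version)),
)
print(ran)
```

The point of the design is that the sanity check is tied to the change itself, so nobody has to remember to run it by hand after a deploy.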

So that is the relation that we have with Argo CD, and it is true for other popular GitOps tools as well. It could be Flux, or Keel, or you might have built in something with your own – you might have written some tooling by yourself, using Helm… So you have the option of triggering Litmus experiments or workflows as sanity checks post a standard GitOps operation.

[39:59] There’s another angle to it… Litmus also supports GitOps for the chaos artifacts. When you construct chaos scenarios, these workflow manifests can also be stored in Git or committed into Git automatically. When you make changes to the chaos workflows in your source, you will have those changes reflect on your chaos center as well. So that is another aspect of our way of looking at Litmus with GitOps.

Okay, that makes a lot of sense. I’m starting to form this mental model in my head of how all this fits together in our setup. I can start seeing the integration points… But what I’m wondering now, Uma, is if someone doesn’t have Kubernetes, how would they start even using this?

So when you talk about Litmus, you need Kubernetes to run the chaos center, where the control plane of chaos engineering is put together, where the SREs and developers interact with it, and where you interact with the chaos experiments that are stored on a hub, or on your private Git repository - all that is running as a Kubernetes application. So if you don’t have a Kubernetes environment, and your chaos engineering need is for a non-Kubernetes environment, you just need to spin up a small Kubernetes cluster to host a LitmusChaos center, and then you can still create chaos scenarios or workflows or experiments towards your monolithic legacy applications or the regular infrastructure chaos [unintelligible 00:41:47.27] in a cloud, or on virtual machines, all that stuff. So Litmus does not work only for Kubernetes, it works for everyone… But we’ve built it as a cloud-native application for all the good reasons.

Break: [42:04]

So this is a very special topic for me… The reason why it’s special is because I disagree with Kelsey Hightower about running databases on Kubernetes, and I learned it the hard way (again, pun intended), that if you run databases on Kubernetes, the database needs to be built for a distributed system that comes and goes very quickly, failures are intermittent and they can take milliseconds… It can mess up replication. That’s actually what happened in our case when we ran a PostgreSQL cluster on Kubernetes. We tried Crunchy Data, and we also tried the Zalando operator - so we tried both - and in both cases our replica fell behind. The write-ahead log just stopped replicating, and then the master (or the primary, shall I say) disk filled up, crashed, couldn’t resume, couldn’t restart, because the disk was full, the write-ahead log filled the disk, the replication got broken… And we couldn’t promote the follower to be the leader, because it was too far behind. So we had downtime, we lost some data.

[44:22] So what do you think about running databases on Kubernetes, Uma? I know you have a bit of experience in this area, that’s why I ask you first…

Yes. Litmus [unintelligible 00:44:29.13] trying to fix bugs when you’re trying to run databases on Kubernetes. So I kind of have an opinion that you cannot have an option of not running databases on Kubernetes forever. Five years ago that was not a requirement; two years ago people thought it’s very, very difficult. Now I think there are mixed opinions; there are people running databases on Kubernetes, and there’s a good, active community, the Data on Kubernetes community… Things are improving, and it is an evolving subject, and tools are coming in. Databases are also changing, so the StatefulSets are the root elements within Kubernetes that are enabling distributed databases. But at the same time, there are storage elements that are being built or improved for running databases on Kubernetes.

For example, my earlier project, OpenEBS, which is still a popular subject in this space, has the concept of containerized storage. So you try to consider the storage as a container, an element that is built for running data on Kubernetes. And similarly, there is an element of local PV that is started by Kubernetes itself, and there are solutions being built on top of local PV. What happens when [unintelligible 00:46:09.28] goes down.

So I would say there are people who are running data on Kubernetes. Because the infrastructure also becomes a microservice, you need to understand that there are more failures that can happen. Storage is not guaranteed to be running in one place. It can [unintelligible 00:46:30.01] and how do you actually handle that situation, handle your application to do that? So just assume that it’s not just your pod that can just go off and come back in. Assume that your storage also can go off and come back in. So it’s a natural thing. That’s why your applications just need to be aware of such scenarios and be built for more resilience. Chaos engineering as chaos-first is a principle that can definitely help in all these things.

So hopefully in a few years from now there will be questions like “Oh, we thought data on Kubernetes is not [unintelligible 00:47:09.29] but I see many people running it.” That would be what will happen, in my opinion.

I would agree with that. I think there is a process of – as you mentioned at the beginning of the interview, it’s evolving, so I think the storage, the data layer is evolving on Kubernetes… But also the networking I think is evolving. Because in our case, the one that I mentioned earlier, it was networking, high network latency, very high packet loss, which just messed up the replication in PostgreSQL. So it wasn’t specific to any operator, by the way. It wasn’t Crunchy Data’s fault, it was not Zalando’s fault, the operators themselves - that’s what I’m referring to - it was just the network messing up the PostgreSQL replication. That’s what the problem was.

[47:59] In other cases, for the app itself, when we had a three-node Kubernetes cluster - by the way, we have a single-node one; I know it’s very contentious, but guess what, it works better. So reality says and the practicality says it works better. The point is when we had three nodes, those volumes that should have moved around, the PVs - they didn’t. They were stuck, and they couldn’t get unstuck from the node that went away. And because they remained in these stuck states and couldn’t detach, they couldn’t be reattached to other nodes. So that was a bit of a problem as well, which hit us. I know that things improved and they evolved, but I don’t feel they are there yet, especially if the database was not built to be a distributed one from day one.

What I’m wondering now, Karthik, is if there is such a stateful system, which was built to be distributed from day one, it understands that and it’s in its DNA, is it easier to run it on Kubernetes? I’m thinking maybe a message broker that was built to be distributed. It still has some state, but it works as a distributed system. What do you think? Does that make it easier?

Yes, I think to a great degree it does, but the network problems are not going away anywhere, Gerhard. If you take a look at the Litmus Slack channel on the Kubernetes workspace, network latency and network loss are probably the most popular discussion items. People are trying those experiments much more than they’re trying other experiments… So it is something that will continue to be there. As the network also evolves, with storage and all the other concepts in the cloud-native world, we will still have to address these network problems once in a while.

Message brokers are a good example, and in fact, when we were trying to build some illustration for application-specific chaos experiments with Litmus – so application-specific chaos is a category of chaos experiments in which the experiment business logic has some native health checks that are specific to an app, and they also consist of certain faults that are applied to a particular app. These could be just the standard faults applied within an application context, or they could be some faults that are very native or very specific to a given application type.

The first application-specific experiment that we considered was Kafka. We have some communities that are actually trying out Litmus against Kafka. Strimzi is one of the Kafka providers whom we are speaking with and trying to collaborate on, trying to find good scenarios that can be used as part of this thing.

What is relevant in the message broker world is - let us say you have some very intelligent message broker that is capable of handling message queues, and doing failovers, and doing elections, and things like that… Because here also there is some amount of state involved, so you have storage at play, you have network at play, you have all these things.

One of the scenarios that we got started with was killing a partition leader, which could also be a controller broker. Then you have a series of things happening. You have reelections happening, you’re basically trying to speak to ZooKeeper, and you’re trying to ensure that the failovers happen quickly enough so the consumers’ message timers are not breached, or session timers are not breached. These are things you would still want to find out… These are good experiments you would still do in these kinds of environments. They differ from infra to infra. When we did this Kafka experiment on AWS, with the standard EBS-based storage class, with the AWS ENI, versus when we did it against GKE, with the GCE PD-based default storage class and [unintelligible 00:52:04.04] we saw there was a difference in the recovery times, and we saw that we needed to set different timeouts at the consumer [unintelligible 00:52:12.08]
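
To make the point about recovery times and consumer timeouts concrete, here is a small Python sketch. All numbers are made up for illustration and are not measurements from the experiment described above; the platform names, the function names, and the 1.5 headroom factor are assumptions. It checks whether every observed failover finished within a given session timeout, and derives a timeout with some headroom from the worst observed recovery.

```python
from typing import Dict, List, Tuple


def max_recovery(samples_s: List[float]) -> float:
    """Worst observed failover time, in seconds."""
    return max(samples_s)


def hypothesis_holds(samples_s: List[float], session_timeout_s: float) -> bool:
    """True if every observed failover finished before the consumer session timeout."""
    return max_recovery(samples_s) < session_timeout_s


def suggested_timeout(samples_s: List[float], headroom: float = 1.5) -> float:
    """Worst observed recovery time times a safety factor (an arbitrary choice)."""
    return max_recovery(samples_s) * headroom


# Made-up failover times (seconds) after killing a partition leader on two platforms.
recovery_times: Dict[str, List[float]] = {
    "platform-a": [4.0, 6.0, 5.0],
    "platform-b": [9.0, 14.0, 11.0],
}
session_timeout_s = 10.0

report: Dict[str, Tuple[bool, float]] = {
    name: (hypothesis_holds(samples, session_timeout_s), suggested_timeout(samples))
    for name, samples in recovery_times.items()
}
print(report)
```

With these invented numbers, the same 10-second timeout holds on one platform and is breached on the other, which is why the timeout has to be tuned per infrastructure.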

[52:16] This experiment was a simple [unintelligible 00:52:17.02] You will have the need for chaos engineering in these environments as well, both to learn about the system, as well as prove some hypothesis that you might already have around timeouts and such settings that you have. So to come back to the earlier question, will data on Kubernetes become simpler when application architecture evolves to becoming distributed? Yes, I think that will definitely help… And I’m just trying to tie together chaos engineering there.

The adoption of data on Kubernetes can be accelerated, much in the way general Kubernetes [unintelligible 00:52:53.20] can be accelerated through chaos engineering. There are folks in the Litmus community, and I’m sure there are other projects speaking to such users as well, where they want to use Kubernetes in production, but they are not really confident in doing so. And they want to set up staging clusters, test out a lot of failures; failures on the Kubernetes control plane itself. You have your schedulers or controller managers going for a toss. You have etcd going for a toss. And then you’re also trying to see what happens when you kill pods.

The multi-attach error issue, as we typically like to call it, the volume not getting detached from one node, and therefore it doesn’t get attached to the other node - this is something we’ve found very early in OpenEBS using the chaos experiment. And something has come up in the [unintelligible 00:53:49.06] to fix it today. OpenEBS has taken those fixes on board.
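
The multi-attach situation can be modelled with a toy Python class. This is a simplified illustration, not Kubernetes or OpenEBS code: `Volume`, `MultiAttachError`, and the node and volume names are hypothetical. The model only captures the rule that a single-writer volume cannot be attached to a second node while a stale attachment to the first node is still recorded.

```python
from typing import Optional


class MultiAttachError(Exception):
    """Raised when a single-writer volume is still attached to another node."""


class Volume:
    """Toy model of a volume that only one node may attach at a time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.attached_to: Optional[str] = None

    def attach(self, node: str) -> None:
        if self.attached_to not in (None, node):
            raise MultiAttachError(
                f"{self.name} is still attached to {self.attached_to}"
            )
        self.attached_to = node

    def detach(self) -> None:
        self.attached_to = None


vol = Volume("pg-data")
vol.attach("node-1")

# node-1 goes away without a clean detach, so the attachment record lingers.
try:
    vol.attach("node-2")
    outcome = "attached"
except MultiAttachError as err:
    outcome = f"stuck: {err}"

# Only after the stale attachment is cleared can the volume move.
vol.detach()
vol.attach("node-2")
print(outcome, "->", vol.attached_to)
```

A chaos experiment that removes a node is one way to find out whether the stale attachment gets cleared automatically or stays stuck.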

So I think both the application architecture, the data architecture becoming more distributed, as well as evolving chaos engineering practices will ensure that the adoption of databases into Kubernetes, as well as the general Kubernetes adoption itself will increase.

I think the most important point that resonates with me that you’ve made, Karthik, is around the different platforms having different recovery times. I think that’s really powerful, because if you are, for example, as we are, running on Linode, we cannot apply the same approaches that someone may be running on GCP, or someone running on AWS. Infrastructure matters a lot. So then how do you know how it behaves in your case? Well, one solution would be to maybe apply LitmusChaos and see how it behaves in practice. Also, not to mention that you do upgrades to your Kubernetes. Things improve most of the time, but sometimes they get worse. So how do you find out what got worse before rolling out in production and everything just failing over, and hopefully failing over? And other times just failing in unexpected ways. So how do you preempt some of that?

[55:14] And we all know that as much as we want to be confident from our staging experiments, the best failures happen in production. So as much as you can try to preempt things in staging, until you go into production, you won’t see it. So maybe trying to generate production-level load, if it’s possible? It’s not always possible. That would help.

So as a listener, if I had to remember one thing from this conversation, what would that be, Uma?

Yeah, so the last stage of reliability is to be able to confidently generate random triggers after you apply every change to your system in production. So you upgrade it, you have a good CI/CD system, and you apply the change in production, but also [unintelligible 00:56:11.16] to create a random fault because of that change. And if you are still confident, that means you are testing well. And it takes time. Chaos engineering, starting in some form in pre-production or in QA, it all helps reaching that goal, but always remember that unless you are doing that confidently, breaking things confidently, your systems are not reliable. You can just assume that they are reliable, but they’re not. So use chaos engineering as a friend.

What do you think, Karthik? Do you agree with that?

Doing chaos engineering in production is the ultimate stage, the Nirvana of a very mature practice that you’ve set up in your organization… So start small, and explore a lot of failures, and establish a culture of continuous chaos at all levels. Chaos has become more democratic, more ubiquitous nowadays. The philosophy of chaos has sort of percolated to all [unintelligible 00:57:20.00] like Uma said earlier, from developers, to QA engineers, to SREs.

So go ahead and perform chaos, and then you will be able to confidently deploy your applications and sleep better at night.

Thank you very much, Karthik, thank you very much, Uma. That was a great thought to end on. A very powerful one. So yeah, go forth and break things, that’s what we’re saying… In production, by the way. Because until you do that in production, it’s okay, but it’s not great. So for a proper challenge, the ultimate frontier some call it, go in production and break things and see how resilient your system really is… Because those are the real failures that matter, or the only failures that matter. You can learn from all the others, but the production ones are special. So the sooner you get there and the sooner you start applying these practices, as Uma and Karthik described, the better off you will be, the more resilient your system will be. And the system doesn’t mean your stack, it means the value that you deliver to the people that use your system.

Thank you, Uma, thank you, Karthik. It’s been a pleasure. I hope to see you again soon.

Thank you, Gerhard.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
