Ship It! – Episode #82

Red Hat's approach to SRE

with Narayanan Raghavan, Senior Director, SRE at Red Hat

Narayanan Raghavan leads the global SRE organization that runs Red Hat managed cloud services including OpenShift Dedicated, Azure Red Hat OpenShift, Red Hat OpenShift Service on AWS, and Red Hat OpenShift Data Science among others across the three major cloud providers: AWS, GCP & Azure. We start with a high-level discussion about DevOps, SRE & platform engineering, and then we dig into SRE specifics, including what it takes to safely roll out updates across many tens of thousands of OpenShift clusters.

Sponsors

Sourcegraph: Transform your code into a queryable database to create customizable visual dashboards in seconds. Sourcegraph recently launched Code Insights - now you can track what really matters to you and your team in your codebase. See how other teams are using this awesome feature at about.sourcegraph.com/code-insights

Raygun: Never miss another mission-critical issue again. Raygun Alerting is now available for Crash Reporting and Real User Monitoring, to make sure you are quickly notified of the errors, crashes, and front-end performance issues that matter most to you and your business. Set thresholds for your alert based on an increase in error count, a spike in load time, or new issues introduced in the latest deployment. Start your free 14-day trial at Raygun.com

Retool: The low-code platform for developers to build internal tools. Some of the best teams out there trust Retool… Brex, Coinbase, Plaid, Doordash, LegalGenius, Amazon, Allbirds, Peloton, and so many more – the developers at these teams trust Retool as the platform to build their internal tools. Try it free at retool.com/changelog

Chapters

1 00:00 Welcome
2 00:54 Sponsor: Sourcegraph
3 03:58 Intro
4 08:21 The 5 Principles
5 11:42 Red Hat
6 13:07 Look forward to your Mondays!
7 16:20 What DevOps means to me
8 18:58 DevOps with SRE
9 22:01 Thinking post-SRE
10 24:33 Sponsor: Raygun
11 26:40 DevOps and SRE mindsets
12 31:00 You gotta pick one role here
13 33:20 SRE in context of Red Hat
14 37:01 Those one-off bugs
15 45:50 How do companies adopt SRE
16 48:52 Failing is ok
17 50:29 That painting
18 51:37 Failing with a small blast radius
19 53:49 Sponsor: Retool
20 55:04 Tech to help SRE
21 57:53 SRE, but back then
22 59:29 Looking forward to 2023
23 1:01:03 Red Hat next year?
24 1:03:09 These people! Come work at Red Hat!
25 1:04:10 Wrap up and key takeaways
26 1:06:52 Outro

Transcript

Narayanan, welcome to Ship It.

Thank you. Glad to be here.

We had a small conversation before we started recording, and your question was so good that I wish we started recording earlier. So I want to try and redo that, because I think your answer was amazing, and it was one of the simplest things… What is top of your mind?

Top of my mind is how do we scale a team, especially a team that operates a fleet, a large fleet across three different hyperscalers, with Azure, Google and AWS - how do we keep that fleet in sync? How do we keep our people in sync, how do we grow our people in that same environment? And how do we make sure that we’re operating it in a consistent manner every single time? So for me, scaling our people, our processes, our technologies - that’s top of mind.

So this doesn’t sound like a fresh thought; this sounds like something big that you’ve been thinking about for some time now… Something that’s really complex, and will take a while to get to a good answer. Am I right?

Absolutely. Our journey has been six years in the making, and we’ve come a long way. We’re in a very interesting position; I’d consider what we do as one of the only companies out there that’s providing a fully managed offering with OpenShift across the three cloud providers, and from an individual perspective, from an engineer perspective, that actually becomes pretty attractive. If you think about it, as an engineer, I get to work on not just OpenShift and Kubernetes, I get to work on the three big cloud providers as well.

What does that individual even look like? Because it sounds like they have to be a fairly special person, I think. They have to be knowledgeable, they have to be self-driven… So many attributes come to mind.

So this is an interesting question, because for me, I’m not looking for the perfect individual. When I’m going out there and hiring for SREs, I’m not looking for the perfect fit. Because if I hire for the perfect fit, that person is going to get bored, and he or she is probably going to look for other opportunities three months down the road. The person I’m looking for is someone who has potential; potential to learn, potential to pick up new things, is very flexible… And when I say flexible, mentally flexible, and eager to go explore, whether it’s exploring technologies on the AWS front and OpenShift, or going and working with upstream communities. And someone who can communicate, and communicate effectively.

Because everyone’s remote. That makes a big difference.

Exactly. Everyone’s remote. And the technical skills matter as well, but in many ways, I can spend the time and invest in that individual to get them up to speed on the technology front. Now, granted, you need the basics, obviously. Basics with Linux, basics with object-oriented programming etc. But once you get past the basics, I’m looking for an individual who can operate up and down the stack; I’m also looking for an individual who can actually empathize with customers. And I think that’s a very important attribute in this day and age.

[07:54] Yeah. I think remote, if anything, made the human contact more important, the empathy more important, because we’re no longer there in person, and a lot gets lost. The nonverbal communication - there are so many cues that the rest of your body, which you normally don’t see on a call, gives away to others; and not having that - it’s really difficult to know when someone is tense, for example, when someone’s uncomfortable, because you just see their face, right?

100% agree. And I think this is where the team culture comes into play as well. Within my team, we’ve got five principles, the principles that uphold our team culture, so to speak. First and foremost, it is okay for us to fail. I can’t tell you the number of times I’ve gone and apologized to somebody, a team or a customer, because we messed up something, acknowledging that we’re humans, acknowledging that we’re going to make mistakes, acknowledging that we’re imperfect beings… That is very, very important for the team, because that’s what promotes learning.

The second big principle for us is assuming positive intent. It’s easy to say it, but the example I give my team is, sometimes not everything can be shared; there’s going to be confidential information, but trust that I’m doing the best I can for the team, trust that my management teams, and their managers etc. they’re all doing what’s right for the team, for the company etc. And if we have that kind of an attitude, then we can actually go about doing what’s important, that we need to be individually focused on.

The third principle is starting with trust, and extending that trust. Related to assuming positive intent, but more so around encouraging curiosity, asking questions, not making assumptions.

The fourth principle for us is disagree and commit. We’re in the technology space, and we’re going to have multiple different ways of solving a particular problem. It’s okay; let’s make sure there’s no analysis paralysis. Let’s disagree, commit, pick a solution, pick a path, and if that path is not right - guess what? These are bits and bytes; let’s rearrange the bits and bytes, and go figure out what the right approach is.

So giving people that freedom, giving people the ability to fail, and giving people the ability to learn, more importantly, from those failures - that becomes important. And the last thing is communicate. And communication goes both up and down the stack, and it happens with everybody. So I’d rather over-communicate than under-communicate. And I tell my teams, if you step on somebody’s toes, just because you’re over-communicating, that’s a good thing; it doesn’t mean you’re trying to take over my turf etc. It’s a good thing, because we’re all trying to cover and work with each other, and that benefits the larger org.

I’m sensing a lot of experience behind some very simple things… Like, if you listen to them, they may sound simple, but they have been hard-earned. And that is a correct word. You’ve earned them through a lot of situations. Do you know which is my favorite expression that you’ve been using a couple of times? I think it’s the essence of the person that you are. “It’s okay.” Because you’ve seen enough to know that it doesn’t matter what happens, it’s okay. We’ll figure it out. You’ve been through this enough times to trust that we’ll figure it out, and it’s going to be okay. That’s a big one.

Okay. I would like to put for our listeners a little bit into perspective what you’ve just said. So you’ve been with Red Hat for over 15 years now. That’s a long time. And you’ve been pretty much in the middle of the IT industry; you must have seen and you must have been part of many changes over the years. And yet, there is one constant - Red Hat. So I’ll start there. Why Red Hat?

[12:07] I’d summarize that in one sentence. I look forward to Mondays. Period. Full stop.

Why? [laughs]

Because I can go back to work. I’ve been with Red Hat for 15 years because I continue to learn. I’ve been fortunate enough to have had different opportunities, worked on different technologies, different teams. And this is true for every individual within Red Hat. So for me, I summarize it when somebody asks me “Why Red Hat?” I basically say, “I look forward to Mondays.” When that statement is not true, I need to go find something else. Because for me, the looking forward to Mondays is the passion that I bring to the table. Because if I don’t have the passion, if my team doesn’t have that passion, we should probably be in a different role, different business, different environment.

Okay. Wow, that’s a good one. Now, I’m sure that over those 15 years there must have been Mondays when you have not been looking forward to them. How did you negotiate those weeks, or what happened when you didn’t feel like Monday, for whatever reason? What happened afterwards? Because then there were Mondays that you were looking forward to. So something must have happened there.

So I personally, as a person, I derive a lot of my energy from the people I work with. Again, going back to the team principles I talked about, if that wasn’t true, if I don’t trust the people I work with, my experience at a company, any company for that matter - that fundamentally changes. I can create that environment as a manager where people actually enjoy coming in on Mondays. Yes, we will have our ups and downs. Yes, there will be outages. Yes, we’ll have some failures. That’s occupational hazard.

Let me guess… It’s okay? [laughs]

It’s okay to fail. But again, that’s occupational hazard. I’m in the business where things are going to fail. Fine, we’ll fix it. We’ll learn from it. It’s software. There are going to be bugs. You cannot expect software with zero bugs. So once you reconcile with that fact, I can ruin my Monday, or I can still be passionate about it, because I’m working with some awesome, cool people, really smart people… So why not?

Okay, interesting. So would you describe yourself as an optimist?

I would probably say that, yeah. Sure.

Okay. Okay. I see where this is going. Okay. Okay. That is very important. I know that not everyone is, and some people, especially with your experience, tend to get jaded. Now, you’ve seen every which way of failure, every which way of complication… And if you’re not an optimist, you’d think “Ah, things were better in the past. Things are too complex today. There’s too much change; things are like too quickly changing. It’s accelerating.” And I think the attitude makes a huge difference.

It’s true. And I think as a leader, it’s also important, because the attitude I bring to the table is going to be reflected within my teams. And if there’s one thing that I’ve learned - you cannot give away what you don’t have. So if you’re expecting positive attitude from the rest of the team, and you don’t espouse it, you cannot give that away to people, right? So you cannot give away what you don’t have. So from a mental – mentally, I need to realign how I think, how I approach it, because if my mindset is not in the right place, then I’m not doing justice to my teams.

[16:02] Now, I know that you’re currently leading the global Red Hat SRE org, that manages OpenShift Dedicated across the three major cloud providers. We already mentioned that - AWS, GCP and Azure. So before we dig into specifics, I would like to keep it high-level, but still meaningful to you. So with that, what does DevOps mean to you?

At a high level, DevOps is cultural change, a movement, so to speak. Now, I can also call it an interface, where you have developers and operators working a lot more closely with each other than they have traditionally in the past. It’s trying to define a set of practices, a set of processes, if you may, obviously, implemented in different ways across different companies, to bring people together, to bridge this so-called wall that exists between Dev and Ops. Dev wants to go fast, Ops wants to focus on stability… DevOps is a way to bring that empathy that I was talking about earlier; it’s a way to make sure that developers are not just building a product and throwing it over the fence, they understand what it really means to actually run it. Operators, on the other hand, understand what it means to build a product, and why it is important to put out features etc.

I see when you said operators, you meant developers, right?

Yeah, engineers, or system administrators.

Okay. Okay. How does SRE – well, actually, before I ask that, what does SRE mean to you now?

So SRE, or Site Reliability Engineering, is basically focusing on scale. It’s focusing on reliability. It’s focusing on safety. It’s focusing on making sure that we’re building systems that – I think Google would call it building systems that are automatic… But building systems that can scale, can self-heal, but you’re doing that so you can actually balance outages and incidents that might happen against the innovation that’s required to keep your business running. It also brings in a data-driven mindset, with SLOs and error budgets, so you’re not emotional about a particular topic. You’ve got data to back things up. So I think that becomes important, too.
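As a rough illustration of the data-driven side mentioned here, an error budget is simply the amount of unreliability an availability SLO leaves room for over a time window. The sketch below is not Red Hat's tooling; the function names, the 30-day window, and the 99.9% target are assumptions chosen only to show the arithmetic.

```python
# Hypothetical helpers illustrating SLO / error-budget arithmetic.
# The names, window, and target below are assumptions, not a real API.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining_fraction(slo_target: float, window_days: int,
                              downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves about 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999, 30), 1))
    # After 21.6 minutes of downtime, about half of the budget is left.
    print(round(budget_remaining_fraction(0.999, 30, 21.6), 2))
```

With numbers like these, a conversation about whether to ship a risky change can point at how much budget remains instead of at how anyone feels about the last incident.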

Yeah, that’s a good one. Yeah. I don’t think it went down; I know it went down. I know why it went down. I can find it, and I can understand it, I can explain it so that everyone understands what happened. Yeah, that’s an important one. Okay. Okay. Is there a relationship between DevOps and SRE in your mind?

I think the relationship - you can look at it in two different ways. SRE could either be an implementation of DevOps - great. But SRE is also an evolution from the land where – so DevOps evolved because you’ve got… I’ve got hundreds of microservices that are in production, need monitoring, need care and feeding, etc. No single person can actually manage it, and this shift from the big monolithic to the microservices world went from DevOps to the SRE world, right? So the DevOps world back in the day started with – you’ve got the big monolithic applications, and you need people to actually come together to understand what’s happening, and then the microservices architecture started to break out those big monolithic services into tons of microservices. And when that happened, it quickly became apparent that no single individual can actually keep up with it. How do you actually manage a fleet of services, at scale, reliably, in a way that makes sense for everybody, both from a business perspective and from a security and reliability perspective?

[20:18] I think the evolution of DevOps – what was it, 2007? And then this shift to SREs, right around the time when Google put out their book etc. - it was more from, at least in my mindset, from a realization that you’ve got big monolithic services that’s starting to change to microservices; the microservices world is exploding, because you’ve got everybody building and putting up microservices… How do you manage that? How do you manage that at scale?

And then a way to actually tie back - and this is one of my favorites with the SRE model, is being able to tie back using data, so everybody in the organization, not just your SRE team, but the developers, the product managers etc. can understand the state of the system at any given point in time. And I think that fundamentally becomes a lot more important as well.

Another analogy I’ll say with SRE is everybody has skin in the game. You’re not just looking at the different layers of the cake, so to speak, you’re looking at the entire stack. So just because the networking layer failed, the networking team was always blamed, almost always… Just because the networking layer failed doesn’t mean it’s not my problem. It’s my burden, everybody’s burden. It’s my responsibility to make sure the business is successful… And how do I make the business successful? I make it successful by making sure the entire stack is up and running, versus, you know, my layer is up and running, and you know, it’s not my problem, I’m walking out.

Yeah. That’s a good one. Do you think that this holistic view, this experience-driven view, is continuing post-SRE? And I’m thinking UX, UI, feedback to improvements… So it’s not just about the service that’s out there, but the service that could be and maybe isn’t, because certain things are missing. Do you think there’s something post SRE, that includes more people?

Yeah, so an interesting thing that I’m starting to see happen within my own teams, within the company etc. is we’re all realizing that as we have more services coming up - and I’m seeing this in the industry as well - people are realizing there are actually some building blocks; there are actually some key capabilities that every service needs… Things like observability, monitoring, alerting, secrets management - every service needs them. The list of those building blocks is actually fairly broad. So the evolution from – we’ll still need SREs, but the need for platform engineering starts to come up. How do I put out a set of capabilities, a set of building blocks that are common, consistent across all of my internal teams? How do I make it so they can plug and play into those building blocks to accelerate the pace with which they’re developing services?

Yeah, that’s a good one.

And I think that’s the evolution. Platform engineering can be people from SRE, it can be people from development teams. And I say that because SREs get to see a broad spectrum of things. They understand how the interdependencies actually come to life… So they come in with a certain mindset into this space, and then we also see developers who are looking at this from a pure development perspective to also come into this space. So I think it’s a good mix of people and characteristics that they bring to the table, that is going to make up that space.

Some years ago we had the DevOps engineer; that was like a very trendy role, very trendy title to have… Like, “I’m doing DevOps.” “What is DevOps?” “It depends…” Right? That’s how conversations start, many good ones, and also bad ones… Can someone be a DevOps engineer, and an SRE engineer, and a platform engineer? What do you think? How do you see those different mindsets, I think…? Because it’s not even roles. I mean, maybe SRE, that’s like a bit more clear-cut, and platform engineer as well. But the DevOps one I think it’s, as you mentioned, more about the culture; it’s more about the mindset, less about the specific implementation.

So I almost want to draw a Venn diagram with overlapping circles… But there’s a little bit of an overlap between DevOps and SREs, there’s an overlap between SRE and platform engineering… And I think it’s important to recognize it. It’s important to recognize it because depending on the team, depending on the company, depending on the products that the company has etc. no one person can (it’s just not humanly possible) remember and keep everything in their head. We should be way past the hero worship; we should be way past knowledge in silos. If we still have knowledge silos, then we have a different problem.

Now, assuming that’s not the case, no one person can wear all three hats. They may be capable; it doesn’t mean it makes sense, both from a work/life balance perspective, or even from an ability to actually do justice to the things that start to matter.

So for example, somebody in the DevOps role, I would want them to focus on making sure that they have good CI/CD practices, good build pipelines, release hygiene etc. Somebody in the SRE role, I would want to see the ability to create systems that can scale, the ability to make sure that you’re not snowflaking your entire fleet, and making sure that the changes that you push out from a fleet perspective, and the capabilities that you put out actually work for the entire fleet, and being able to think about security in the same way as you think about reliability.

From a platform engineering perspective, this for me is being able to think purely from a building blocks perspective, to say, “What are some common things across all my engineering teams, all of the development teams? And what is it that I need to focus on to make sure they are successful?”

Now, some of those common pieces might land with an SRE team. Some of those common pieces, like internal developer tooling, for example - it doesn’t make sense for it to land with an SRE team. There are services that you might use in production that need to scale, that might land with an SRE team. There are services that you need for development purposes for my day-to-day job; it might land with a pure platform engineering team. So at least in my head, that’s how I separate it out.

Do you see yourself as an SRE person, a platform person, or a DevOps person?

[30:07] I see myself as a person, as a manager, as a people manager, that I can understand the space that I’m operating in. Obviously, I need to be passionate about this space I operate in, and be able to make sure that I can message that to my teams, to my stakeholders, customers etc. So personally, for me at least, it’s less about, you know, I’m a platform engineer, or an SRE, or a DevOps engineer… It’s that I can empathize with what the business wants, how do I translate it into systems, processes, and software components that actually matter to the business?

So high-level. Very high-level. And you must have basically all perspectives, and not just these; beyond those as well. What about people that, for example - as you mentioned, it’s very difficult, maybe to the point not even recommended for one person to be doing all three things… Do you think it’s important for people to realize where or what they enjoy doing the most, so that they’re most effective, and they’re also most happy in their roles? What do you think?

I think in many ways it’s a no-brainer, it’s a rhetorical yes. An emphatic yes kind of response. “If you’re not enjoying what you’re doing, if you’re not passionate about what you’re doing, why are you doing it?” kind of question. And I say that especially in the software field, engineers put their hats on, and their tunnel vision, and they’re off into a lot of fun things, and it’s important for us to enjoy it, because I’m spending more time with the code that I’m writing, with the monitor that I’m staring at, versus significant others etc. So it’s important for that reason; it’s also important because the more empathy that you build towards the business needs, your customers, you’re actually putting out a product that actually has all that built-in. Customers will actually enjoy working with that product, because the person who’s built it is pouring his heart and soul into it, and then they understand the customer’s use case, and they’re going to build a fantastic product. So I think they’re very much interrelated. And for me, going back to the hiring comment, finding someone who’s passionate, finding someone who wants to learn, finding someone who’s eager, and who questions things, and who puts me in the hot seat and says, “Why did you do that?” Those are good things. I’m never going to say, “How dare you question your manager?” It doesn’t work. I want them to question me, because I don’t get it right. I’m not perfect. It’s okay to fail.

We keep coming back to that, right? How approaching and seeing failure as an opportunity to learn just opens up so many other opportunities that you would miss if you had a different approach to it. Okay. So as we’re starting to come from a high level to a lower level, let’s stick with SRE, because I know that’s in your title, and I’m assuming that you really enjoy that stuff. What does SRE mean in the context of Red Hat?

[33:34] So in the context of Red Hat, SRE, especially with managed services, first and foremost we are - Red Hat as a company - putting out managed services, with OpenShift as a managed service. We have a first-party offering with AWS called Red Hat OpenShift Service on AWS. We have a first-party offering with Microsoft called Azure Red Hat OpenShift, and we have OpenShift Dedicated as well. So we are actually running OpenShift as a managed offering for our customers, in their accounts, in the customer’s AWS account, in the customer’s Google account, in the customer’s Microsoft account etc.

Now, it’s not just the platform; we’re also running other managed services, whether it’s our Kafka service, or our data science service. So for us, and for me in particular, what SRE means is “How do I build and scale a team that cuts across all of these products that Red Hat is putting out, and all of these services that Red Hat is putting out in the cloud, and being able to manage it in a consistent fashion?” Because running thousands of these services - it can quickly go to different extremes, where you’re customizing one service by hand, and doing something else with a random script etc. All of that - again, going back to my Top of Mind comment - how do I make sure that we provide a consistent experience to all of our customers, in a way that it can also scale for us as a service provider?

And how do you? How do you do that?

Well, with good hygiene, right? With good systems automation in place… A lot of self-service, kind of going back to the building blocks… As SRE teams, we observe a lot of commonalities across different services, that we start to wonder “What if we expose these services to not just our internal teams, but the customers? What if we expose it to our partners?” How can we enable our entire ecosystem - not just our customers, but our partners and our cloud providers? We’ve got deep relationships with our cloud providers; I’ve had cases where we’ve contributed code back to the cloud providers, we’ve contributed code to our partners, we’ve contributed code to upstream communities… But being able to do that across that entire spectrum - that makes a big difference. It makes a big difference, again, from an impact perspective as an associate; that is pretty impactful. It makes a difference from a customer perspective, because the experience that the customer has running their workloads on OpenShift is pretty seamless… Because whether they go to a third-party vendor, or their own homegrown workloads, the fact that those workloads can run seamlessly on OpenShift makes a big difference. So it’s no longer about a – it’s never been, and no longer is about a naked cluster, so to speak. It’s about “How do we enable those workloads on top?”

One thing which I’m very curious about is – because you have first-hand experience with large production workloads… How do you deal with issues that only exist in a specific environment, because some specific things, like the planets align a certain way, and it’s like a one-off? Like the Heisenbug - it only happens in that specific environment. How do you deal with those things? Because there must be quite a few at your scale, with all the different production systems that you run?

[37:35] That’s a great question. I’ll give you a specific example. So a few years ago, we actually ran into a kernel bug; the issue manifested itself on AWS and nowhere else. Just on AWS, on their M4 instance types. We were able to quickly get alerted on it, and identify that it was a kernel issue… And we actually went back out to the upstream kernel community and the Red Hat kernel engineering team, to actually show them that this is a bug, and it manifested itself at scale in this case, in a peculiar way, on Amazon, on the M4 instances. Two things happened right after that. First, we actually ended up working with Amazon to let them know this was hitting us, so heads up, it might be hitting other customers etc. The second part was working with our kernel team to go, “Does it manifest itself in other places?”, but for some weird reason, it hadn’t. Come to find out the upstream community in this particular case, they hadn’t run into that particular bug. And when we came back with “Here are steps to reproduce it, here’s a test environment that you can try it on” etc. I think it started clicking with engineers that are way smarter than I am to say, “Yes, this is a hard bug, and it will manifest itself on all cloud providers. We just have been lucky that it hasn’t shown up yet.” So we were able to patch it, fix it, roll out the fixes before customers even realized that there was an issue. So for us, going back to how we manage the fleet - we do not want a snowflake fleet.

That’s very reassuring.

Exactly. And so everything we think about, every solution we put out there, we ask ourselves, “Does our fleet benefit from this? Do our partners benefit from this? Do our customers benefit from this?” and we push it out to the entire fleet.

The reason why that’s so reassuring is because – in my experience, a few examples… When you go with a bug to a vendor, and they say “We’ve never seen this one before.” And that usually means only one thing. The resolution will be very long. It may never happen. You will most likely lose patience waiting for it and you will just move on… Or forget about it, or work around it, or things like that. It’s very reassuring to know that you have a different approach, and you’re able to roll it out before customers notice. Now, how long does that mean time-wise? Either they’re not looking, or you’re really fast. Because it can mean both things.

So it’s an interesting question, because it depends, to be honest. But before I answer your specific question, I’ll have to walk you through how we do upgrades, because this ties into that. We give our customers the option to upgrade whenever they want to upgrade. So in fact, we give them three options. We tell them, “If you want to upgrade now, here’s the big, red, easy button. Click it. If you want to schedule your upgrade on December 31 at 3am UTC, have at it. Or if you want to automatically upgrade your cluster every time there is a new version of OpenShift available, you can actually set it up like your smartphone, so every time there’s a new version that’s available, you pick the day and time, the cluster will automatically upgrade during that day and time.”
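The three upgrade options described here (upgrade now, upgrade at a chosen date and time, or upgrade automatically in a recurring weekly slot whenever a new version exists) can be modeled in a few lines. This is only a sketch of the idea; `UpgradeMode`, `UpgradePolicy` and `next_upgrade_time` are hypothetical names, not part of any OpenShift API, and the weekly-slot interpretation of "you pick the day and time" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class UpgradeMode(Enum):
    NOW = "now"              # the "big, red, easy button"
    SCHEDULED = "scheduled"  # one specific date and time
    AUTOMATIC = "automatic"  # recurring weekly slot, used when a new version exists


@dataclass
class UpgradePolicy:
    """Hypothetical model of a customer's upgrade preference."""
    mode: UpgradeMode
    scheduled_at: Optional[datetime] = None  # only used by SCHEDULED
    weekday: Optional[int] = None            # only used by AUTOMATIC (Monday = 0)
    hour_utc: Optional[int] = None           # only used by AUTOMATIC


def next_upgrade_time(policy: UpgradePolicy, now: datetime,
                      new_version_available: bool) -> Optional[datetime]:
    """Return when the cluster should next upgrade, or None if nothing is due."""
    if policy.mode is UpgradeMode.NOW:
        return now
    if policy.mode is UpgradeMode.SCHEDULED:
        return policy.scheduled_at
    # AUTOMATIC: only act when there is something to upgrade to.
    if not new_version_available:
        return None
    slot = now.replace(hour=policy.hour_utc, minute=0, second=0, microsecond=0)
    slot += timedelta(days=(policy.weekday - now.weekday()) % 7)
    if slot < now:  # this week's slot already passed, use next week's
        slot += timedelta(days=7)
    return slot


if __name__ == "__main__":
    now = datetime(2022, 12, 28, 12, 0, tzinfo=timezone.utc)
    scheduled = UpgradePolicy(UpgradeMode.SCHEDULED,
                              scheduled_at=datetime(2022, 12, 31, 3, 0, tzinfo=timezone.utc))
    print(next_upgrade_time(scheduled, now, new_version_available=True))
```

Keeping the decision in one small pure function makes it easy to test each mode in isolation, which matters when the same logic has to behave identically across a whole fleet.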

Now, when that happens - and most of our customers set it up for automatic upgrades, which is great… But when that happens, one, when there is a, you know, CVE, or there’s a kernel bug, or what have you, we inform customers to say “Heads up. This is happening.” Because again, communication, communication, communication. It’s always key. So we make sure that customers are aware that this is happening. Customers have the option then to say “I want to upgrade my cluster right now, because I have an upgrade available. I have a fix available.” But in some cases, not everything requires an upgrade. And if it doesn’t require an upgrade, we patch it behind the scenes and push the change throughout the entire fleet.

[42:01] The catch with either customers scheduling an upgrade, or us pushing a change through the entire fleet, with either one of them, we generally advocate and train customers to say, “Follow best practices. Follow Kubernetes best practices. Follow cloud provider best practices etc”, because that’s going to help them in the long run.

So for example, if you don’t have pods with requests and limits set - well it’s an anti-pattern. Go fix it. If you have a replica count of one - that doesn’t make sense, because you’re gonna have an outage when we do an upgrade. So I actually tell our customers, “Don’t worry about the upgrade. The upgrade can take two minutes, the upgrade can take two weeks. It doesn’t matter. Worry about your workloads.” Because as a managed service, it’s my job to worry about the upgrade, right?

So as long as you’re following best practices, the upgrade will happen. We will guarantee that the upgrade happens. As long as you’re following best practices, your workloads are not impacted; you won’t notice a thing. Because OpenShift and Kubernetes are going to manage scheduling your workloads, moving your pods around etc. So you don’t experience downtime because of that.
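
As a rough illustration of the two anti-patterns mentioned above (a replica count of one, and pods without requests and limits), here is a minimal Python sketch that inspects a Deployment-shaped dictionary. The field names follow the standard Kubernetes Deployment layout; the function name `find_upgrade_risks` and the example workload are hypothetical and are not part of any Red Hat tooling.

```python
from typing import Any


def find_upgrade_risks(deployment: dict[str, Any]) -> list[str]:
    """Return warnings for a Deployment-shaped dict (hypothetical helper)."""
    risks: list[str] = []
    spec = deployment.get("spec", {})

    # Kubernetes defaults to a single replica when the field is omitted.
    if spec.get("replicas", 1) < 2:
        risks.append("replicas < 2: the workload goes down while its node drains")

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        resources = container.get("resources", {})
        for field in ("requests", "limits"):
            # A missing or empty section counts as "not set".
            if not resources.get(field):
                name = container.get("name", "?")
                risks.append(f"container {name!r} has no resources.{field}")
    return risks


# Illustrative workload: one replica, requests set, limits missing.
example = {
    "spec": {
        "replicas": 1,
        "template": {
            "spec": {
                "containers": [
                    {"name": "web", "resources": {"requests": {"cpu": "100m"}}}
                ]
            }
        },
    }
}

for warning in find_upgrade_risks(example):
    print(warning)
```

A check like this could run in CI before a workload is deployed, so the problems surface before an upgrade does.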

That makes a lot of sense, because that explains big, important things are no different to regular things. Or like CVEs, which just basically come up, right? It doesn’t take you a lot of time, you just consume them, and you want to push them through the system as soon as possible, so that everyone is safe, and secure, and all of that. And if you work on upgrade hygiene - because that’s how I translate it - then everything else will kind of take care of itself. If you have a good upgrade system, nothing’s an issue. It will just go through.

Right. And we actually encourage customers to upgrade frequently. Don’t wait for [unintelligible 00:43:58.20] releases. Upgrade the Z-stream patch releases that come out. Because when you’re upgrading constantly, at least – I’ll use this analogy; sometimes upgrades can be a chaos engineering test, in many ways… But if you do it frequently enough, you’ve actually found all the kinks in the armor to say “These are things I need to work on. These are things that will fail during an upgrade.” So that’s a big benefit for workloads, and then doing it frequently enough gives you those patches, gives you those fixes, gives you those features that you’re looking for, that your next hop, the next upgrade is a smaller hop. It’s not a big hop; so the smaller the hop, the easier it is, the faster the upgrade etc.

I think this is counterintuitive, but super-important, because it almost feels like a fundamental. How do you make production more stable? You push out more deploys. Simple as.

That’s it. How do you make a system more stable? You push out more upgrades. And you do it so often that when mistakes do make it out - and it will happen, by the way; not upgrading will not save you from mistakes, it will just make them bigger when it happens… Because you have a good system that everything happens very quickly, you can push your fix before you realize that you have a problem. And that’s one of my favorite ways of dealing with failure. How do you make it safe to fail? It’s okay to fail. You’ll have a fix in no time. Super-simple.

It’s gone full circle.

Pretty much. Pretty much. Okay. Okay. Now, SRE - it’s so much more than a managed service. It’s so much more than a specific technology… It’s, I would say, even so much more than just a process. Now, I think that by now most of our listeners know that SRE is a good thing, and you want to have it, you know, depending on your appetite, depending on what other things you have going on… But it is a good thing. How do companies adopt SRE? What does that look like?

[46:09] That’s a hard question, to be honest. For me personally, I’ll tell you how our journey started about six, seven years ago at Red Hat… It was a new concept, trying to build this space out; when I started, two, three people on the team, and we were trying to build out the SRE function. It’s a new concept. It’s hard. I say it’s hard, because it’s not hard from a technical perspective, I think it’s more hard because of the inertia that exists within organizations. It’s hard because people don’t want to fail. I think bluntly, it’s hard because people don’t want to fail.

So first and foremost, I think what clicked for us is making sure that you have the executive support to fail; to try things, to fail. I keep harping on the failure part, but it is so critical to building that culture that’s important for SREs. That executive support is absolutely vital. And for us, it started with buy-in, first and foremost. So making sure we had the executive buy-in to say “Try it out. Let’s see what happens.” That shifted to support, to say “We’re going to support you with additional budget, people etc.” Now we’re in a state where we’re actively getting executive engagement, right? And that engagement shows up as conversations around SLOs, conversations around error budgets, conversations around what happens when error budgets are breached.
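
To make the error budget conversations mentioned above concrete, here is a small Python sketch of the usual arithmetic: an availability SLO over a time window leaves a fixed allowance of downtime, and the budget is breached once measured downtime exceeds it. The 99.9% target, the 30-day window and the function names are assumptions chosen for illustration; the episode does not state Red Hat's actual SLOs.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Positive: budget left. Negative: the budget has been breached."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# Assumed example: a 99.9% SLO over 30 days, with 50 minutes of downtime so far.
budget = error_budget_minutes(0.999)
remaining = budget_remaining(0.999, downtime_minutes=50.0)
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
if remaining < 0:
    print("error budget breached")
```

What happens after a breach (pausing feature rollouts, for example) is a policy decision rather than a calculation, which is why it shows up as an executive conversation.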

That for us was a first step, making sure you have executive buy-in and support. That was our first step. Once we had that, for us the next piece was “Who do we hire? The type of people that we hire. What are we looking for? Who are we looking for? What should our hiring processes be? What should our interview process be? How do we make sure that we’re hiring for people who are in many ways not the perfect fit?” Again, I want people that are willing and able to learn, because I want to invest in them, so they can invest back in the company.

So your hiring practices, your onboarding practices make a big difference as well. How do we onboard people? Do I throw them right in the middle of the fire? No, that doesn’t really cut it. How do I make sure that we give them the training, the exposure necessary so they are successful?

I think that is a very important point… And I know there have been a couple of hard questions; I promise the next one will be an easy one… But I have to ask this, because it’s very relevant to the point that you’ve just made. How do you build teams where it’s safe to fail? Where failing is an opportunity to learn, it’s not a failure. It’s not a mistake, it’s a good thing. How do you do that?

People often celebrate successes. Well, we should start celebrating failures. Take the team out for lunch when you fail at something. Talk about it. Poke fun at it. Get to know what failed. We all try to be the best we can when we’re at work. But for me, the irony is we’re spending more time with our peers at work than we spend with our significant others. If you ask my wife, she will probably give you a laundry list of “Here are all my drawbacks and all the issues that I have etc.” But we are scared to do that at work.

[49:44] In many ways, for me a team is a group of people, they’re all different people, puzzle pieces that fit well together. And that fitting well means I need to know what your strengths and weaknesses are, so I can offset your weaknesses, so you can offset mine. And for that to happen, I need to be comfortable with you. For that to happen, I need to trust you. For that to happen, I need to assume positive intent. For that to happen, I need to be able to fail, and learn from those failures… Because when you put people through those failures, and they come out of it, you’re gonna come out of it a lot stronger than, you know, anybody expects, or even thinks is possible.

That’s a great answer. I have a follow-up one, but a promise is a promise. What is the story of that Leonid Afremov painting behind you?

I had a different picture in my background… It was actually a 10-by-15 feet space poster stuck to the wall and stuck to the ceiling. I like to kind of sit in the dark, and that big poster stuck to the wall made it even darker, and made it look like I was sitting in a chair in space… It was pretty cool.

That didn’t go well with my wife, and she said “I’m gonna buy you something”, and this is what happened.

Nice. That’s a good one. Okay. Okay. Is this your first one, or do you know about the artist?

No. I actually read up on the artist, but I don’t recall.

Okay. So thank your wife. Thank your wife, for making your room – this is your home office, I’m assuming, right?

This is my home office, yes.

Okay. It’s very nice. It looks very nice, and clean, and even artistic, I would say… Especially that painting, it makes it look very expensive, so it’s very nice. Okay. The follow-up question was around blast radius. I know as SREs we think a lot about resilient systems, how to design and optimize for failure, the systems that fail and components that fail… Is there something to be applied to teams? Is there some way of making it safe for people to fail without creating a lot of problems for everything? I mean, I’m thinking rm -rf /, or the equivalent of that on a database which has no backups… You see where I’m going with this. So configs that are pushed to CDNs and they take half the internet down… It doesn’t happen often, but it happens, and then everyone’s affected, and people have a bad day, or many people have a bad day… So is there something to be said about how to optimize for a blast radius that doesn’t take everything down when someone makes a mistake?

Two things: canary releases, idempotency. Both of them are absolutely vital for making sure the blast radius is minimized. As a company, we run some of our own critical systems, including redhat.com, [unintelligible 00:52:48.25] on top of our managed services. We consume release candidates, OpenShift bits, early on in the lifecycle, and actually roll it out to our internal systems, and in some cases, to our critical production systems that are running some of those applications… Partly because we don’t want customers to suffer, partly because we want to make sure we feel the pain first, and we can actually catch it before the product goes GA.

But being able to do those canary releases is important. The other part obviously is idempotency. Being able to do the same thing over and over again, get the same result, and making sure that it’s consistent also goes a long way in terms of how we approach things.
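
The two ideas above can be sketched together in a few lines of Python: an idempotent apply that gives the same result no matter how many times it runs, and a wave-by-wave rollout that updates internal clusters first and stops before reaching customers if a wave looks unhealthy. All names here (`apply_config`, `canary_rollout`, the cluster names and the health check) are hypothetical; this is a simplified sketch of the pattern, not Red Hat's pipeline.

```python
from typing import Callable

Config = dict[str, str]


def apply_config(state: Config, desired: Config) -> bool:
    """Idempotent apply: make state equal desired; report whether it changed."""
    if state == desired:
        return False  # Already converged, so running again is a no-op.
    state.clear()
    state.update(desired)
    return True


def canary_rollout(
    waves: list[list[str]],
    fleet: dict[str, Config],
    desired: Config,
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Apply desired wave by wave; stop after the first unhealthy wave."""
    updated: list[str] = []
    for wave in waves:
        for name in wave:
            apply_config(fleet[name], desired)
            updated.append(name)
        # Limit the blast radius: later waves are never touched on failure.
        if not all(is_healthy(name) for name in wave):
            break
    return updated


fleet = {
    "internal": {"version": "1"},
    "cust-a": {"version": "1"},
    "cust-b": {"version": "1"},
}
waves = [["internal"], ["cust-a", "cust-b"]]

# Assumed health check for the example: the internal canary fails.
touched = canary_rollout(waves, fleet, {"version": "2"}, lambda name: name != "internal")
print(touched, fleet["cust-a"])
```

Because the apply step is idempotent, re-running the rollout after a fix is safe: clusters that already converged are left alone.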

Going a bit more specific - and I’m thinking about technology now, and I’m thinking not just OpenShift, but also the wider cloud-native ecosystem… Are there technologies that you think help adopt, and then maintain good SRE practices?

That’s a great question. Are there technologies to help adopt and maintain SRE practices…? I don’t know if there’s a – at least in my head, I’m not thinking about SRE as a set of technologies to adopt and maintain. So you can’t magically – I buy licenses for a particular product, and I’m magically an SRE team. I don’t think that’s a thing. So my focus is more on the people side, on the culture side, the mindsets that I want to develop, versus anything else.

What about systems which are declarative, for example, or systems which are idempotent by default, or highly functional, which don’t have such, or try to minimize side effects? GitOps comes to mind; making sure that whatever you change, it forces you to have a certain approach towards things, and that just ripples through everything. And I don’t think it always is the technology, but I think it’s like a group…

What I would say is – I’ll kind of go back to my previous comment about… I still don’t believe it’s a technology that enables SRE, by any means. However, GitOps and declarative technologies, products that provide declarative functions - they will help with going back to canary releases; they will help with the blast radius.

So we have a GitOps pipeline that we use, that we also use to manage our canary releases, making sure our internal systems are upgraded first… But those are, in many ways, helping our development teams bridge the gap with what’s running in production, and providing a safe way to make changes in a controlled fashion to roll out those changes over time. Again, from my perspective - does it encourage SRE practices? I don’t see it that way, because for me it’s less about the tools. Will it help aid in the practice itself? Absolutely.

Do you think Kubernetes is important to SRE?

I think SRE is independent of technology. You apply that to a particular function, whether it’s Kubernetes, or anything else. Kubernetes is not SRE. Kubernetes is not important to SRE.

[57:52] Yeah. I think many people need to hear that, because they forget. Many people that – well, at least some people, I’m sure, that listen to this, they remember RRD databases; they remember Munin, and Cacti, and… Remember that? And Graphite… Oh, my goodness me.

I haven’t heard of those in a long time…

Oh, yes. Blasts from the past, right? Especially running them, never mind using them. So I think some of us many years ago were doing what some call today SRE. It didn’t exist as such, it didn’t have this label… But that human element is just more important today than it was in the past. Not that it didn’t exist; systems weren’t as complicated. Systems weren’t as numerous. We didn’t have phones with six CPUs, okay? We didn’t have laptops with six – that wasn’t a thing. So things are a lot more complicated these days. And for that, practices which in the past were good - they’re even more important.

100% agree. Absolutely agree. It really boils down to the soft skills, it boils down to culture, it boils down to people. In this day and age, more than anything else, like you said, things have gotten complicated. So in a complex environment, navigating a complex environment, you require some soft skills. Now, technology will aid, but it’s not the end-all-be-all, at least in my opinion.

So this is one of the last 2022 Ship It episodes… I’m already thinking about 2023. New projects, possibilities, opportunities, things like that. Me personally, I’m finally ditching spinning discs. I have four eight-terabyte spinning discs, which I’m replacing with SSDs. I’m very fond of my fanless home server, running NixOS… Okay, I’m a geek. I really love my infrastructure, I really love my hardware. Is there something that you are looking forward to in 2023?

I am looking forward to getting some sleep. I have little kids…

Okay… [laughs] Okay, that’s very important.

I’m hoping they’ll grow up a little bit, so they can sleep through the night…

How old are they?

They’re five and three.

Individually, they sleep through the night, in rooms that are adjacent, or what have you. If one of them wakes up, for whatever reason, the other magically wakes up at the exact same time. Or, at least my daughter ends up talking in her sleep and wakes me up, like “What are you doing?!”

I see. So sleep. Okay, that’s something that you’re looking forward to. Very important. Okay. It does get better. It does. Mine is 12 years old now - past; 12 years and a half - and it does get better.

Fingers crossed!

But you’re right, for the first six… Yes, yes. No, definitely. Okay. It will be better. Cool. What about OpenShift? What about Red Hat? What about your team, the things that you do? Anything that you’re looking forward to in the next year? Any interesting trends, any interesting things?

Yeah, so from a managed services perspective, we’re on a pretty interesting journey in terms of - you know, we started off small, we started off, for all intents and purposes, word of mouth; we’re now at a scale where we’re growing at a pretty fast clip… And so we’re getting a lot of customer demand. And that includes demand for things like compliance. So we’re entering the – you know, we’re now PCI compliant, HIPAA compliant, ISO, SOC-compliant etc. We’re in the process of being FedRAMP compliant… So I think the whole compliance space next year is something, believe it or not, I’m looking forward to now. A lot of people look at compliance and say “Oh my God, it adds a lot of processes” etc. For us, it’s actually become – it’s a validation of our systems and processes, because we have enough automation in place, we have good controls in place that I’m actually going, “What else could we be doing that we’re not doing?”

[01:02:19.11] So for example, we actually said, “Do we really need SSH access to the underlying OpenShift clusters, to the underlying nodes? Or can we manage it through the Kubernetes APIs and through our automation?” So we’re trying to experiment with things like that and say, “Can we scale it even more? Because we’re already collecting enough telemetry”, and we’re doing enough analysis that we understand what are the potential areas, opportunities for us etc.

So, for me, that’s an exciting space, getting into the AI/MLOps kind of space. That’s fascinating and interesting, so that’s something I’m looking forward to.

Okay. That’s a good one. Okay. I know this is something important to you… You don’t expect me to ask this, so let’s see if I’m right… What are the type of people that you think would enjoy working with you at Red Hat? Who are you looking for to join you?

I am looking for people who are passionate. I’m looking for people who are willing to be challenged in different ways. I’m looking for people who are open to being black and white about things, right? “This fails, this sucks, this is broken. Here’s why.” I want to know and hear about it, so we can actually course-correct, and fix it, address it etc. And I’m also looking for people that help each other out. It’s about the team, it’s not about an individual. So I’m looking for people that can be part of a team, build a family here. So it’s important for us to keep it that way, and make sure that the people we hire are brought in with similar ethos. That’s important for us.

And that’s coming from someone that’s been there for 15 years plus. As we prepare to wrap this up, is there a key takeaway that you would like the listeners that stuck with us all the way to the end to take away from this conversation? If they were to remember one thing, what would you think that thing should be?

First off, I don’t know if I should say “I’m sorry that you’ve listened all the way through the end…” [laughs]

No, no, no. I had a great time. I’m sure a lot of people had. And by the way, if you stuck to the end, you must have enjoyed it, right? [laughs] By the way, we have chapters, so it’s okay. People might have skipped to the end. That also happens. Now we have chapter support, so you know, if there is a portion that you didn’t find interesting, just go to the next chapter. It’s okay. It’s not a problem.

It’s fair. I think for me one of the key takeaways is my earlier comment about you can’t give what you don’t have. That applies to people, that applies to technology, that applies to who we are as human beings. For me, in my mind, I think that’s key to how I see things personally. If I don’t have something, obviously, physically I can’t give it away, because I don’t have it… But also from a leadership perspective, to say, “If I need to be a good manager, how do I come across to my team to make sure that they take something that’s positive more so than anything else? What is it that I want to give?” I think for me that’s an important perspective to have, whether you’re growing/building an SRE team, or a DevOps team, or a platform engineering team etc. It kind of dives into “It’s okay to fail.” As I learn things, I’m going to fail, I’m going to learn from it, but I also want to give it away, so somebody else can grow with it.

And yeah, I’ll just say that we’ve kind of hit on “It’s okay to fail” a million times today, and I want to make sure people can give away failure, too.

Yeah, I think that’s important. And I think we tend to take ourselves way too seriously, in this industry especially. It’s not life and death, by the way. Think about the doctors, think about the law enforcement, think about firefighters. So it’s going to be okay; yeah, it’s going to be okay.

Yeah, burnout is not working.

Oh, yes. That as well. That’s a good one. Narayanan, thank you very much for a great conversation. I had a great time. It was very relaxing. Thank you very much. I really appreciate the easygoing one; it feels like a great one to end the year on. Thank you very much, and see you next time.

Changelog

Our transcripts are open source on GitHub. Improvements are welcome. 💚
