Most of you already know what it’s like to work in a startup or a small company. A few of you have been asking us for conversations with engineers who work at big companies, the kind that run everything from big-title games to banking, and even critical national infrastructure.
In today’s episode, we talk to Ganeshkumar, a Software Engineer in the Azure Kubernetes Service team, who works on Node Lifecycle and Kubernetes Versioning, and Brendan, Kubernetes project co-founder and engineering Corporate Vice President of Microsoft Azure OSS and Cloud-native Compute. We talk about what it’s like to work for Microsoft, how mentoring works in practice, and what Kubernetes, Omega, & Borg have to do with it all.
Matched from the episode's transcript 👇
Brendan Burns: Yeah, I think one of the things – Ganesh mentioned the upstream team, which is another team in my organization that focuses on engagement with the upstream open source project… And I think in order to do a good job of both understanding how releases happen, and also potentially influence how releases happen, we have to be engaged. And we’ve had members of my team be the release leads for the open source project; not for AKS, but for the whole Kubernetes open source project. It’s a totally thankless job effectively, of like herding all of the cats of this giant project into a release… But that means that we have an intimate understanding of not just what each release looks like, but also how the broader release is evolving. And recently there was a slowdown from four releases a year to three releases a year… Effectively a reaction to the broader community saying like, “Oh my gosh, we cannot keep up with this pace of change.”
I think the developer community as well, the internal Kubernetes developer community as well, was sort of saying “We need to slow down. We can’t just keep jamming more and more code into this thing.” But I think the real difference that I see in releasing Kubernetes versus releasing it for AKS is exactly what Ganesh is talking about, which is… You know, for AKS a lot of what “at scale” means, or at hyperscale means, is incredibly diverse customer workloads… From large-scale machine learning batch jobs, all the way through to real-time serving telephony, even like Teams calls. And the upgrade has to work for every single one of them. The upgraded Kubernetes has to work for every single one of them. And it’s not even just about the workload, sometimes it’s also about like what API features did they decide to use?
[42:08] And one thing we learned early on in the Kubernetes project is no matter how much you call it beta, if it’s stuck around for two or three years, you may as well call it GA, because people will have treated it like it’s GA, and you will have set the expectation, because it hasn’t changed… And the minute you change it, it causes amazing ripple effects. And frankly, you can’t – once you have a certain number of users, you don’t have the option of saying like, “Well, but we said it was beta, and you’re all broken. Good luck.” That doesn’t fly in AKS really, at a certain scale, because it’s the principle of least surprise, I guess, at some level. Like, if you haven’t touched it in two years, people are going to assume that it’s stable, because it was stable.
So I think that’s the real distinction that is important for all of the Kubernetes providers, especially for Azure, because that’s the one I worry about: “How do we get that rock-solid reliability so that when the person presses the button, or when the Event Grid that Ganesh was talking about triggers, and someone automatically upgrades, it works?” And then tracking also. We keep track of the SLO for that upgrade, to make sure that we actually are validating it, and that we are achieving it. And sometimes that involves actually going back into the release and finding fixes, and, as Ganesh mentioned, carrying patches to help while you’re upstreaming those patches, and things like that… As well as, of course, something that Ganesh didn’t mention, which is making sure that also we handle CVEs, and we get notifications as a provider actually in front of the CVE release, because we’re on the embargo list… And so we can ensure that our customers are patched and secure on day zero of a vulnerability, and that they can either choose to upgrade, or in some cases, they’ll receive an automatic upgrade, kind of depending on the severity of the security issue.