Ship It! – Episode #68
Behind the scenes at Microsoft Azure
with Brendan Burns & Ganeshkumar Ashokavardhanan
Most of you already know what it’s like to work in a startup or a small company. A few of you have been asking us for conversations with engineers that work for big companies, the kind that run everything from big title games to banking, and even critical national infrastructure.
In today’s episode, we talk to Ganeshkumar, a Software Engineer in the Azure Kubernetes Service team, who works on Node Lifecycle and Kubernetes Versioning, and Brendan, Kubernetes project co-founder and engineering Corporate Vice President of Microsoft Azure OSS and Cloud-native Compute. We talk about what it’s like to work for Microsoft, how mentoring works in practice, and what Kubernetes, Omega, & Borg have to do with it all.
Notes & Links
- 📄 Borg, Omega, and Kubernetes - March, 2016
- 🎬 How to use GitOps with Microsoft Azure - Brendan Burns - July, 2021
- 🎬 Kubernetes: The Documentary - Part 1 - January, 2022
- Event Grid on Kubernetes with Azure Arc
| 4 | 03:05 | What's it like to work in a big company? |
| 5 | 07:43 | How did you guys end up together? |
| 7 | 13:47 | To talk to Brendan |
| 10 | 21:50 | Brendan's athletic history |
| 12 | 28:43 | How did Kubernetes come to be? |
| 13 | 32:49 | The box for the puzzle |
| 14 | 34:51 | What about Ganesh? |
| 15 | 37:28 | Thoughts on the K8s release process |
| 17 | 44:00 | The last few K8s releases |
| 18 | 48:00 | How long it takes for a new version |
| 19 | 49:02 | What else about AKS? |
| 20 | 55:00 | The Azure Event Grid |
| 21 | 57:24 | Why is K8s so important? |
| 22 | 1:02:18 | What else besides K8s? |
| 24 | 1:07:53 | Key takeaways? |
Son Luong Ngoc, and I hope I pronounced that right - one of our listeners, he mentioned that he would very much be interested to “hear from the builders who made things happen and established big enterprise companies.” And I’m quoting Son. So we have two guests today, which I think can help with that. Welcome, Ganesh, and welcome, Brendan.
Thank you for having us. Really excited for the conversation today.
Great to be here. Thanks for having us.
So Son, one of his questions was “What is it like to work in a big company?” And I’m looking at you, Brendan… What is it like to work for Microsoft?
Sure. I think that the most amazing, positive thing that comes from it is just the opportunity to impact the world everywhere. I often say that effectively, no matter what you’re interested in, whether it’s gaming, or aerospace, or human rights, Microsoft is involved in it, in some way. And so I think it’s very easy, both to find your passion, but also to – maybe your passion is breadth impact; for me, actually, it is really just opportunity to empower every single person in the world. There’s only a few places where that’s true.
What was the last thing, or the last moment when you thought “Wow, I contributed to that”, and there was like a real-world implication of something that you contributed towards? When was that last moment?
[04:04] Positively or negatively?
Up to you, I think they’re both relevant… [laughs]
Maybe both. I think that things like the Forza 5 launch - we were involved in helping that. It became game of the year. It’s running on the Azure Kubernetes Service. Tons of people – obviously, that’s a touchpoint for them, for their kids… That’s awesome to see.
Similarly, during the Covid pandemic, when we saw various health-related services that needed to be spun up quickly and were being spun up on the platform…
Similarly, in the negative of course, every time we have a customer-facing service interruption, we hear about the real-world impact, and I think it’s really important that we tell those stories, because I think it helps people understand that we don’t just do this because we want the quality number to be at a certain number… We do this because people’s lives depend on us doing a good job. Or there’s real-world impact when we have an interruption.
So I think there’s both a responsibility component to the kind of impact we have, but also just kind of an awesome component of like – I’ll be at a party, talking to somebody who really doesn’t care about what I do at all, but is really excited about Forza, and then I can connect it in, and now maybe they’re a little bit excited about what I do.
As I was listening to you, I was still stuck on that, Forza Horizon 5?
It’s a great game. It is a great game.
Real story… My 12-year-old now, in 2015 we got the first console; it was an Xbox. It was a Forza Horizon special edition. And it remains our favorite game. So when Forza Horizon 5 came out, I pre-ordered it months in advance. I could hardly wait for it to come out… And you contributing to that makes me feel even better, makes it even more special. So thank you very much for that, Brendan.
Ganesh, what is it like for you to work for Microsoft?
I think for me it’s been really exciting, and a great opportunity to even learn about what it is like operating in a large company. I think also the cool stuff, like what Brendan mentioned, about being in a team that contributes to Forza, or to platforms that companies like Forza use, is quite satisfying. And even when I share it with friends or people who are in high school, or in college, who are not familiar with Kubernetes and cloud computing, the thing that everyone’s excited about is Forza as well, in my experience.
I think working at Microsoft in particular I feel like has been very aligned with my own personal aims. I personally want to make a large-scale positive and tangible impact on people. Microsoft’s mission of empowering every person and every organization on the planet to achieve more aligns very well with my personal aims.
And even working on the Azure Kubernetes Service product helps me contribute to improving efficiency for developers around the world, which I find to be quite satisfying. Let’s say we make it easier to upgrade to new Kubernetes versions, or improve developer productivity through integrations like the AKS and Event Grid integration - I think it gives me a lot of satisfaction to know that we’ve contributed to that, and people are finding it easier to use as well.
We will unpack that a bit later… I still want to keep focusing on the people, because I’m fascinated by that, and I think that is first and foremost top of my mind… Like, what is it like from a people perspective? And I know that you and Brendan work together. How did you end up working with him? That’s a very interesting story.
[07:49] So Brendan is the leader of this large organization, and even for me personally, I had interned twice at Microsoft. In my first summer, it was in this different team, part of Azure, but it was not in Brendan’s org. And in that team, I was learning about distributed systems and cloud computing, and that was my first exposure to these areas. And my manager, mentor, another engineer on the team, would share about recent developments in the field, and one of the things that they shared was this paper called Borg, Omega and Kubernetes. And it turns out that Brendan was the co-author of that paper. When I read it, I was quite impressed by what I’d learned. Then when I found out that he was the co-author and he was working at Microsoft, I was very excited by it and I wanted to learn more about Kubernetes, it seemed quite interesting… Which made me switch into the Azure Kubernetes Service team the next summer, which is part of Brendan’s org.
So that’s how I became involved in this space. And even in the internship, for instance, Brendan would organize one-on-ones and office hours with interns, which I thought was just amazing, given how senior he is, and how many people he leads. And in that process, I was also able to learn about how he thinks, and how he envisions the future of cloud-native computing to evolve, and that’s how I’ve been learning from his experience.
Okay. So Borg, Omega and Kubernetes - that will come up again, I’m sure. Still sticking with the people… What is your perspective, Brendan, on working with Ganesh? How did that come to be from your end?
I think that one of the most important things that I can do is to connect with people as they start to be part of Microsoft, or part of my organizations. It’s trite, but there’s no time like the first time to help make an impression, and help make sure that people have an understanding of what we’re trying to do, and what the culture is that we’re trying to set… But also to get that fresh perspective, I always say that I want the person who really has no expectations, or who comes in and doesn’t know how things have been, or isn’t used to how things have been - that fresh perspective is extremely, extremely valuable to me… Both for how the organization is working, and also frankly how our product is working.
So that person who maybe has never used Kubernetes and comes into the AKS documentation, tries to make it work, maybe gets some of it working, but is like “This part over here I really didn’t understand” - that’s critical feedback for us. And I think especially, the interns are really great because they give you very unvarnished feedback. They are not shy about sharing where they see problems, and things like that, and I think that’s just a tremendous source of information for us to do better.
So I really view it as me helping make the organization and helping make the product that we have better. I think it’s probably the most important usage of my time, frankly.
Wow. That sounds spot on. Honestly, even for me - if I was in your position, I could not do it better; focusing on the people, focusing on the connections, focusing on that fresh pair of eyes’ perspective. That is so valuable, and I’m very glad that you see it the same way too, because… Invaluable. Untarnished. You’re like “What is this?” And the excitement, the joy, the discovery… And you’re in your best place to play with things, going back to Forza 5… Very important. Homo ludens. Very important. We love to play. And just like everything is a game, and everything is exciting… That sounds amazing.
What made you pick Ganesh specifically? Did you pick him? How did that happen? A set of coincidences…? How did you two end up working together?
[11:58] Well, I think actually the truth is that I really try and make these opportunities available to especially the interns in my organization, to do office hours and to do an exit interview with all of the interns in the organization… Because also, I think that, at the end of the day, these are the people who carry back what Azure is to their colleagues, and to their friends, and to their campuses. So they come, they’re at a place, they separate out, and some go to Google, and some go to Meta, and some go to AWS, and some come to us… And then they come back together. And I’ll say, selfishly, I want everybody to be jealous of the people who decided to come to Microsoft, and who decided to come to Azure. And I think there’s a lot of good reasons for that. But I also want to check in and make sure – like, if they didn’t have a great experience, why not, and what did we do wrong? But also, I wanna have a chance to check in and help set up a connection for the future.
I would say also one of the things that’s important is I don’t necessarily – well, I love that Ganesh came back to Microsoft and to the AKS team. I also want the people who go off and do startups and the people who go off into other companies - I want them to choose Azure. At the end of the day, they’re gonna have a choice of cloud providers, and they’re gonna be in a company where eventually - maybe at the beginning, or maybe later on - they’ll have a position of influence, and I want them to choose Azure. And I think the only way that I can actually make that happen is by understanding if we’re doing a good job.
So it’s not a unique thing, I don’t think… It’s not a unique thing even for the interns. I have office hours every month, and whoever wants to in the organization, or even sometimes beyond the organization, can sign up and we just have a one-on-one, and talk about whatever they wanna talk about.
Okay. How can people that are curious about this maybe – are there some public links? Is this open to anyone? How can they access that?
Currently, it’s Microsoft employees…
I would say if you’re curious about my thoughts, and you’re out there and you’re not in Microsoft, I’m obviously up on Twitter, and you can hit me up on Twitter, and I’m happy to discuss things there. I discuss non-tech-related things, too. I’ve posted some fascinating stuff that I learned about the cleaning of Ballard Locks, which is a big thing in Seattle… One of the lakes connects to the sea, and it was – I just discovered it this morning, and I was like “This is fascinating. Other people should learn about this.”
I have to check it out. That’s the one thing which I haven’t checked. I haven’t checked your Twitter, but I will have a look, and maybe even add a link in the show notes.
Yeah, yeah, yeah. And a recent post-mortem of me making hot water instead of coffee for the household.
That one was pretty funny… I like how he did the RCA.
I had some fun with it, I have to admit.
Okay… So I think it’s universal. We are all nerds, in different ways, and we will always take every opportunity to nerd out. There’s something that you mentioned, Brendan, and I want to go back to Ganesh, because I think it’s important… Why did you choose Azure, Ganesh?
For me, I think related to what we were saying earlier, working on something that has a very large-scale impact is something that I personally enjoy doing, and for a very long time as well I’ve been wanting to work on projects which have this large-scale impact. And Azure is used by organizations around the world, like we talked about, and the changes we make there and the improvements we make have significant cascading effects downstream. So that’s been very satisfying for me to see… Whether we are able to fix security issues quicker, or improve productivity in different aspects; it really trickles down to developers, and eventually to users who are not even technical. So that’s been very satisfying for me to be part of.
[15:50] And I think the other aspect that I like about Azure is also learning about how all these distributed systems components are connected, and how different layers of the stack come together to actually create a great product for users, and how as a large organization we continue improving upon that, as well. So I think being part of Azure gives me exposure to a lot of it, and I’m able to learn and contribute to those aspects. And through my internships, I was able to see what Azure was doing, and what it could do, and I think that’s why I wanted to continue working on Azure.
I will go one step further, because I really liked how Brendan talked about it, where there’s all these options, young engineers, they can go to Meta, or AWS, or Google, and some pick Microsoft… So going back to that context, why Microsoft? What made you pick Microsoft?
I think it sounds a little cliché, but I do really resonate with the mission of the company, and also my experience interning. I essentially did six months of internships at Microsoft, so I knew what the organization was like, what the team was like… I had firsthand, day-to-day experience working with them, and I was confident that Microsoft was doing the right thing. A lot of things that Microsoft does are very transparent, both internally and externally, which also resonates with my personal values. So I think those aspects were quite important in helping me make this decision to return to Microsoft full-time. So that’s primarily why.
For all our listeners, what is that mission statement that resonated with you?
Yeah, I think Microsoft’s mission is to empower every person and every organization on the planet to achieve more. And I see this throughout the organization, where people are aligned with this mission, and they want to contribute to products and features that make it easier for developers and users to build their own products and use/run their applications easily, and so on. So that part – I see that even the actions align with the mission. I think that’s great to see, especially as an intern, when I could observe it over six months, during the internship. I was happy to see that the day-to-day experience was aligned with the stated mission and values.
Okay. So I know that Brendan is your mentor, and I’m wondering, what does it even mean to be a mentor? Because I have had people approach me, and then wondering whether it’s a personal thing, or whether – what is it to be a mentor to you, Brendan? I think this is a very important learning lesson for me right now, so I’m all ears.
I think it’s a combination of things. I think some of it is to kind of help people learn from your own mistakes… Like, if you have everybody making the same mistakes over and over again, everybody eventually arrives at the same place, but it’s kind of inefficient…
Does it mean that you’ve made a lot of mistakes, and you have a lot of wisdom to share?
I’ve made a lot of mistakes, oh yeah. For sure. For sure. I think all learning at some level is based on making mistakes. So I think that’s part of it. Like I’ve spent a long time in the industry, I’ve seen a lot of things… I think the other part is the historical perspective. For me anyway, a lot of times when I’m thinking about how do I make decisions in the technical world, I’m casting back over my own experiences, whether it was Linux versus FreeBSD in the late ‘90s, or other kinds of moments in the tech industry, and trying to find the analogy, so that I can help explain today in the context of the past. And I think sometimes people – like, if you didn’t live it, you don’t have the same understanding as if you’re reading about it, or anything else like that. So I think part of the mentorship is also just giving people that historical perspective, but also giving them a place to bring concerns or questions.
[20:07] To be honest, I think a mentor – while it’s good to have a mentor in your organization, sometimes it’s also really good to have a mentor who’s outside of the organization, who you can really bring anything to. That’s also, I think, valuable, that independent perspective. So I think all of those – in my mind, it’s not that different from being a teacher, it’s just a little bit more of a focused… Maybe like a focused one-on-one discussion, as opposed to anything else.
But I also would say I let people bring their own agenda. I’m always really clear with people, like “I’m here to be of service to you.” I don’t give assignments, or anything. I don’t think that’s mentorship, really. I think it’s like, you bring the agenda, you bring the questions, and then we’ll see where it goes.
Okay. Do you have a mentor?
I have at times had mentors. Recently, I would say that my mentorship experiences have been a little bit more one-off around specific things, like, “Hey, I really want to work on this particular aspect” and someone says, “Hey, this person is really good at it” and I go… I would say we role-play through discussions, where I’m like, “I don’t think–” Like, we had a discussion, and that thing, it came out more – I heard back through the grapevine that someone really thought that I was being too overbearing, or I was being too micromanagy. And being able to role-play through that with someone else, and being like, “Okay, here’s what the situation was, here’s what I did. What would you do?” and then just listen to their… It’s kind of like watching somebody else do a sport, or play a – well, actually not for me. When I watch other people do sports, I see things that I couldn’t ever do. But fortunately, in the tech world, when we talk through it, I’m like, “Oh, actually, I think I could do that.” So I think that’s valuable.
Brendan high-jumping. I cannot imagine that.
Oh, my God… That’s so cool. It’s so cool, but I’m not able to do that. I did actually pole vault. I used to pole vault.
Not very well. I was a distance runner. I still am a distance runner. I was a distance runner in high school and college. But in high school, I really wanted to pole vault also, and so I convinced the coach to let me pole vault, and I pole vaulted. And then in college every once in a while we would need a couple extra points, and the coach would be like, “Hey, you can pole vault and you get over the minimum height, and we can get a couple of – you can finish last. There’s few enough pole vaulters that even if you finish last, you’ll still get us a couple of points, so…” You go where you’re needed.
Okay… That sounds like a setup for success. Like, he knew what he was doing. He was a good coach. Talking about mentorship…
He was eking out every last point.
I heard Brendan has records for running, and he challenges people in his org to beat them…
I do. We have a giving campaign every year, which is one of the things – you talked about Microsoft and the culture… One of the things that I really value in addition to empowering everyone to do more is that sense of Microsoft as a member of the community. And Microsoft Give every year is a focus on philanthropy and giving back. And I think that is huge, and it just resonates so well with me personally. But as part of that, there’s a charity 5k. And so there is a standing offer within the org that if anyone runs faster than me in that 5k, I will donate a significant amount of money to the charity. And actually, one person did. So there was a guy – he’s no longer in the org…
That’s why he’s no longer in the org. [laughter] He beat you. He beat Brendan. “You’re out, sir.”
No, he left voluntarily… But yeah, there was a guy actually, and he was really fast.
What is the time to beat?
So it’s a 5k distance. Last year, I went just under 19 minutes.
Just under 19 minutes. Okay.
18:59, okay. Flat surface?
[24:02] I mean, it wasn’t a track, but it was a relatively – it was around a lake.
Okay. It’s been virtual. So the one on the Microsoft campus, which they’ve done in the past, is a little bit more hilly. Last year, because of the pandemic, it was virtual. So I went to a lake near my house, and actually a few of us got together and did it.
Okay. 18:59. So dear listener, that is the time to beat. If you can beat Brendan, go and talk to him. Alright…
Plenty of people do. And I’m old at this point. That’s not my fastest ever. That’s just my fastest as a 46-year-old.
Right. But what is your fastest ever?
My fastest ever is 15:09.
15:09. Okay. Now, that is a good one. Okay. Wow. How old were you when you did that? Do you remember?
I was senior year in college, so I was 22.
And that was on a track.
We’re very competitive. I know I am, so I will look into that. That’s what I can do about that. Ganesh, what are your thoughts?
I have a feeling that some of your listeners are gonna listen to this and they’ll join Microsoft so that during the Give campaign, they can attempt to beat his record. So…
Mission accomplished, Brendan. Mission accomplished.
You have to join my organization though…
Right. There you go. So very important. Alright.
I think that’s gonna be likely – someone’s gonna do that.
So I had a great time talking about like this – focusing on the people, focusing on the culture, and I hope that our listeners enjoyed it too up to this point. I think now it’s time to maybe switch to technology. That’s what I’m thinking. And we will start with something that Craig McLuckie said in the Kubernetes documentary. By the way, I watched both parts. It’s an amazing one. We talked about it in the past, and I think this is a good time to rewatch it. I’ll put the links in the show notes. And Craig McLuckie says, and I quote, “I think that Brendan is a creative genius. If you spend two days with Brendan, he will throw off ten ideas, any of which could actually change the world.” So Brendan, how did the Kubernetes idea come to be? And what happened after it was shared with the world?
Yeah, so I think credit really is due to Joe and Craig and myself together. We were working really closely together on things, and it was all three of us kind of being contextualized. I think the context was we were really thinking about – like, I had built a bunch of distributed systems, and then because of some reorganizations, I had moved into the public cloud organization from the search organization… And I suddenly started experiencing how people were building applications and deploying applications onto virtual machines, and it was really, really primitive and hard. I guess maybe primitive is the wrong word, because it’s super-negative. It was just really hard and flaky. Like, you’d do a really good job building a deployment engine, and it would work 85% of the time, 90% of the time. You know, not even a single nine kind of percentages… And so we just knew that there were better ways to do this.
[30:28] And simultaneously, Craig and Joe had done the – they had built the virtual machine infrastructure, but they were looking at “How do we help move beyond this, effectively?” And we knew that containers were an essential component of this, but to be completely honest, we didn’t know how to make people understand that containers were important, and get them to adopt containers. And so a lot of our initial conversations were about that part of it, of like “How do we get people to be into this mindset?” And what happened was Docker came along and just convinced everybody for us.
And so I think an important part of the Kubernetes story is to say that without Docker, I don’t think there’s a Kubernetes. They broke ground here in terms of getting people interested and thinking about it. And so we saw that in the early days, and we started tracking the open source project on GitHub, and paying attention to the community… But I think that what we saw as it was evolving was that we could see into the future, because we’d lived it in the past. Or another thing we said at the time was it was a little bit like everybody had all these puzzle pieces that they were randomly trying to put together, but we had the puzzle box that had the picture of what it’s supposed to look like at the end.
I have this vivid memory of going to the very first Docker meetup in Seattle… And it was in probably November of 2013. And exiting that meetup – and I’d come to understand the community, and all this kind of stuff… And I ended up basically doing a tutorial, like an interactive tutorial on Docker and containers during the meetup, because there was just a ton of interest and excitement. But nobody really knew – like, they knew they should be excited, but they weren’t quite sure why.
So leaving that, I just had this incredible sense of this opportunity to build something that would take all this excitement and could transform the way that the industry was moving. And we knew that it had to be open source; Docker was open source. We knew that in order to be successful, it had to be open source. And so that was a whole other aspect of it. And I came back from that meetup that night, went to the office the next morning, and Joe and Craig and I sat near each other, and we just talked through it, and what we could do, and started out with a demo to kind of gain support and illustrate what it could do… Really just kind of glued together a bunch of existing open source components for the first POC demo… And I just went from there.
What was the puzzle box for you? Because you mentioned containers, they just convinced the world for you… But the world didn’t have the puzzle box, didn’t have everything else around it. They knew containers were important, they understood Docker, it was simple, but there was way more than just that. What did that look like?
It was like the developer-oriented API. We’d been used to building cloud APIs, but all of the old cloud APIs were all infrastructure-based. It’s all like virtual machine, virtual disk, virtual network, virtual – they’re not developer-focused, they’re infrastructure-focused. And so we said, first of all, Docker and containers as they’re currently talked about are kind of really focused on a single machine. But every application that we know of, and that everyone wants to build, is a multi-machine application. And so we have to provide something that kind of abstracts away from the machine, and gives an orchestrator view. And I think for some people in the community at the time, they said, “No, no. We’ll just give them the same view. It’ll just be across multiple machines, but it’ll kind of look like it’s one machine.” And we said, “No, from experience we know that that jump from a single machine to multiple machines - you need to introduce new abstractions. You need to introduce replication, and rollouts, and service load balancing, and all of these components that aren’t really present on a machine, but are present in that orchestrated layer, and you need to build that piece of software that’s going to do the orchestration.”
[34:19] And so things like the pod, things like what we called a replication controller originally, but became ReplicaSet, and eventually Deployment, and the service load balancer, became like the core ideas that we were trying to express and that we were trying to talk about. And then that, you know, RESTful, distributed, resilient API was also an important part of it… Again, kind of getting it from being a daemon that runs on every machine to being a service that runs across a bunch of machines.
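To make the replication idea concrete, here is a minimal Python sketch of a control loop in the spirit of a replication controller: it compares a desired replica count with the pods that actually exist, then creates or removes pods to close the gap. Everything in it (the `Pod` class, the `reconcile` function, the in-memory list standing in for a cluster) is hypothetical and for illustration only; it is not Kubernetes code or its API.

```python
from __future__ import annotations

import itertools
from dataclasses import dataclass


# Hypothetical stand-in for a pod; real pods carry far more state.
@dataclass
class Pod:
    name: str
    healthy: bool = True


# Counter used only to give each new pod a unique name.
_ids = itertools.count(1)


def reconcile(desired_replicas: int, pods: list[Pod]) -> list[Pod]:
    """Return a pod list that matches the desired replica count.

    Unhealthy pods are dropped first, then pods are added or removed
    until exactly `desired_replicas` healthy pods remain.
    """
    running = [p for p in pods if p.healthy]
    # Scale up: create replacements for any shortfall.
    while len(running) < desired_replicas:
        running.append(Pod(name=f"web-{next(_ids)}"))
    # Scale down: drop surplus pods from the end of the list.
    return running[:desired_replicas]


# Start from an empty cluster, then simulate one pod failing.
cluster = reconcile(3, [])
cluster[0].healthy = False
cluster = reconcile(3, cluster)
```

The caller states only the desired count; the loop works out what to create or delete. That is the shift from a single-machine view to an orchestrated one described above.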
Okay. What about you, Ganesh? What was it like when you started with Kubernetes? Because I’m sure your perspective was very different to Brendan’s.
Yeah, definitely. So for me it was quite a new paradigm for thinking about software as well, and even thinking about how code is run, and infrastructure in the cloud - it’s very different from what you do in college. Most of my courses were focused on running code on your machine, and when I was doing some training of ML models and so on it was using cloud computing and running on different machines, but not at the scale at which Azure operates, or a lot of production-grade software operates. So even a lot of the analogies that Brendan used and comparisons he did with previous technologies - I did not have that as a reference. So Kubernetes for me was also the first time in which I was learning about many of these concepts, around deployments and resilience, and so on.
So there were two parts, I guess. One part was even before Kubernetes, in my first internship at Microsoft, I got exposure to how rollouts are done, and so on, without using Kubernetes. And then, in my second internship, I had exposure to AKS and Kubernetes when I was interning in the team.
So for me, learning about Kubernetes I think has definitely been something where I explicitly spent time learning from docs, and actually even Brendan’s videos on YouTube, on Kubernetes, I watched them. For me it was also cool to actually meet the person behind the videos, too. So that was nice.
And I think also working on different projects within AKS has helped me learn about various Kubernetes concepts. And one of the things that I worked on is Kubernetes versioning for AKS, which is supporting new versions of Kubernetes on AKS. That process, especially for minor versions, involves many API deprecations, and flag changes, like Kubelet flag changes and so on that we would need to handle… So as part of debugging Kubernetes code there, looking at all these different components within AKS and so on, actually helped me become better at getting a framework of Kubernetes, and models of Kubernetes in my mind. And I think it’s still an ongoing process; there’s so much in this space, so it’s been a good journey.
That’s very interesting. What are your thoughts on the Kubernetes release process? The versioning… Because I know there have been some recent changes, it’s now being signed… A lot of changes around the release process, and it’s very complicated, because it’s such a complicated piece of software, so many contributors, so many different components to it… How do you view the release process as an end user that then has to help curate it for AKS?
Yeah, that’s a great question. And just as a background too, even when I joined AKS full-time, I did not think too much about how new versions of Kubernetes come up in AKS. I sort of thought that they just showed up. I was very naive in my thinking there. But then I learned about how oftentimes you need to make changes internally to actually support that.
[38:18] So like you mentioned, in this case I’m more of a user of what the upstream Kubernetes release teams do for versions. I guess the parts that are helpful are all the release notes, and documentation, and they provide high-level summaries of what changes are going to be available… I think that’s quite helpful, even in the process of making changes in AKS. Because especially in AKS, there are so many edge cases, and different scenarios that we need to address for customers, because there are so many different customers. So getting a good overview of the major changes in minor versions is helpful.
I think the part that I’m still thinking about and perhaps could be improved as well is in terms of making it easier to handle some of these API deprecations and flag removals, and so on. Some part of it is manual, even though there’s like various mechanisms to figure out whether an API version is being deprecated, and so on. I think some part there perhaps could be improved. And I know it’s a complex process as well, with all these fixes coming in, and some of it going through patch versions, and so on… And AKS internally also has additional backports of hotfix patch versions to previous patch versions and so on that only AKS [unintelligible 00:39:43.24] supports. So the overall process as well is like a lot of moving parts, and it’s been interesting to see how that plays overall. And for me too I think it’s a great opportunity to collaborate as well with the upstream team within both Brendan’s org, and also in the broader community, to make sure that the main features and changes that they’re doing can be fronted to users through AKS.
What are your thoughts, Brendan, on the Kubernetes, on the current Kubernetes release process?
Yeah, I think one of the things – Ganesh mentioned the upstream team, which is another team in my organization that focuses on engagement with the upstream open source project… And I think in order to do a good job of both understanding how releases happen, and also potentially influence how releases happen, we have to be engaged. And we’ve had members of my team be the release leads for the open source project; not for AKS, but for the whole Kubernetes open source project. It’s a totally thankless job effectively, of like herding all of the cats of this giant project into a release… But that means that we have an intimate understanding of not just what each release looks like, but also how the broader release is evolving. And recently there was a slowdown from four releases a year to three releases a year… Effectively a reaction to the broader community saying like, “Oh my gosh, we cannot keep up with this pace of change.”
I think the developer community as well, the internal Kubernetes developer community as well sort of saying “We need to slow down. We can’t just keep jamming more and more code into this thing.” But I think the real difference that I see in releasing Kubernetes versus releasing it for AKS is exactly what Ganesh is talking about, which is… You know, for AKS a lot of what “at scale” means, or at hyperscale means, is incredibly diverse customer workloads… From large-scale machine learning batch jobs, all the way through to real-time serving telephony, even like Teams calls. And the upgrade has to work for every single one of them. The upgraded Kubernetes has to work for every single one of them. And it’s not even just about the workload, sometimes it’s also about like what API features did they decide to use?
[42:08] And one thing we learned early on in the Kubernetes project is no matter how much you call it beta, if it’s stuck around for two or three years, you may as well call it GA, because people will have treated it like it’s GA, and you will have set the expectation, because it hasn’t changed… And the minute you change it, it causes amazing ripple effects. And frankly, you can’t – once you have a certain number of users, you don’t have the option of saying like, “Well, but we said it was beta, and you’re all broken. Good luck.” That doesn’t fly in AKS really, at a certain scale, because it’s the principle of least surprise, I guess, at some level. Like, if you haven’t touched it in two years, people are going to assume that it’s stable, because it was stable.
So I think that’s the real distinction that is important for all of the Kubernetes providers, especially for Azure, because that’s the one I worry about is “How do we get that rock-solid reliability so that when the person presses the button, or when the Event Grid that Ganesh was talking about triggers, and someone automatically upgrades, it works?” And then tracking also. We keep track of the SLO for that upgrade, to make sure that we actually are validating it, and that we are achieving it. And sometimes that involves actually going back into the release and finding fixes, and, as Ganesh mentioned, carrying patches to help while you’re upstreaming those patches, and things like that… As well as, of course, something that Ganesh didn’t mention, which is making sure that also we handle CVEs, and we get notifications as a provider actually in front of the CVE release, because we’re on the embargo list… And so we can ensure that our customers are patched and secure on day zero of a vulnerability, and that they can either choose to upgrade, or in some cases, they’ll receive an automatic upgrade, kind of depending on the severity of the security issue.
Yeah. So based on the metrics that you have, are the Kubernetes releases going in the right direction? Are things improving? What do the metrics say?
Well, I guess what I would say is like we have always and we continue to do a lot of work to make sure that AKS upgrade is extremely stable. I think the Kubernetes releases themselves are pretty stable, but our customer base, the diversity of our customer base is just not something that goes into those releases.
Have you had any surprises in recent months - or let’s say this year - that you weren’t expecting, something that astonished you or surprised you?
No, I don’t think it’s anything you’re not expecting, necessarily. Maybe Ganesh has a different perspective… I don’t think it’s a question of like what you’re not expecting, it’s more of like something slipped through. Like, something slowed down if you’re running 10,000 node clusters, or something starts using more memory… It’s little stuff, it’s not like, “Oh my gosh, they deprecated this API and nobody knew they were going to deprecate it.” Like, that kind of stuff is really well documented, and all that sort of thing.
I think it’s much more of the like, “Well, if you’re gonna run tens of thousands of clusters for different customers in different environments, there will be edge cases, and not every single one of those is going to get fully vetted and tested.” But I’d be curious to hear Ganesh’s perspective on that.
[46:00] So while the release notes I think do a good job in terms of giving an overview of what the major changes are, with new minor versions… I think sometimes there have been what felt a little unexpected, like port changes, which ended up affecting some component in AKS… And I think a lot of other engineers in AKS, before I even joined, had created this great suite of end-to-end tests which also catch many of these issues. So for me, in that sense, it was like debugging it with other people to find out what the root cause of that issue was, so that AKS catches those issues before customers even get it.
We even had this internal call with a customer, and I was able to ask them if they had seen any surprises with API changes because of new Kubernetes versions, and he had mentioned that he had not experienced it. I think that goes to speak about AKS itself handling those issues for them.
And I think that also goes back to the initial point, where I was talking about how we are improving productivity for many users who run on AKS by working on these changes by ourselves. Because if you were to be running your own Kubernetes clusters, I think you’d have to worry more about these feature flag changes… Kubelet had these flag deprecations in Kubernetes 1.24, so you had to think about all of that… Versus if you are in AKS, you don’t have to think about that ideally, because we’ll handle it for you.
And because so many users use AKS, I think the net impact of us handling it also saves them a lot of time, which is quite satisfying to know. And it’s also cool to see people tweet about it, new Kubernetes versions being supported in AKS… Or supporting new Kubernetes versions really fast, soon after its release upstream. So that’s nice.
That’s interesting. How long does it take you to roll out a new Kubernetes version in AKS?
So I think for minor versions, it sort of depends. We aim for it to be soon after it’s released. We also mentioned how it ties in with the upstream release cycle, and we sometimes test with RC versions, or alpha and beta versions, even before the point zero version is released for a minor version, so that we can identify many of those changes earlier.
So we really target for soon after its release, but we have this release cycle on the website, where we give a little bit more conservative estimates, which is typically at most a month after it’s released, but we tend to also aim for faster timelines.
Yeah, it’s usually sooner. Okay, okay. Now, I know that AKS really was just like a base layer. There are so many things plugging into it, and I think Brendan you were alluding to it… Like Microsoft Teams… Huge systems that depend on the AKS stability, on the AKS availability, all that. So what other things tie into AKS? Because there’s a lot; there must be. And not just software, but also integrations, also things that just make use of AKS as the starting point.
Yeah, for sure. I think this is one of the places where we do really believe we have something unique. When you look at the broader Microsoft ecosystem, we can connect all the way from VS Code, where somebody’s doing their editing, and we have implemented a Kubernetes extension there that does things like highlighting best practices. So if you don’t put resource limits on your application description, we’ll put a little red squiggly line in there and highlight the fact that you don’t have a best practice… As well as capabilities about introspecting running applications, so that you can really easily and securely connect to pods in your cluster for debugging, collect logs, look at configuration files easily… Kind of really try and provide that streamlined Kubernetes experience. And all that actually is done not just for AKS users, but for any Kubernetes user using VS Code.
And then of course, we also add on and have some really great capabilities for AKS users in VS Code, where people can run security scans of their cluster, get best practices… And that security aspect also shows when we do integration with things like Azure Defender. That’s a different Microsoft product that can work and provide recommendations and scanning of your cluster. Obviously, people use a Container Registry with AKS, and we have a bunch of integrations, both to get images pushed into the Container Registry, but then also to consume images out of the Container Registry. And monitoring solutions that work on top. We’ve had development of the Open Service Mesh, that is a service mesh solution that’s integrated into AKS and supported… Because I think a lot of times when people look at some of these open source componentry, one of the biggest hurdles to adoption is “Who do I call if it breaks? Like, do I need to become an expert?”
As Ganesh was talking about, we’re experts in Kubernetes for people, but we also are experts in the Open Service Mesh, so we can provide that as a supported service. And we’re experts in – well, the monitoring team, which is outside of my organization, but they’re experts in monitoring, and they can provide monitoring for Kubernetes… But also to integrate that with open source solutions, whether it’s Prometheus or Grafana, for cloud-native monitoring.
So I think we absolutely try and take that perspective, that AKS is a piece of a broader toolset, and how do we ensure that people can use those end-to-end; use the great Microsoft ones when they want to, but also be able to plug in other componentry if they don’t. If they want to use a different service mesh, and they’re willing to stand up a team to support it - fantastic. It’ll work on top of AKS. We don’t want to have them forced into any particular choices, but you definitely want to provide them with that streamlined glide path that gets them successful as quick as possible.
I think that’s one of the advantages of choosing Kubernetes - you have all this incredibly rich ecosystem that is available. And by the way, when you want support for specific components - guess what, there’s like a whole org behind it, and it’s like all end-to-end, the integrations, because we all know how long it takes to pick the right plugin for VS Code, and make sure it works… Little things like that. It is the integrations that get you - not necessarily specific things, because they can work well; but put together - that’s where surprises happen. And having someone that focuses on that - I think it’s not a thankless job, but it’s an invisible job… Because people are thankful, but it’s like - there’s so much to it.
Echoing what Ganesh said earlier about empowering people and productivity - that’s the stuff that really resonates, knowing that the work that we do just takes away toil and burden from other folks.
I really resonate with a lot of what Brendan said, and also, I think it ties into, Gerhard, the question you posed from your audience member about what it’s like working in a big organization, and a big company like Microsoft… I think it’s interesting to see how different components that are owned by teams, many of which are in Brendan’s org, but also outside Brendan’s org, in other parts of Microsoft - how they come together, so that users are having a seamless experience.
For instance, if you’re going to use Windows containers, that involves code from Windows and other parts, which as a user you may not realize that, okay, maybe it’s in different orgs, and so on, but your experience still needs to be smooth. So seeing how that collaboration happens across teams has been something that I’ve been learning as well, as an engineer early in career at Microsoft.
[53:54] And then I think AKS as well, because it fronts a lot of services in Azure, and in Microsoft, I think has this responsibility in terms of making sure that end-to-end things work well, which is why I think the AKS team as well is very collaborative with multiple teams within Brendan’s org, or outside of it, so that users have a good end-to-end experience.
Even recently, I was talking to a teammate and he has been driving the Azure Draft product, which makes it easier for users to basically containerize their workloads. And there’s also GitHub integrations that he’s working on. There’s so many parts of the organization that come together, so that users have good tools to use. And I’ve been seeing it both as an engineer, when I’m on call too, when there’s different components that might be involved within Microsoft, to fix an issue… And also just observing how these collaborations happen, and also driving some of these collaborations for new features.
I’ve heard you mention, Ganesh, a couple of times the Azure Event Grid. Can you tell us more about that? Because I only heard about it from you. So tell us more about it.
Sure, yeah. And hopefully you’ll hear about it from more people as well, in the future. So Event Grid is this platform in Azure that in simple terms just makes it easy for you to consume events in various Azure resources. During my internship project in AKS I was lucky to be part of a team with another intern and my internship manager to build the Event Grid integration with AKS.
So what it does is it makes it easy for you to consume events related to AKS clusters. And based on those events, you can create new workflows as well. So the event we’ve started out with is new Kubernetes versions being available for your AKS cluster. So if you are using AKS with Event Grid, you will be notified of this event of a new Kubernetes version being available, and you can create workflows based on that. So you can maybe test out in a test region, or you can just test out your workloads with the new Kubernetes version to make sure everything’s fine, and it’s upgrading properly. It gives you more confidence there.
And also this integration that we worked on provides this platform to make even more workflows based on other events as well. For me personally, it was very exciting too, because this was an internship project initially, and engineers worked on it after the internship as well… So it was exciting, because it was a very user-facing feature, and I was able to see that – it launched in public preview, and I was quite satisfied to see… You can see it on the website, and people can use it. That was also a great example of a very impactful internship project. I think there’s more to come as well; other engineers are working on this integration more, so… Stay tuned.
Okay. Can you share a link with us to put in the show notes? I think that would really help.
Okay. To share it with others that may be interested in this. Okay. Okay. So we talked about Kubernetes quite a bit, and I’m wondering now, going back to the beginning, having all that history, and you being part of it, Brendan, why do you think Kubernetes became so important?
Yeah, I think part of it is a reflection of the realities of a cloud-based distributed system development. And I think that you can – I think that a lot of the important ideas that are in Kubernetes, you can see pre-existed Kubernetes. So I think that it’s almost a by-product of its environment, rather than a transformational – I think it crystallized a lot of stuff that was going on in the industry.
[58:02] Prior to Kubernetes, people like Netflix were talking a lot about immutable infrastructure, but they were doing it with VMs, and they were building a lot of kind of orchestration-ish tools, but they were doing it with virtual machines… And tools like Puppet and Chef and Salt were thinking about orchestrating, but they weren’t online; they were sort of one-time blasted across your infrastructure, and then hope that it stays up. It didn’t have that sort of self-healing kind of aspect to it.
So I think that what Kubernetes did is it really crystallized a lot of the ideas that were bubbling around about how you do cloud-based infrastructure. It added some things in terms of online self-healing, that were in part related to our own experiences with not wanting to get woken up in the middle of the night, and my own experience with control loops and robotics, and balancing systems, and things like that.
But I think, while all of that is important and useful, I think the thing that really made Kubernetes the thing that did it, as opposed to anything else - because I think it’s really important for everybody to remember, a lot of people weren’t in the ecosystem at the time. But there was five or six or seven or eight different systems that were trying to do the same things that Kubernetes did… And over time, over a couple of years that winnowed down to a single solution. But I don’t think that – it wasn’t that Kubernetes was unique in the ideas that it was trying to push forward… But I think it was really unique in building extensibility in from the ground up, and building a really strong, vibrant multi-vendor community, and a welcoming community, and a strong ecosystem. You mentioned the ecosystem earlier, building a really strong ecosystem of other people who were dependent on its success.
If you look at machine learning today, every single machine learning system that’s out there is using Kubernetes. And so that means that every single AI system that’s out there has almost a requirement that Kubernetes be successful. They’re motivated to make sure that Kubernetes is successful, because that’s the framework that they step up from.
So building that ecosystem of people who weren’t necessarily interested in container orchestration, but needed a foundation to build on - I think that eventually becomes a network effect that turns these things into something that has staying power. Because at this point - you know, you could sort of imagine something maybe better or different, but there’s so much invested in that ecosystem that the switching costs are extremely high. I think that’s why we’ve seen it take legs.
I think the thing is because we’ve built an open ecosystem, and a vendor-neutral ecosystem with the Cloud Native Computing Foundation, no one is super-motivated to try and disrupt it. It’s just way easier to become part of it, and it is a very neutral place, and so that also helps with the stability of it.
Yeah. I definitely do see Kubernetes becoming that basically tide that lifted all the boats. It became the sea, the tide. So - well, you can replace it, good luck, but that’s a lot of effort. It’s a lot of time, it’s a lot of investment… And why do that?
Yeah, and I think it’s a lot of stuff that – I mean, people would rather be working on the next thing. They’d rather be building their exciting application, rather than worrying about the infrastructure. And I think that’s great. I do think it’s really important to know that it really was the ecosystem. I think people say, “Oh, you created this thing, and it was so successful”, and all this stuff, and I really always wanna – I want to say and to be clear about the fact that we sort of started the ball rolling, but it was this community that did it, and it is the broader ecosystem that did it. It’s the Prometheuses of the world, and all of the Cloud Native Computing Foundation projects that created the actual thing that people use. Kubernetes is a part of it, but it’s the breadth of the ecosystem that is the thing that I think really delivered the staying power and the motivation, and differentiated it, frankly, from all of the other equivalent things at the time.
[01:02:18.01] Yeah. So continuing to think about the ecosystem and about all the other projects that are in this ecosystem - Kubernetes is an important one, but by no means the only one - what are the other projects that you’re using, paying close attention to, finding interesting? I’m wondering about you, Ganesh, right now… What is it?
Yeah, that’s a great question. I think there’s so many projects, and – you know the meme about the Cloud Native Computing Foundation, all the projects there… So I think I’m still navigating that ocean of projects, and –
You haven’t finished yet, right? You’re still figuring out… [laughs]
Yes, yes… I think there’s two projects that I’ve been more interested in learning and using recently. One is around container acceleration, so speeding up container image pulls and starts. That’s pretty interesting. There are some cool open source projects out there in terms of doing lazy image pulling, and so on. So sort of experimenting with those now. That’s been quite fun. And just very recently, my manager and I were just talking about it, and he also suggested looking into like WASM, WASI projects in that space. So I’m also just sort of dabbling in that and trying to learn more about it. So those are things I’m going to do going forward.
For your listeners, WASM is WebAssembly.
Wasm.dev, yeah. We have mentioned it a couple of times. I find it very interesting.
Yeah. But I haven’t dug into it myself. Why do you think that is important? Because I keep hearing about it, but I’m still missing, like, why is it the important part?
Yeah, I think there’s a couple of different reasons why it’s important. I have some folks on my team who have been – Microsoft is a member of the Bytecode Alliance, which is the foundation that’s coming up around WASM on the server side, and the WASI spec, the WebAssembly System Interface.
I think it’s important because one of the things that we hear from our customers is that they want cloud to edge coherency and consistency. They want to be able to take a machine learning model and learn it in the cloud, but bring it down to a $2 IoT device, a microcontroller class device. And prior to WASM, really the only way to do that probably would be to write C or C++. And C and C++ have all kinds of issues in terms of – practically, every language that has come afterwards is a reflection of some of the challenges of C and C++.
So I think WASM is interesting because it has that ability to move from cloud-based workloads all the way down to edge-based workloads. It’s a pretty language-independent sandbox, where you can target it from Python, or the .NET team is doing a bunch of work to target it from .NET, you can target it from Go, you can target it from Rust, and obviously C and C++ also.
And then I think the other aspect of it that is really interesting to me is that it represents an opportunity to rethink what is the minimal systems interface for an application workload. If you look at containers, they just take whatever the kernel gives them. And there’s security features like capabilities and things like that to kind of opt yourself in and out of, but it still sees files as streams of bytes. And it doesn’t understand things like Event Grid, that Ganesh was mentioning earlier. And I think there’s an opportunity to say, actually, in terms of security and in terms of cloud applications, maybe we want a different system abstraction layer, and we want to code against a different systems interface than an operating system that frankly is based on 40-year-old concepts.
[01:06:08.11] Yeah. That’s interesting. Yeah, yeah.
And so I think there’s something there. I mean, I think it’s still very early days and experimental, and we talked a little bit about productivity, and toolchain, and the comprehensive things that Microsoft can do… The developer experience for WebAssembly right now is pretty primitive. And unless you’re targeting it from a couple of different languages, it reminds me of Linux in the late ‘90s, when it was like, “Good luck. Here’s a bunch of docs, and good luck.”
So I think Microsoft has an opportunity to not just look at it from a cloud perspective, but also look at it from a tooling perspective, and make it easier for people who just want to write Python code and have it run in a WebAssembly sandbox, for example.
I think there’s one other attribute that I want to add about the culture of open source within Brendan’s org, and also in AKS. I think AKS uses so many open source projects internally, and when they find issues, team members do file bugs, make fixes, and so on… And I’ve observed that that is part of the culture in AKS, and also overall in Brendan’s org, too. And I think that is something amazing.
I did not really expect that Microsoft would be so open to open source, and encourage so many engineers to actually contribute to open source… And even when I see projects internally, I think there is a question of, “Can we start this as an open source project? How do we contribute and make this open source?” And even when you make decisions about what technologies to use, I’ve observed that there is this bias towards using open source tools and technologies… And I think that’s something pretty cool to see, and what I had not expected initially. It’s a very open culture.
That almost sounds like a key takeaway, because as we were preparing to wrap up, I was going to ask about your key takeaways. So you can still think about yours, Ganesh, but I’m thinking about you, Brendan, as we prepare to wrap up… What do you think is the key takeaway for the listener that stuck with us all the way to the end?
Yeah, I think a couple of different – going back to the beginning, the discussion about organizations, I think that my key takeaway anyway is that as you lead organizations, ensuring that you have a sense of what’s going on in each of your teams, and you provide opportunities for personal contacts and influence, and helping answer questions for people throughout the organization is just critical to building a healthy organization culture. People will surface things in one on ones that otherwise just don’t percolate up to you. And you have an opportunity as well, when people come in, to hear their perspective, hear that fresh perspective, and to influence what they think is important in the organization. And that’s critical for setting culture.
[01:09:01.00] And then I think on the technology side, I think the takeaway would be… And it’s sort of similar, I guess, at some level, that community and ecosystem and building healthy communities and building healthy ecosystems - that’s the key to success. Technology is a contributor to that, but lots of really great technologies have failed because they failed to create an ecosystem. So those would probably be my two takeaways.
Okay. Ganesh, you have one more. Brendan too, you have one more left. Go on.
I get one more?
Yeah, well Brendan had two. So he had people, and he had technology. Ganesh, you mentioned the open source… I’m not sure where that fits in. I’ll let you decide. But I think one more would be perfect. One more takeaway.
I see… A lot of pressure here. [laughs] I think for me as well, if it’s okay, I will also structure it in the people and technology side, and you can decide whether to add it in the podcast or not… But I think one takeaway from the people side is - I think being in organizations like AKS, and Microsoft, where you can learn from others and learn about various technologies is very helpful, especially like early in career… And being proactive in terms of learning is something that I feel has been helpful so far. And there’s a lot more for me to do as well. So that’s sort of one takeaway for me personally in my time at Microsoft.
And then the technology side - I think the overall cloud-native community is quite helpful in terms of learning about various technologies, and I think this space is an exciting place to learn about various tools to improve efficiency and productivity. I see both the managed services that Microsoft provides, and the open source technologies as ways to improve productivity for the entire tech industry.
Those are good ones, so thank you very much for that. My key takeaway is… 18:59. Let’s see if I can beat that first. And then we’ll compare ages, and let’s see what is the actual time that I have to beat to beat Brendan, at my age. That’s what I’m thinking. So thank you for that. That was a very good one, Brendan.
Okay. Well, I had a great time with both of you today. Thank you very much. I wasn’t expecting this, but you made me curious about AKS. And that’s something which I want to check out. Like, it’s been many years since I used Azure. I don’t think AKS was a thing when I last used Azure, that’s how long it’s been… And I think I want to check it out. So thank you both for inspiring me to do that. It was a great pleasure having you here, and I’m looking forward to next time. Thank you.
It’s awesome. Thank you so much.
It was good to have us. Thank you.
Our transcripts are open source on GitHub. Improvements are welcome. 💚