Ship It! – Episode #102

Managing Meta's millions of machines

with Anita Zhang, engineerd managerd at Meta


Anita Zhang is here to tell us how Meta manages millions of bare metal Linux hosts and containers. We also discuss the Twine white paper and how AI is changing their requirements.


Sponsors

FireHydrant – The alerting and on-call tool designed for humans, not systems. Signals puts teams at the center, giving you ultimate control over rules, policies, and schedules. No need to configure your services or do wonky workarounds. Signals filters out the noise, alerting you only on what matters. Manage coverage requests and on-call notifications effortlessly within Slack. But here’s the game-changer… Signals natively integrates with FireHydrant’s full incident management suite, so as soon as you’re alerted you can seamlessly kick off and manage your entire incident inside a single platform. Learn more or switch today at firehydrant.com/signals

Sentry – Code breaks, fix it faster. Don’t just observe. Take action. Sentry is the only app monitoring platform built for developers that gets to the root cause for every issue. 90,000+ growing teams use Sentry to find problems fast. Use the code CHANGELOG when you sign up to get $100 OFF the team plan.


Chapters

1 00:00 This is Ship It!
2 00:52 Sponsor: FireHydrant
3 03:15 The opener
4 16:28 Welcome Anita Zhang
5 17:19 Meta's infrastructure
6 18:34 Provisioning OS
7 20:00 Fedora ELN & CentOS stream
8 21:13 In-house automation
9 22:54 What is Twshared?
10 24:44 JournalD inside a container
11 25:47 Host profiles
12 27:23 Coolest sweatshirt ever
13 28:01 Meta & open source
14 29:35 Frequent releases and 1M hosts?!?
15 30:48 Meta's AI fleet
16 31:43 Production engineer vs software engineer
17 32:34 Other internal services
18 35:05 OS challenges
19 36:07 One size fits all?
20 37:20 Meta's AI adoption
21 38:09 Cost optimization
22 40:07 Lots of abstraction
23 41:39 Upcoming projects?
24 43:55 Immutable file system
25 45:36 Thanks for joining us!
26 48:37 Sponsor: Sentry
27 52:34 The closer
28 52:48 Faux or Fo Sho?
29 1:02:04 Outro

Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

Hello, and welcome to Ship It. I’m your host, Justin Garrison, and with me as always is Autumn Nash. How’s it going, Autumn?

Today on the show we have a fascinating topic from Anita Zhang of Meta, who has been there for a little while and done a couple of different really interesting jobs. But right now we are talking mostly about the container scheduler they have, and just managing Linux servers at scale. She’s an engineerd, managerd on the Linux team.

Okay, like how cool is that job title, though? Like, you’ve reached new levels when you can have a cool job title that has Systemd in your job title. Also, her sweatshirt game - unmatched.

I know. Yeah, we saw her, she was at Scale, she gave a talk there, and I reached out and was like “Hey, I’ve read the Twine paper. Do you wanna come on the show? I’d love to talk about how you manage millions of machines. Physical hosts, rolling out Linux.” This is just fun to talk about, because I’m a nerd.

It’s amazing. Well, not just that, but just the fact that they do it all on prem, and they do all of it themselves. And then they are awesome enough to open source a bunch of their cool stuff once they get it working on all of these millions of things, and then they open source it so other people can use it, which is rad.

They work mostly upstream, yeah. It’s like they’re working upstream with the Fedora and CentOS and these communities… And while Twine, their container scheduler isn’t open source, a lot of the other components they have too just –

They contribute a lot to upstream, and I think that is super-rad, when corporations actually make the effort to be good stewards of open source, and actually make real contributions back. That’s really cool.

I mean, of all the things in 2024 that are going away, it seems that open source contributions and diversity are the –

Are the important stuff. [laughs]

…the things that might help long term, and… Wow, it’s been interesting.

You know, like 90% of infrastructure is open source, but let’s totally shoot ourselves in the foot in the future. No big deal.

Yeah. So, great conversation, great conversation today. But as always, we want to talk about a few links, or a couple of links, things we’ve found interesting, that we read, or in my case, I listened to something and I’m going to share a different podcast.

[unintelligible 00:05:26.16]

I listen to a lot of podcasts… And I don’t know if this is allowed on a podcast, to talk about someone else’s… But I’m gonna do it anyway, because I thought it was really cool. And specifically, this was on The Verge’s podcast, or the one that Nilay Patel does, called Decoder. That’s what this one’s called; not the VergeCast. Decoder is the podcast, and this one is about – it’s an interview with the CEO of Dropbox, because that is a fascinating story about how he talks about remote work.

And one of the things that I latched onto in this was he does what Amazon does with document-driven meetings, which I’ve found very fascinating working at Amazon, just the whole, like, you write a big document, and then everyone reads it quietly at the beginning of the meeting. And that is what you do. And if you’re not at Amazon, that sounds completely foreign, and I always thought that that was kind of one of the seeds of one of the best asynchronous and remote work possibilities available.

It was like, you could do a lot with like “Hey, we’re gonna do a lot of thinking upfront to write this document, make it coherent, and make it a little too structured for Amazon’s likings…” Like, some of that could be more flexible, but keeping it structured in a way that makes it easier to consume, so people know what to expect. If you’re going into a one-pager, [unintelligible 00:06:37.16] a six-pager, all of those types of meetings have very structured requirements going into it. And in this podcast, Drew mentions that Dropbox adopted document-driven meetings around COVID time, when they said “Hey, we’re gonna go all-in on being remote.” And I’ve found it fascinating, just the conversation they had, and –

I didn’t know Dropbox was remote. They have been since apparently 2020-ish… 2021. I think it was in the 2020s when they announced fully – like, “Hey, we’re just going fully remote.” And they definitely leaned into this document-driven culture of like “We write docs to share ideas. There’s no slide decks.” And I thought it was cool… As well as Dropbox being an on-prem company. They are notorious for – they removed themselves from the cloud environments, and shared how much money they saved.

So Dropbox, I’m sure – like, any large company is going to be hybrid, but where’s the majority of their spend? And they are one of the ones that the majority of their spend is going to be on-prem, in their servers that they manage, but still being a remote company. Those two things don’t seem to line up in my head, where it’s like, every datacenter I’ve worked in was an on-site, on-prem environment.

I think your techs and your engineers are different though, right? Like, your engineers and your corporate people versus the techs and the people that are actually doing your hardware are two different components. Because a lot of companies are technically remote, but – well, not just that, but I think it’s also interesting, a lot of people seem to be owning hardware in data centers, but they’re not actually managing their data centers. Not Dropbox specifically, but just talking to like people in industry lately, it seems like the new thing is… It’s somewhere in between owning your own data center, and cloud, but instead, they own their hardware, but they just have it in another data center maintained by other – like, by techs overall, which is like interesting; it’s almost a new offering.

Well, and colos have done that for a while; it’s just not been as visible.

Yeah. But startups are doing it now, which is wild, because that was the bread and butter that cloud was really going after, was “Hey, if you’re a startup, you can start here… So in case you don’t know how many servers you need…” It’s that low investment upfront, because you’re just getting started. That was cloud’s bread and butter. And now the fact that they’re utilizing this new way of going about it where you’re on prem, but not your own on prem; I wonder if that will steal a lot of startups from cloud.

At the very least, knowing that it exists, that you don’t have to go to a data center yourself. Like, you could ship a box somewhere, they unbox it, they rack it, they stack it, they plug it in, and you get an IP address. And it’s a slower process than calling an API and getting a machine, but long term it’s more financially viable, like, “Hey, if I know I need this for three, four or five years…”

That’s what I’m saying, it’s a lot cheaper… So it’s like you’re getting to the point where you are making a very viable other option to cloud, you know?

Yeah. So I thought it was a really interesting interview, specifically around the remote work and document-style meetings, as well as being an on-prem environment. Cool listen. If anyone else is listening to Decoder or The VergeCast podcasts, go check it out.

I also think that that shows you how much – like the culture of people moving jobs every so many years. People take things they learned at their previous job, that made their meetings more efficient… And things that are good from companies will transfer to other companies. People will take what they like and what really made an impact, and take it to the new place they’re going to.

[10:13] Yeah, every environment is going to be different. I know plenty of people that won’t work with people from certain companies, because they try to force it too much. They’re like “Oh, well, at my last company we always did this, so we have to do it here.” I’m like “This is different.”

There’s always somebody on [unintelligible 00:10:26.01] who’s like “Don’t hire these people, because they’ll make you do this…”

There’s pros and cons of it, and at least in the interview Drew points out that they adapted it; they’re not doing Amazon’s strict six-pagers, but they’re like “Hey, we have documents, and we read the documents, and that’s where we share the data.” And it’s not up to a good presenter, or a slide deck, or something like that.

Which makes it really efficient, and I think you have less pressure to take notes too, when you know that you have this document that you can take with you after that meeting, and be able to look back… Or when you have to talk to someone else about the meeting that you were just in, you have a good basis of what you spoke about in that meeting.

If only Amazon had good document search… [laughs] My biggest gripe there is it’s in 10 different systems, and you always had to find the PM who did it… HashiCorp had a really good document search engine on top of Google Docs, which is where they did all their documents.

Did you work for Hashi?

No, no. They open sourced it.

Oh. It’s funny though, because it seems like every big company – like, there was a thread on Twitter the other day, and people were like “The thing I have missed most about (I think) working at Google was Scanner”, when everybody said that they hated Scanner… But like a bunch of people that left said that that’s what they’ve missed… And I always see the things people say they miss when they leave different big companies… And it’s funny, the things that you’re like “This is horrible”, but then you miss it when you’re gone later.

How about you? What’s your link?

My link is the Twine paper from Anita’s team… Which I thought was just amazing, the way that they’re able to scale that many Linux hosts, and the fact that it is on prem, but – the way that she was speaking about updating and stuff… And just - when you read the actual white paper, and see all the tools that they’ve created, and how some of them are open sourced to make it easier… But they built so much of it in house. And they use really minimal, almost – like, they use a lot of things that just come with Linux distributions, which I thought was really interesting, but they also have built their own tools… But it just seems like they kept a really amazing level of simplicity to make it not complicated. I just wonder if it’s – I won’t say easier, but if it helps them to onboard people when you get to Meta… Because a lot of their stuff that comes from Linux, like Systemd and that kind of stuff… That’s awesome, because if you’ve used that at your other jobs, you’re not onboarding to some super-complicated system.

It’s interesting too, because complexity at some point becomes the standard. And when you think of – I always think of cars. Cars are extremely complex systems, and they’re failing all of the time. You have multiple misfires in your engine all of the time, and you don’t ever know about it, but anyone can be taught to drive it. And the pedals and the steering wheel, and some blinkers, if anyone uses them, are pretty standard interfaces, right? And those are things that we can use reliably, and anyone can jump into any car, and it’s like “Oh, I know where most of this exists.” And something like Kubernetes is extremely complex, but it’s like a standard interface, too. Like, “Oh, I know how this works”, because I can jump from one Kubernetes cluster to another, and say like “Oh, I know how this complexity is going to work in some regard. There’s gonna be details and things that I don’t understand”, but that just kind of becomes the easy layer of it.

When I was at Disney Plus, we were building everything on ECS, and we could never find any ECS experts. Like, there was no one out there that knew ECS, unless you worked at Amazon. I’m like “We’re not hiring Amazon people, so how do we get ECS experts to come?” So we hired a bunch of Kubernetes experts. Everyone on the team was a Kubernetes expert, and we’re like “Hey, if you squint, this is kind of Kubernetes.” But there was all these edge cases and things that we had to build around, because it’s like “Oh, it’s not Kubernetes, and we know that we’re not driving a car, we’re riding a bicycle”, or something. It was a different paradigm of how we had to treat the system.

[14:15] But I think that’s almost like a secret sauce. Like, they worked really hard on implementing as simple as possible, while also making tools, but they implemented their own tools when it was necessary to truly make it easier, but they kept a lot of it as simple as possible…

And the fact that they had different goals. It was trying to have a single system that manages a million machines…

Well, and just different kinds of machines, and different workloads… And I think that when you can really strike the balance between simplicity and using standard things and open source things that people have a lot of experience with, so you can onboard and find those people, that you can bring to that team to help you to scale, but at the same time creating enough well-made internal tools to help you to be able to differentiate into the scale up aspect also, because it’s a million hosts; that’s a lot. I think that just having that secret sauce and figuring that out just can make your organization amazing.

Well, and I also have to think that working upstream primarily helps with that recruiting, right? It’s like [unintelligible 00:15:17.16]

Yes. I think that is really underrated, because when you’re contributing upstream, for one, you build better relationships, right? It’s easier to recruit. You’re keeping a lot of what you’re using simple, because you can also – so you can find people that know how to do that job, just like you were saying with the ECS thing. I wish people would see that value, because then we also get more people in the open source community, and this becomes this great cycle of like – it’s those beneficial relationships, you know?

And I still would love to see some white papers from them about the things that aren’t open source, or talked about right now… Like, how they roll out Linux updates, right?

She said some of those take a year, but then they made some of it where they were updating so frequently… It just seems like such a very well-oiled machine, and it was just fascinating. And their white paper is fascinating. I love white papers and post mortems, because you learn so much.

Well, in that case, you’re gonna love the outro for today as well… And we’re gonna just leave people on that, and we’re gonna cut over to the interview with Anita, and talk to you again after.

Break: [16:28]

So today on the show we have Anita Zhang from Meta. Anita, you are an engineerd, managerd – that is your title. Is that correct?

I think that’s fabulous, as a Linux user and a longtime restarter of services. Tell us about what you’re responsible for at Meta.

Well, I support a team that basically – well, my team supports Meta’s Linux distribution. I like to call it operating systems. It sounds better. But we primarily contribute to Systemd, to eBPF-related projects, building out some of the common components at the OS layer that other infrastructure services build on top of.

So you’re the kernel of Meta’s infrastructure.

We have like an actual kernel team to do the kernel, but… One layer up, I guess.

[laughs] One layer above that. So describe the infrastructure, describe the sources. I have been following what Facebook and Meta have been doing for a long time as a Red Hat user at other places, and seeing the upstream contributions… But I know many people listening to this podcast may not know what that infrastructure looks like, and what you actually do.

Yeah, I mean, we’ve been around a while. We personally – the company owns millions of hosts at this point. A mix of compute, storage, and now the AI fleet. Teams primarily work out of a shared pool. So we have a pool of machines called Twshared, where all of the container jobs run. There are a few services that run in like their own set of host prefixes, but for the most part the largest pool is Twshared. A lot of our infrastructure to support this scale is homegrown.

[18:09] I don’t know anything off the shelf that’s gonna do a million hosts.

Yeah… Me neither.

That’s amazing. So Meta has their own flavor of Linux, I guess?

No, we actually use CentOS for production, all of our production hosts, and even inside the containers we’re using CentOS. Desktops are primarily some flavor of Fedora, Windows or macOS.

And what does that look like for what you’re doing at the fleet level? You’re provisioning the OS, or have some tooling to provision the OS, and from talks that you’ve given that I’ve watched - you had a great talk at Scale, by the way… If anyone wants to see that talk, it’s on the Scale website. But you’re doing upgrades. If I want to upgrade a million hosts, I’m like “Hey, I need to roll out a new version of the operating system”, that’s gonna take a little while. There’s a lot of processes and there’s a lot of risks there, right? Because you could be causing other things to fail. So how do you do that in a safe way, and at that size?

You know, we’ve gotten a lot better at it over the years. When I started, we were doing like CentOS 6 to 7, and I think that probably took like a year or two to actually reach over like 99% of the fleet. And there’s always that trailing 1% that for some reason they can’t shut down their services, or they don’t want to drain [unintelligible 00:19:23.09] traffic, and things like that. But now we’re able to complete I’d say like 99% of the fleet in a year or less. We started doing a lot of validations sooner, so now we actually hook Fedora ELN into our testing pipeline, and we start deploying parts of Fedora ELN and running our internal container tests against them. And so [unintelligible 00:19:49.01] a few system-wide distribution changes that will be ready once CentOS – I guess now CentOS Stream 10 is going to be released later this year.

Describe Fedora ELN. Why is that different than what you’re running?

So Fedora ELN is – man, I don’t know what exactly it stands for. Fedora-something Next. So it’s going to be like the next release of Fedora that will eventually feed into things like CentOS Stream.

Basically like the rawhide equivalent of like “Hey, this is a rolling kind of new thing”, but eventually that gets cut down. How does that relate? Or I’m actually really curious - CentOS Stream, when they moved to this rolling release style of distribution, how did that affect how you’re doing those releases and doing upgrades for those hosts? Because you have to at some point say “This is the thing we’re rolling out”, but the OS keeps going.

Yeah. I’d say the change to Stream didn’t really affect us much, because we were already kind of doing rolling OS updates inside the fleet. So when new point releases get released, we have a system that syncs it to our internal repos, and then updates the repositories. And then we have Chef running to actually pick up the new packages and things, and just updates depending on what’s in those repositories. So the change to Stream didn’t really change that model at all. We’re still doing that picking up new packages on like a two-week cadence.
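To make the mechanics a bit more concrete for readers: the usual pattern for this kind of rolling update is to point dnf at internal mirrors that are synced from upstream, and let configuration management apply whatever the current snapshot contains. The snippet below is a hedged, hypothetical sketch of such a repo definition – the repo name and mirror URL are invented, and this is plain dnf syntax, not Meta’s actual tooling.

```ini
# Hypothetical /etc/yum.repos.d/internal-baseos.repo (not Meta's real config)
# Hosts pull packages from an internal mirror synced from CentOS Stream;
# config management moves the snapshot pointer on a regular cadence.
[internal-baseos]
name=Internal CentOS Stream BaseOS mirror (snapshot)
baseurl=https://mirror.internal.example.com/centos-stream/9-stream/BaseOS/x86_64/os/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
```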

Do you guys use a lot of automation that you build in-house?

Yeah. We kind of have to.

The repo syncing - I had a project at an animation studio where we had [unintelligible 00:21:19.07] that we would sync all the repos internally, it all sits on NFS, and then we mount everything to NFS to pull in repos… And I forget, it was like a Jenkins tree of syncing jobs that would all run to like register a system, and pull down 300 or something repos that we would sync every night, and like “Okay, let’s fetch all the files now.” And then squirrel those away somewhere on a drive, and then host them, so that everyone else can sync to it, and then have it like roll out to the testing fleet. It’s a lot of data, and it’s a lot of stuff that you just have to – as packages get removed from upstream, and you’re using them in places, I’m assuming you have some isolation there, because as far as I know, most of your workloads are containerized on Twine, on Twshared as like the base infrastructure, right?

[22:09] Yup. So containers, they don’t get the live updates that the bare metal hosts get. So users can just define their jobs in a spec, and for like the lifetime of the job, the packages and things that go into it don’t change. I mean, there are certificates that also are used to identify the job; those get renewed… But we have a big push to get every job updated at least every 90 days. Most jobs update more frequently than that.

Is that an update for like the base container layer, or whatever they’re building on top of?

Yeah, they’ll actually have to shut down their job and restart it on like a fresh container, and they’ll pick up any new changes to the images, or any changes to the packages that have happened in that time.

Can you describe Twshared for the audience as well? …because that’s one of the things that I think is really fascinating, that you have your own container scheduler, and as far as I know, all those containers are running directly with Systemd. You’re not having like a shim of like an agent… I mean, you have agents, but go ahead and describe it.

So I used to work on the containers team, the part that’s actually on the hosts. The whole Twine, our team, consists of like the scheduler, and there’s like resource allocation teams to figure out which hosts we can actually use, how to allocate them between the teams that need them… And then on the actual container side, we have something called the agent, that actually talks directly to the scheduler and translates the user specification into the actual code that needs to get run on the hosts. And that agent sets up a bunch of namespaces and starts Systemd, and basically just gets the job started.

And that’s Systemd inside the container?

Yeah. So the bulk of the work that is done in the agent, at least where the Systemd setup is concerned, is that it translates the spec into Systemd units that get run in the container. So if there are commands that need to run before the main job, those get translated to different units, and then the main job is in like its own unit as well. And then there’s a bunch of different configuration to make sure the kill behavior for the container is the way we expect, and things like that.
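For readers who haven’t seen a spec-to-unit translation like this, here’s a minimal, hypothetical sketch in plain upstream systemd syntax – the unit names, paths, and commands are invented for illustration and are not Meta’s generated output. The idea is that a setup command from the spec becomes its own unit the main job depends on, and stop/kill behavior is pinned down explicitly.

```ini
# Hypothetical generated units (illustrative only; in practice these would be
# two separate unit files running under the container's own systemd).

# my-job-setup.service - a "run this before the main task" step from the spec
[Unit]
Description=Setup step for my-job

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/packages/my-job/bin/prepare --fetch-config

# my-job.service - the main task from the spec
[Unit]
Description=Main task for my-job
Requires=my-job-setup.service
After=my-job-setup.service

[Service]
ExecStart=/packages/my-job/bin/server --port 8080
Restart=on-failure
# Explicit teardown behavior so the scheduler gets predictable kills
KillMode=mixed
TimeoutStopSec=30
```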

There’s a sidecar for the logs specifically. So logs are pretty important, as you’d imagine, to users being able to debug their jobs… There is a separate service that runs alongside the container to actually make sure that no logs get lost. So those logs get preserved in the hosts somewhere.

Twine sounds really cool, too. I was reading the white paper about that yesterday…

How does that work with like the sidecar? I would assume - I’ve never really actually done this side of it… Like, Systemd inside the container, running on Systemd… So if I log into a host, and not the container, I see just services all the way down, right? They just look like standard Systemd units, they’re just isolated from each other, right?

Yeah. So the container job, it will be like one Systemd unit, and you’ll see a bunch of processes in it, and you’ll also see a couple of agents that we run, but mostly just the usual Systemd PID 1 inside the container, and their own instance of JournalD, Logind, and all that stuff.

And that was the question I actually had, is like, I assumed that Journald would handle the unit logging. But you said there’s a sidecar that I’m assuming is like getting those logs out to Journald on the host, or at least some way so that you don’t lose those logs inside the container.

That’s cool. At that point, it’s just native Systemd, really? Like, you’re just using every feature of Systemd to isolate and run those jobs… And then you have an overarching scheduler, resource allocator, all that stuff.

Yeah, pretty much.

[25:47] One of the things that I’ve found super-interesting in the white paper was host profiles, where different workloads - you basically like virtually allocate clusters, I guess, for a lack of better… Entitlements is what you call them, for like “Hey, this job gets this set of hosts”, and then you can dynamically switch those hosts to needing different kernel parameters, file systems, huge pages, and you have a resource allocator that does that, as far as I understood… How does that affect what you’re doing? You have a set of host profiles, you say “Hey, you can pick from a menu”, and then we know how to switch between them? How does that typically work?

So that part’s a little newer than the time I was in containers. So you create a host profile, you work with the host management team to do that, and then you can, I believe, specify it in your job spec. And then when you need to either restart your job, or move the job around, they actually have to drain the hosts. Most host profiles require a host restart, because things like huge pages - you need to restart the hosts to apply… And then the job gets started back up on the host with the host profile you’re asking for.
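A quick illustration of why a profile change can force a drain and reboot: 1 GiB huge pages generally have to be reserved on the kernel command line at boot, while smaller default-size huge pages can be tuned live via sysctl. The values below are invented, and this is generic Linux configuration rather than Meta’s host profile format.

```ini
# Boot-time reservation (kernel command line) - requires a host restart:
#   default_hugepagesz=1G hugepagesz=1G hugepages=64

# Runtime tuning of default-size (typically 2 MiB) huge pages, e.g. via a
# hypothetical /etc/sysctl.d/90-hugepages.conf drop-in:
vm.nr_hugepages = 1024
```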

How does that affect you as the OS team? Is there anything that you’re doing specifically for that?

Not specifically, but they do – so the host agent actually builds a lot of their components on top of Systemd as well. So they’ve been doing things like moving more configuration out of Chef into host agent, where it’s more predictable… So things like the Systemd Networkd configs, or the sysctl configs, also go through Systemd as well.
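For context on the file formats involved: systemd-networkd and systemd-sysctl both read static drop-in files, which is part of what makes this more deterministic than templated Chef runs. Here’s a hedged sketch with invented interface names and values, not Meta’s actual configuration.

```ini
# Hypothetical /etc/systemd/network/10-prod.network (read by systemd-networkd)
[Match]
Name=eth0

[Network]
DHCP=ipv6
LLDP=yes

# Hypothetical /etc/sysctl.d/90-fleet.conf (applied by systemd-sysctl at boot)
net.core.somaxconn = 4096
fs.file-max = 2097152
```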

Is that a Linux penguin on your sweatshirt? Because that’s the coolest sweatshirt I’ve ever seen if it is.

Oh, yeah. The [unintelligible 00:27:28.11] hoodies… Yes, this is the one that Justin was talking about.

That is so cool.

Yeah, they had them at Scale, and I was very jealous, because they’re cool. And this is an audio podcast, so no one knows what we’re talking about… But basically, there’s a bunch of little, small tuxes inside the hood of the hoodie.

If anyone from Scale is listening, they probably have the hoodie.

I’m sad that I missed your talk at Scale. It was on my schedule, and then I think – I forget what we were doing, but somehow I ended up somewhere else, and I’m super-sad to miss your talk. Do you get to contribute a lot to open source? Because Meta seems really big on contributing, or releasing things for free, I guess.

Yeah… I’d say at least the way the kernel team and our team operates is that we’re mostly upstream first. So everything that we write, we write it with the idea that we’re gonna be upstreaming it. And that’s how we managed to keep our team size small, so that we don’t have to maintain like a bunch of backports, things like that.

But you have to wait for it though, right? You’re like “We’re gonna write this internally, we’re gonna hope this gets upstreamed, and then we have to either wait for the release to consume it. Or we’re just going to keep running it”, but then if upstream needs changes, you have to kind of like merge back to it.

Yeah. So the kernel - we actually build and maintain internally, so we can kind of pull from the release whenever we want. And we can kind of do the same thing with CentOS too, because we also contribute to the CentOS Hyperscale SIG. Any bleeding edge packages that we want to release immediately go into the Hyperscale SIG.

It’s really cool that you guys contribute to upstream first, but also kind of maintain your own stuff, so that way you can kind of pick and choose if you want to put something – you know, it’s like a bug fix that you need earlier, you can already apply that.

I mean, I’d say Meta is super-into release frequently… And so if we always stick to like upstream, then we’ll always get like the newest stuff, and we’re less likely to run into some obscure bug from like two years ago, that was really hard to debug.

How do releasing frequently and a million hosts go together? Because you mentioned that it takes about a year to basically roll out an update to every host? But if you’re pushing out updates to the OS every month, then you have like 12 different stages of things that are going through release, and that makes it really hard to debug… Like, “Oh, what version are you on? Did we fix that bug somewhere else?” How do you manage that?

[29:58] Yeah, so it’s mainly the major upgrades that take like up to a year. And we’re about to go from CentOS Stream 9 to 10. That will probably take a longer time than if we were just doing our rolling OS upgrades. So the thing about CentOS is that we do maintain kind of like ABI boundaries. So we expect that the changes that Red Hat and CentOS are making to packages are mostly like bug fixes that won’t break compatibility in the program… And that’s remained true. We haven’t run into a lot of major issues with rolling OS upgrades. Most issues come from like when we personally are trying to pull in like the latest version of Systemd or something, and we’re rolling that out. Those we have to do with more intention.

You mentioned an AI fleet… From what I’ve heard Zuckerberg say, Meta has more GPUs than anyone else in the world, basically. How do you manage that? Not only how are the drivers installed, because Linux and Nvidia aren’t always known to be the best friends, but then how do you like isolate those things, and roll out those changes?

Yeah, I’m probably not like the best person to ask about it, but we do have a pretty sizable team now of production engineers dedicated to supporting the AI fleet and making sure that it’s stable, and that our training jobs don’t crash, and things like that…

Under Twshared, do they just show up as a host profile? Or do I get an entitlement that says “I need GPUs for this type of workload”?

It’s more like the latter. So even though everything’s in Twshared, we know what kind of machine type they are. So you can specify what purpose you’re using the machine for, and things like that.

What’s the difference between a production engineer and a system engineer?

Well, I’m a software engineer technically, I guess…

The title? [laughs]

So a software engineer, then there’s a production engineer, and a system engineer…

There are a lot of titles…

I know…

I’d say production engineer and software engineer are the most similar. Especially in infrastructure, when I was in the containers team, the production engineers and software engineers pretty much all just did the same stuff. We were all just focused on scaling, and making the system more reliable. I’d say in like a product team, production engineers focus more on operationalizing and making the service production ready, while the software engineer is kind of like creating new features, and things like that.

Okay, that’s interesting.

One thing I’ve found fascinating about some of the talks you’ve given, and the information that’s out there, is the fact that Meta is still notably an on-prem company. You have your own data centers, you have your own regions, you have machines… And it doesn’t seem like you try to hide that from people. You don’t try to abstract it away. At least I haven’t ever seen a reference to like “It’s our internal cloud.” No, it’s like a pool of machines, and people run stuff on the machines. And the software and the applications running on top of it are very much like a – this is just like a Systemd unit; you’re just running it containerized.

What other types of services do you have internally that people need? I mean, I saw references to things like sharding for like “Hey, we need just fast disk places, and we need some storage and databases externally.” But what are the pieces that you find that are like common infrastructure for people to use?

Yeah, I mean, I’d probably dispute the fact that people have to understand kind of like the internals of how the hosts and things are laid out… So the majority of services - we’re talking like millions of hosts in Twshared - they are running containers. And I’d say a lot of their knowledge about the infrastructure probably stops at when they write the [unintelligible 00:33:43.27] and to the point where they go into the UI and look at the logs.

So if you’re just writing like a service, a lot of that’s abstracted away from you. You don’t even have to handle load balancing, and stuff. There’s a whole separate team that deals with that as well.

That’s awesome.

[34:00] Yeah. But if you’re on the infrastructure side, sometimes you need to maintain those widely-distributed binaries on the bare metal hosts. So us running Systemd, or the team [unintelligible 00:34:07.21] that does the load balancing, they also run a widely-distributed binary across the fleet on bare metal. There’s also another service that does specifically fetching packages, or shipping out configuration files, and things like that. But yeah, most of the services people write, they’re running in containers. Databases - they have kind of their own separate thing going on as well. Most of them are moving more into Twshared as well, but they have more specific requirements related to draining the hosts and making sure there’s no data loss.

Right. All those shards… Making sure that enough of the data replicas are available.

Yeah. But they’re like one of those teams that - they just want their own set of like bare metal hosts as well, to do their own thing with. They don’t care about running things in a container if they don’t have to.

Yeah, typical DBAs. [laughs] What would you say are some of the challenges you’re facing right now on the OS team, or just in general on the infrastructure?

The AI fleet’s always a challenge, I guess. Making sure jobs stay running for that long. I think every site event is kind of an opportunity to see where we can make our infrastructure more stable, adding more validation in places, and things like that. Just removing some of the clowniness that people who have been here a long time have kind of gotten used to.

And you mentioned that as far as like moving more things out of something – traditional configuration management like Chef, and moving it into more of like a host-native binary that can manage things, I will say more flexibly… And I guess more predictably. I think you’ve mentioned that, where it’s just like “Yeah–”

Yeah, making things more deterministic, removing cases where teams that don’t need to have their own hosts, shifting them in Twshared, so that they’re on more common infrastructure… Adding more safeguards in place, so that we can roll things out live, and stuff like that…

You also mentioned in the – again, referencing the paper, because I’ve just recently read it… All of your hosts are the same size, right? It’s all one CPU socket, and I think it was like 64 gigs of RAM, or something like that.

Yeah, that’s probably not true anymore. But yeah, the majority of our compute fleet looks like that, yeah.

Okay, so the majority of Twshared is like “We have one size”, and you’re just like “Everyone fit into this one size, and we will see how we can make that work”, right? Because you can control the workloads, or at least help them optimize in certain ways… Because not all AI jobs or big data jobs are going to fit inside of that envelope.

Especially with databases and AI.

Yeah. And we’re trying to shift to a model now where we have bigger compute hosts, so that we can run more jobs side by side, stacking… Because realistically, one service isn’t going to be able to scale to like all the resources on the hosts forever… So yeah, we’re getting into stacking. Yeah.

So yeah, it’s more like a bin packing approach and saying like “Hey, maybe we do have some large hosts”, I’m assuming especially for the jobs that do need like “Hey, I don’t fit in [unintelligible 00:37:11.02] of RAM”, and a local NVMe isn’t fast enough for whatever reason, or is going to cause the job to run longer.

Do you think AI is going to change the way that Meta does infrastructure, because you’re adapting to the change in how much bigger the hosts you need, and how much more GPUs, and all that kind of stuff?

Oh, I mean, even in like the past year, we’ve made a few notable infrastructure shifts to support the AI fleet. And it’s not even just like the different resources on the host, but all of the different components, a lot of them have additional network cards, managing how the accelerators work, and how to make sure they’re healthy, and things like that.

Yeah, I suppose once you have any sort of specialized compute or interface, whether that’s network, some fabric adapters, you always have snowflakes in some way, where it’s like “Hey, this is different than the general compute stuff.”

[38:07] Oh, yeah, for sure.

How has that affected your global optimization around things? And I know - again, the paper is old now. It’s like 2020, I think, is when it was published… Which is probably looking at 2019, 2018 data. But in general, something like 18% overall total cost optimization because of moving to single-size hosts, because you’re just like “Hey, our power draw was less overall, globally.” So I think the web tier was like 11% – I should have had it up in front of me… 11% more performance by switching to host profiles and allowing them to customize the host. Have you had things like that over the past four years, with these either optimizations in specialized compute, that have allowed you to even gain more global optimization? Because at a million hosts, a 10% gain in efficiency or lower power requirements is huge. That’s like megawatts of savings.

Yeah, we are also working on our own ASICs to do like inference and training. That’s probably the place where we’re gonna see not just like the monetary gains from developing in-house, but also on the power and resource side as well.

That’s fascinating.

That’s starting to come out this year in production.

Have you been enabling that through FPGAs that you allow people to program inside the fleet? Or how does that – how do you come out of like “Hey, we have an ASIC now, and it does some specialized computing tasks for us”?

Yeah, that’s a better question for the silicon team.

That’s right.

I only see the part where we actually get the completed chip, but I’m sure they’re doing their development on FPGAs.

And at some point they have like “Here’s a chip, go install it for us. And here’s a driver for it.” Right? They need to give that to you as a host team.

Oh, we have a team that actually I work pretty closely with, that writes [unintelligible 00:39:57.12] over the kernel. I think the accelerator is just over PCIe.

Meta sounds awesome. It sounds like you get to actually really dive deep on what you’re learning, and like your part of infrastructure, or development… Because it seems like you have teams for everything.

Yeah, you can really go as deep as you want to here.

Yeah, I really want to see an org chart now. There’s so many of these teams that just keep popping up, like “Oh, yeah, no, we have a team that does that.”

I know. That’s cool that it almost gives you enough abstraction that you can really focus on your specialty, because you get to really be deep in that area, because you’re not having to worry about all the extra components, I guess.

Yeah. That’s my favorite part. I mean, some people are just really into developing C++, or like the language. But then I’m on the infrastructure side; I just really like working directly with hosts.

And you’ve been there for a little while now, right?

Almost eight and a half years at this point.

I feel like people go to Meta and stay there forever… Because you probably get to get really good at whatever you’re doing. Plus, I feel like it would be cool to talk to those other teams, because when you have questions, they must be really good. If they’re so specialized in that area, then they must know so much about that when you go to like collaborate with other teams.

Yeah, it’s super-nice [unintelligible 00:41:13.27] Like, literally anyone, if you have a question. Everyone’s super-nice about helping you out, as long as you’re nice, too.

What did you do before Meta? Or is this like – have you worked at Meta your whole career?

Yeah, I started here out of graduation. I did one internship before I started here full-time.

What are you looking forward to working on in the next year? Are there big projects or big initiatives that you would like to tackle? Or even things in the open source, or like things that you want to give back and make sure other people know about?

I mean, I’m always interested in doing more stuff with Systemd. I think there’s still a bunch of components internally that could be utilizing Systemd in more ways, making sure that we’re on a common base. That’s kind of the main general goal that I’m always going to be focused on, I guess.

[42:10] There are also some bigger – I mean, Journald, I’ve been trying to get us to replace our syslog completely, and move entirely to systemd-journald. That’s an ongoing effort.

That was one of my biggest claims to fame at Disney Plus - I disabled our syslog. I was like “No, we’re just doing Journald now”, and it saved us so much just like IO throughput on the disk, and everything… And there was a lot of problems with it, too. Maybe we weren’t ready to do that, but I was like “We can’t ship Disney Plus until our syslog’s off.”

Yeah, I wanna be there.

It was great. It was a great feeling one day, when I’m like “I don’t need this anymore. I don’t need our syslog.”

I mean, [unintelligible 00:42:50.13] Systemd Networkd was pretty cool, but… I mean, now that that’s done, I can just like be happy with it. There’s probably some more stuff we’re going to be doing with like systemd-oomd, the out of memory killer. I think we’re about ready to get Senpai upstreamed into Systemd. Senpai is like a memory auto-resizer that we wrote… And I don’t think that that’s been open sourced in any way. I mean, we have like an internal plugin to do that with the old [unintelligible 00:43:22.28] I think it’s time to get that into systemd-oomd as well.
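For anyone who hasn’t used systemd-oomd: it watches swap use and PSI memory-pressure signals per cgroup and kills offenders when thresholds are crossed. Below is a minimal sketch of the upstream knobs with illustrative values – this is stock systemd-oomd configuration, not the Senpai auto-resizer or the internal plugin she mentions.

```ini
# /etc/systemd/oomd.conf - global defaults (illustrative values)
[OOM]
SwapUsedLimit=90%
DefaultMemoryPressureLimit=60%
DefaultMemoryPressureDurationSec=30s

# Per-unit opt-in, e.g. a drop-in for a hypothetical workload slice:
# /etc/systemd/system/workload.slice.d/10-oomd.conf
[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=80%
```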

Is that for resizing the container, the cgroup, and saying how much memory they have available? Or is that something different?

It’s a way to kind of poke a process and make sure that they’re only using the amount of memory that they actually need… Because a lot of services and things will allocate more memory than they need.

Interesting. Like a “Get back in line. You don’t get that memory.”

A little bit.

Yeah. Have you been doing anything with immutable file systems, or read-only, or like A/B switching hosts for – Fedora has Silverblue… I use a distro called Bluefin, which is kind of built on top of that, which does like A/B switching for upgrades to do reboots every time. It sounds like you’re doing rolling updates, so you would still be writing packages to disk instead of like flipping between partitions.

I mean, we’re trying to shift to like more of an immutable model internally. We have something called [unintelligible 00:44:22], and right now we’re rolling out a variation of [unintelligible 00:44:26.01]. It’s similar to – the goal is like kind of an immutable file system, but it’s making strides to get there. We still have to rely on Chef to do a lot of configuration, but a lot of it has shifted to a more static configuration, that is more deterministic and gets updated at a cadence where we can more clearly see what the changes are.

And I was asking that because leading into you said you want more Systemd stuff, and I’m curious if you’re trying to use things like Systemd system extensions, or sysext, or whatever it’s called, that are like layering different things on top of Systemd… Which is typically for an immutable file system, but still allow changes to happen.

Yeah. I haven’t looked too deeply into what that team’s been up to… But I do know that they did make use of some of the bleeding edge Systemd features to build these images, and things like that. We’re not using Systemd sysext just yet. I mean, I wouldn’t count it out.

Yeah. It’s one of those things that looks really interesting, especially if you’re trying to move more into immutable filesystem layers… Like, “Hey, I still need to configure this. How do I do that in a composable, immutable way?”
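For the curious, systemd-sysext works by overlay-mounting extension images onto /usr and /opt of an otherwise read-only root, gated by a release file that has to match the host OS. Here’s a hedged sketch of the moving parts; the extension name is invented.

```ini
# Inside a hypothetical extension image (e.g. /var/lib/extensions/my-tools.raw
# or a plain directory under /var/lib/extensions/my-tools/), this file must
# exist at usr/lib/extension-release.d/extension-release.my-tools:
ID=centos
VERSION_ID=9
# (or ID=_any to apply regardless of the host distribution)

# "systemd-sysext merge" then overlays the extension's /usr and /opt onto the
# running system; "systemd-sysext unmerge" takes it back off.
```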

Well, Anita, this has been great. I’m just nerding out, because I’m trying to learn all the things that I’ve done in the past, and still doing in the future… And I think it’s great that Meta is not only doing this at like just a core level of just like “Hey, we just have Systemd, and things run in that”, but also giving back upstream with the Systemd builds, and all the stuff that you’ve been publishing in the white papers, which Autumn and I were reading, and talks, but also just the open source work… So I think that’s fascinating. And we didn’t even get to talk about eBPF really that much… Because that’s a whole other topic.

[46:08] Oh, yeah… [laughs]

You have to come back. I think Meta gets a really bad rap for a lot of things, but I don’t think that you guys get enough credit for the amount of open source you guys do, and the white papers… The white papers you guys have written on databases, and the database contributions alone, are amazing. And there’s been so many things given away for free, so people can gain knowledge. I don’t think Meta gets enough credit for that.

Yeah, I think from the engineering standpoint we just kind of get the warm fuzzies when people actually use and like the stuff we write…

That’s like the best part of being an engineer.

Well, I find it fascinating because Meta is one of the few places that doesn’t sell the things that they talk deeply technically about. Amazon and Google and Microsoft are like “Hey, we’ve built this amazing thing. Now go buy it from us.” And Meta is like “No, we’re solving our own problem, and we’re just giving it back to you.” And that’s a really [unintelligible 00:46:59.04]

That’s what I’m saying. I think that people talk about what Meta does wrong, but rarely do people talk about the fact that they’ll be like “Hey, I just figured out this really cool way to do this at a crazy scale. And here it is. You can read about it and learn about it for free.” And I’m like “That’s awesome.” So… I think I’ve learned a lot from the different database papers, and different white papers that you guys have released… And it’s crazy that you guys released an entire AI model for free. It’s insane.

Yeah, I’ve been running Llama. I haven’t done Llama 3 yet though, but it’s on my list of things to play with.

Awesome.

I feel like white papers are a great way to learn and really get in-depth for something, so you can go and like do that project or try something out, because you get to see why that solution was made for that problem, and kind of like figure out how to use the projects that you guys release. I think it’s cool the way you do that.

Yeah, I really appreciate the academic side of things.

Yeah. And then having a podcast, we get to have people come on like you, that are hands-on all the time, and just like figuring out those problems. So this has been great.

That’s so cool, to read a white paper and then get to talk to you about it.

Anita, thank you so much… And we’ll reach out, I’m sure, in the future with more things. Maybe in the future we’ll talk about eBPFs and ASICs, and more work that you’re doing on the OS layer… Because that’s just a fun thing, and seeing how it grows.

Alright. I’m looking forward to it. Thank you.

Have a great day.

Break: [48:29]

Thanks again, Anita, for coming on the show and talking to us all about how Linux is managed at Meta, and how containers run, and upgrades, and all that stuff. Again, if you want to read more, the link to the whitepaper is in the show notes, so check it out. For today’s outro I made up a new game with a silly name. This one is called “Faux or fo’ sho’.” And Faux being spelled f-a-u-x, so this is – you’re gonna have to spot the fake or the real thing in this list.

Oh, I’m dying… I’m not prepared.

So in today’s list of Faux and Fo’ Sho’ we have lists of white papers. Some of these white papers have been generated by ChatGPT. Some of them are real white papers that have come out in a fairly –

Oh God, “whitepapers generated by ChatGPT” just sounds *bleep*.

I mean, it only gave me the titles. I didn’t ask for a full white paper. I didn’t read it.

Oh, no…

But we’re gonna start with one that – I believe all of these are going to be loosely AI-related, because we’re on an AI theme with [unintelligible 00:53:47.15] So one upfront that is a very common and very popular whitepaper, for people that don’t know it - I’m just giving you this first one… It’s called “Attention is all you need.” And this was the Google white paper about transformers, and when they kind of introduced transformers into AI, and how words relate to the word before, so it can keep these big context windows. It’s a very notable AI paper, because it introduces transformers and attention. So that’s the first one. But that’s what we’re talking about here. So I’m just gonna give the title, and Autumn, you’ve gotta tell me if it’s Faux or Fo’ sho’. So the first one we will start with is “AI for all. Strategies to improve representation and inclusivity.”

I hope it’s real, because it sounds not horrible…

It is a paper that should exist, but it is definitely faux.

I feel like we should have known it’s too good to be true.

I know, right? Throwing it out there if anyone wants to pick these up.

It doesn’t sound like they get enough funding.

Yeah… Maybe you have to figure out who’s going to pay for that one, to do that research. So how about “Causally abstracted multi-armed bandits”? Those are words…

I think it’s fake… I don’t know.

That is a real –

Why?! What is this about?

I’ve just read the intro on this one… Multi-armed bandits is about like decision trees, and like how you figure out what to deal with… And so they’re abstracting these decision trees in AI. Yeah, these titles are just –

Bandits?

Bandits. Yeah, multi-armed bandits.

Okay…

So that one is Fo’ Sho’.

Goodness…

“AI and the future of work: adapting to automation and augmentation.”

That sounds real, but I feel like all the real ones are made up by AI, and all the fake ones are like “What is going on here?” So I’m gonna say this should be real, because it sounds legit… But it’s probably not, because it sounds good.

Yeah, that one’s fake.

Oh, my God.

If they sound too good… I probably should have picked more realistic titles, but all the realistic titles were just like really boring.

I should have known what I was getting myself into with you…

Yeah. So that one is definitely fake. Okay, here’s another. “The hidden costs of homogeneity: exploring diversity in AI development.”

[56:07] It sounds too good to be real, so…

I know, right? It’s just picking them out now. It’s just like “Yeah–” ChatGPT knows what should exist.

It would be a good idea. It might make things better, God forbid…!

And I picked those two specifically because Autumn did give a “Diversity in AI” talk at Scale, and so I was kind of leaning into that…

I’m not always this bitter, I promise…

I definitely was pulling those out as like “Oh, yeah. Autumn knows these should exist.”

Did you take this from my talk…?

[laughs] “ChatGPT, generate some white papers from Autumn’s talk.” How about “Probabilistic inference in language models via twisted sequential Monte Carlo?”

I think it’s real, but what are we talking about? What is sequential Monte Carlo?

These are English words…

Like, isn’t a Monte Carlo a car? What are we talking about? And I never know with you, because you’re always bringing out like a car example, or like analogy for everything… So I’m like “Did you make this one up, Justin? Or is it real?” It could go either way.

No, Monte Carlo is actually an algorithm.

Interesting…

It’s popular actually in the animation area, which is where I first encountered it… And I started reading white papers about how Monte Carlo works. Because people used it for how they’re rendering things, and how they’re rendering light into scenes… And specifically, I think it’s predicting on something. I have to look it up.

Well, I didn’t know you could use an algorithm to render light. That’s pretty cool. I mean, it makes sense, but…

I have so many cool white papers about ray tracing, where Pixar and Disney and DreamWorks, and all the animation studios were doing ray tracing early on, and figuring out “Hey, how do we figure out how these rays work?” But anyway, that’s a side – this one is real, again, because it sounds absolutely bananas, with the terminology and words that were used. So yeah, that one is Fo’ Sho’.

Every time you say that word, it makes me happy.

I know. It’s obvious that I should not be saying that word.

I need you to repeat the whole title before we like sign off, just so I can enjoy it like one last time.

It’s Faux or Fo’ Sho’. [laughs] I had to find something that had some alliteration.

It’s ridiculous, and I love it.

Originally, I had something around Cap and No Cap, but that was too boring, so…

Oh, we’re bringing Cap back for something. We need that. Oh, gosh…

I’m reaching behind me for the podcast listeners, so Autumn can see…

What do you even have in there?

These are my capacitors… And so the camera’s not going to focus, because – there you go. See, it’s labeled as Cap. And the bottom box here is all my resistors, and it is labeled as No Cap. [laughs] [unintelligible 00:58:55.26] It’s an audio show, so no one cares, but yes, my –

[59:00] Oh, you brought like nerdy and like pop culture together, and I love it. I love it!

So yeah, that’s my organization of my electronics in my shelf behind me.

This is why I’m still friends with you, even though you don’t drink coffee and you forgot how we met.

My son eyerolled like right out of my office when I showed him that. He just like hurt himself rolling out the door.

Oh my God, our kids are getting to the age where they can judge us. We’re about to have teenagers, and they are gonna side-eye us so bad. My kid already side-eyes me, and he uses my sarcasm back, and it’s so offensive. It’s like a tiny version of me, and when I say these things to people, I’m like “Oh, bird!” And then when he says it to me, I’m like “Ahhh…!” I’m like “Why would you hurt me like that?”

That is definitely the reinforcement learning of the artificial intelligence model working there. Like “Oh, man, that was me. That was all me.”

It’s like hearing my voice back at me, and it is painful.

You have to think later, like, “Did I say that? Wait…”

It’s like his voice almost turns into mine, and when he does it to his dad, it’s hilarious. When he does it to me, I’m like “How could you do my own voice to me? I thought you loved me. I gave birth to you.” It’s horrible. I’m never sure if I’m like proud, or like hurt, or…

You can be both. It’s fine.

Some of them are really good. But then sometimes I’m like “How dare you…?”

So that’s all the titles and white papers that I generated… So we’ll let you off the hook. But I definitely think this one –

Dude, I cannot believe one had Monte Carlo. Now I want to go look up this algorithm, because that’s fire.

Yeah, this one definitely is going to be an outro that will probably come back at some point, because this was – I had more I wanted to do, and I was just like “I’m not gonna –”

We should do it for not just white papers too, but we should do like projects.

There’s a lot. There’s a whole list of open source projects, and…

Also, I just want you to keep saying that title over and over again –

Faux and Fo’ Sho’.

I cannot – like, this is how we know you spent too much time in California. That is such a California moment.

Thanks everyone for listening to this episode, and we hope to – if you want to reach out, we actually… I haven’t said - if you have people that want to be on the show, if you have topics you want us to cover, email us, shipit [at]Changelog.com. We read all the emails, we get back to most of them… And we always are looking for what you’re learning, and what you’re working on, and we like it to be open source, and what does it take to run software. And next week we have another retro episode that I’m just going to tease, talking about an old-school website that you may have heard of, and how it used to run 20 years ago. So…

Also, Rich is so much fun.

I know, Rich is great. So thanks everyone, we’ll talk to you again later.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
