Ship It! – Episode #85

The hard parts of platform engineering

with Marcos Nils, co-creator of Play with Docker & Play with Go


Marcos Nils has been into platform engineering for the best part of the last decade. He helped architect & build developer platforms using VMs & OpenStack, containers with Docker, and even Kubernetes. He did this at startups with 10 people, as well as large, publicly traded companies with 1000+ software engineers.

Today we talk with Marcos about the hard parts of platform engineering.


Sponsors

Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com

Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.

Notes & Links


Gerhard & Marcos

Chapters

1. 00:00 Welcome (00:50)
2. 01:10 Intro (04:34)
3. 05:44 Play with Docker (05:19)
4. 11:02 Play with Dagger (01:58)
5. 13:01 Hard truths about platform engineering (08:33)
6. 21:34 Building platforms (15:51)
7. 37:25 The second platform (06:59)
8. 44:24 The stateful data problem (02:20)
9. 46:49 The final platform (01:00)
10. 47:49 SA Quake III champion (15:25)
11. 1:03:14 Building a new platform (05:11)
12. 1:08:25 Serverless vs Containers vs VMs (03:45)
13. 1:12:10 Looking forward to 2023 (02:58)
14. 1:15:08 Wrap up (01:11)
15. 1:16:20 Outro (00:42)

Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

Hey, Marcos. How’s it going?

Hey, Gerhard. Doing great. It’s a sunny day here in Punta del Este, Uruguay, and I’m really happy to be here with you to chat about technology, life, and whatever comes up.

Yeah, welcome to Ship It. It’s been a long time coming. I’m so glad that we’re finally doing this.

It’s great. I think this is the first time that I’ve been on the show, right?

The first time, yes. Not the last time. I’m sure it’s not the last time… Well, I say that; it depends how it goes. [laughs]

It really depends. Yeah, so let’s see if we can get some interest from the audience and make this episode like something for people to take with them.

So the first thing which I want to say is thank you for Play With Go.

Oh, my pleasure.

What made you build it?

First of all, it’s a joint effort. These things are difficult to build by just one person, so I would like to congratulate and basically celebrate it with the other authors of the Playground. There’s one person with whom I started the whole Play With series thing, who is called Jonathan. We are colleagues. And the other person who helped me make Play With Go what it is today is someone in the Go community, someone you also know, someone very close to you, which is Paul Jolly. He used to work on Go tooling, and I think he’s actually working on Go today… But he’s very involved in the CUE project right now, with Marcel as well.

So yeah, it’s a fun story, because we met in London… I actually went to the Go meet-up there in 2018 or ’19, I think; I can’t recall exactly… And he was presenting something around learning Go. I think the brand new go.dev domain was also launched there, with Carmen showcasing it. I had a history of making Play With Docker, which - we can come back to that later. But in any case, I pitched the idea to Paul, and he was telling me that it was very difficult for them, for the Go tooling community, to be able to show people how to do specific things, especially with all the module madness back in the day, between different tools around how to handle dependencies, and all that. And I basically showed him Play With Docker, which is an open source project, and then we started brainstorming about “Hey, how could we leverage this to do something a bit more structured and robust to showcase Go use cases?” And long story short, a few weeks after that we collaborated together and then we shipped playwithgo.com.

And what happened afterwards? What happened after you got it out there?

Basically, the reception was pretty nice from people using it. I guess what I take with me from that experience is that I learned a lot during that process. First of all, I met people around the project; I think that’s what I like the most about doing open source, the people around it. And I had the chance to do a little bit of pair programming with Paul; I learned a lot of things from him; hopefully, he learned from me. And basically, the community was super-open to it, they really liked it, and that allowed us to involve more people to actually produce more content for Play With Go. And if you go now, you’re gonna see that there’s a lot of things around more advanced use cases: module retractions, or how to handle different versions, or how to bump a major version on a module, how to handle go mod replaces…

So it’s been great. I mean, I have to admit that it’s been quite stale for the past couple of months, I would say, this year… But we are looking for contributors, or like people that want to showcase different Go use cases… There’s the new Workspaces thing that we would like to include as well. But yeah, we’re looking forward to keep collaborating on it, and making it bigger to actually help people grasp the more specific use cases of Go, which are not so much related to the programming language itself, but more to the tooling around it.
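For anyone who hasn't met the module features mentioned above, they all live in go.mod. Here is a minimal illustrative module file; the module path, versions, and reasons are made up, and the Play with Go guides walk through each directive in far more depth:

```
// go.mod for a hypothetical module, showing the directives mentioned above.
module example.com/widgets/v2 // bumping the major version means adding a /v2 (or higher) suffix

go 1.19

require example.com/other v1.4.0

// retract tells consumers not to use a version you regret publishing.
retract v2.0.1 // accidentally published with a broken API

// replace points a dependency at a local checkout or a fork while you work on it.
replace example.com/other => ../other
```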

So I have used Play With Go multiple times, and I’ve found it super-useful - again, thank you very much for that. And I really mean it. It’s been so easy, so easy, especially when it comes to sharing with others. This is it. Super-simple. I know that you started with Play With Docker; that was your first Play With thing. What was the context which led to Play With Docker?

[06:10] That’s a really funny story… I don’t know if I would state it as an example. Maybe it is, but… We were in Berlin actually, with Jonathan, the person that I mentioned before, who was someone that I was working with at the time… And we were attending an event that was called Docker Contributors Summit, or something along those lines. And one of the personal challenges that Jonathan and myself had whenever attending any of these types of events, either DockerCon, or HashiConf, or whatever, was to use the event to hack something really, really simple, to show people, and then to basically help the community in some sort of way; the community of technologists that were attending that event.

And at that summit, we attended Jérôme Petazzoni’s Docker training, where he basically taught people advanced use cases of Docker, and then he showcased the latest features, and so on and so forth. And I recall that at the time Docker Swarm was becoming a thing… And then he had a lab where around 30 to 50 people were in a single room, and he was handing out actual pieces of paper with IP addresses of different Docker Swarm nodes that you needed to use in order to follow the course, where you had to actually SSH into multiple terminals, then create a cluster out of Swarm nodes, and all that.

So then we were sitting there with Jonathan and then we said, “Hey, this is very confusing, it’s very difficult to follow.” And it wasn’t only us. People were saying, “Okay, how do I use this? What happens if I lose my paper? What happens if my connection drops?” It was also quite challenging for him to spin up all this infrastructure, because he needed to – he was actually using three to five nodes per attendee, and there were like 50 people there. So if you do the math, sometimes he was running out of cloud resources to provision all that in a single availability zone; that was on Amazon, back in the day.

So yeah, anyways, we realized that there was a process that could be optimized there, and then we said to ourselves, “Hey, it would be amazing if you could do all this in a browser.” You have your cluster there, your terminals there, you can share it with someone else, you can even invite people to collaborate with you in that environment, a remote environment kind of thing… So yeah, we basically – I still recall that one night we got some beers, and then we said, “Okay, let’s ship something tonight. Let’s do a very minimal POC of how this would work”, and then we basically did it.

The next day, we – there’s a picture actually somewhere in one of the DockerCon keynotes where we presented the official project, there’s a picture with… There’s Jonathan, myself, Solomon Hykes, and then Julius Volz from Prometheus… And the four of us were drinking at a bar, and we were actually showing Solomon Play with Docker, right? And then I recall him saying, “Oh, it would be great if you could do docker run, and then expose a port, and then start an NGINX, and get like a public URL where I can connect to that service, the public service, with some routing magic happening.” And then the next day we actually shipped that…

Wow…

…and it’s there, out in the wild. Yeah. And then after that, we added a bunch of things. It became like a big thing. But yeah, that was the spark that basically started everything.

[10:02] It’s interesting how many great ideas start like that. “Let’s try and see what happens”, literally. “Let’s take a few hours a day, get it out there, and see if this thing floats or sinks. And if it sinks, that’s okay. And if it floats, how well does it float? And how much weight can we put in?” and things like that. So yeah, I mean, that’s what most of these stories have in common. Try it out, because everything is so random. No one can predict what’s going to work and what isn’t. Get it out there, and see if it floats.

Yeah, exactly. It’s about solving a user’s problem, right? Like, if you follow Paul Graham’s school, basically, it’s all about that. It’s all about the users. And in this particular case, we were presenting an alternative to a very annoying problem, and that actually seemed to work for people.

Yeah. Now, I know that you haven’t finished with the Play With series, and I don’t think you’ll ever finish; that thing is like one of your things. What is the latest creation in the Play With series, that I know most people will not have heard of yet?

So the latest one, which is completely different from the others, because it’s not reusing any of the Play With backend open source stuff - which you, of course, already know - is Play With Dagger, I would say, or the Dagger Playground. In case people don’t know, I’m currently working at Dagger, with you, Gerhard. It’s a portable CI/CD system which is programmable; you basically write your pipelines as code. And of course, one of the challenges there is to actually show people how this programmable thing works, and what you can do with it.

So around two months ago, we released play.dagger.io, which is the playground that you can get into, and then you’re going to see – currently, you’re going to see a GraphQL interface, but we are improving that… Where you can describe your pipelines in GraphQL queries, and then you can run them out there, and you can basically share them with people, and then also bring them to the community to get feedback, or maybe showcase what you’re doing… It’s been great, because it’s a different type of playground than the ones that I’m used to, which presents its own challenges. But yeah, that’s basically the last thing we shipped.
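To make that a bit more concrete: every pipeline described in the playground is ultimately a GraphQL query against the Dagger API, and the same queries are usually driven from an SDK. Below is a deliberately tiny sketch using the Dagger Go SDK roughly as it existed around the time of recording; treat the exact method names as illustrative rather than canonical, since the API has kept evolving.

```go
package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

func main() {
	ctx := context.Background()

	// Connect to the Dagger engine. Every call below is translated into
	// a GraphQL query like the ones you can type into the playground.
	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// A minimal "pipeline": pull a Go image, run one command, read its output.
	out, err := client.Container().
		From("golang:1.20").
		WithExec([]string{"go", "version"}).
		Stdout(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```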

One thing that has changed since is the URL. So if you’re trying to go to play.dagger.io and it gets a DNS error, that’s okay. It’s normal. It’s meant to be play.dagger.cloud.

Oh, you’re right.

So that has changed. There were many things that happened in the background. What matters is that there is a Play With, that you can try Dagger, that it exposes the GraphQL API… It has built-in documentation, which is really neat, and that all comes from the API. And a bunch of other things, but I’ll let you discover them if you want to.

Now, the idea of this episode started with the following hook: hard truths about platform engineering. So over the last eight years I know that you have helped build three separate engineering platforms for three different companies. Before we dig into what they were, and what worked well, and what could have been better, what does platform engineering mean to you, Marcos?

Hm… That’s a really good question. So if someone comes to me today and tells me “What is platform engineering?”, first of all, I would feel a bit confused about the term… Because a platform to me is not necessarily like something concrete, that you need to ship to accomplish a goal. I guess platforms - the objective is basically, as everyone knows, to make developers’ lives easier, to make them more autonomous… So you can do that in different ways, right?

[14:08] And the ultimate goal doesn’t need to be to build or ship a platform. You basically need to – you could solve the developer experience objectives, or developer experience tasks and goals, by delivering a set of opinionated workflows, and basically presenting blueprints, or presenting golden paths to your engineers… But that doesn’t necessarily mean that you need to ship a platform for that.

So I guess that at some point in time people started converging all these ideas of how to build this experience into one single term, and then they started building products around it, and that’s what I believe the whole ecosystem calls platform engineering. But to me, I don’t see any specific, relatable deliverable tied to the term, and basically to the goals that you need to achieve. So to me, platform engineering is basically making developers’ lives easier, which is what sysadmins, DevOps, SREs, and a lot of - call it whatever you want - people have been doing for the past few years.

Yeah. So in the same context, when platform engineering gets mentioned, sometimes as clickbait, the following thing tends to appear, which is “DevOps is dead.” Now, obviously, DevOps is not dead, just to make it clear… But it tends to attract clicks, it tends to attract eyeballs. What is the relationship that you see between platform engineering and DevOps?

So I guess the natural relationship that people make is they usually try to encompass the platform engineering term in shipping a product, a whole, fully-fledged platform that your company is going to use to do everything in it… And that kind of confuses who does what, because on one side you have the DevOps teams, which have been some sort of siloed team in the company, usually working behind the scenes, providing tooling and workflows for devs to ship code. And then you also have the SRE team, which is generally thought to be closer to the infrastructure, the cloud services, and the availability of the services. So when you bring in a new term like platform engineering, and then you try to see who fits where, it becomes a bit blurry to me who actually owns that product. And that’s why one of the hard parts of platform engineering - which is this episode’s name - is understanding that, in my opinion, the platform is built, or should be built, or could be built by everyone in the company. It doesn’t need to be a specific team that owns it, and a specific team that dictates what is built. Of course, there needs to be someone that drives the future, and then basically provides a frame for everyone to contribute to it… But what I’ve seen work the best is if you make everyone part of the project, and then you provide a framework where people can basically bring their opinions into it, and then understand how those opinions could help others in the organization to basically build faster, more secure and reliable software.

[17:53] Okay. So if DevOps is mostly concerned with getting the code wherever it needs to be, whether it’s an artifact, whether it’s production, whether it’s staging - all the tooling that takes the code from your laptop and gets it out into production; there is a lot of it. It’s usually CI/CD, but not only. You have security scanners, sometimes you shift left, and then some of that stuff happens on your laptop… All sorts of things around getting the code out into production. How does platform engineering change this? Because platform engineering is also concerned with having a platform, having some primitives, having some tooling that people can use to also get their code out there. There must be a difference.

Yeah, that’s a good question. That’s why I’m saying that one of the hard truths or the hard parts about platform engineering is that no single product is going to help with everything, is going to make everything magic. You are still going to need DevOps, you’re still going to need SREs. The only thing that I see – I guess we’re speaking about the current state, or how people are currently presenting platforms, right? So even if you adopt a platform - and that’s a very controversial topic as well; like, “I’m going to be adopting something which is an off-the-shelf solution” - you’re still going to need people that curate the golden path to basically do whatever thing you need to do for your software. Like either train a machine learning model, or deploy a simple API to production… And for those opinions, you basically need to talk to the users, understand their pain, and then iterate on a solution, gather metrics around that solution… “Okay, what am I optimizing for? Am I optimizing for bringing down the change failure rate, or do I want to optimize for shipping code to production faster? Do I want to optimize for bringing down downtime?”

So that team is going to have a specific metric that they’re going to be aiming for, and then usually, the team that does that is the DevOps teams, right? Because developers have a different set of metrics; product developers have a different set of metrics, that are usually more related to the company business. Like “Okay, we need to bring in more users, we need to –” I don’t know, whatever that business metric is. So you need someone that is striving to curate the internal shipping metrics, basically. And usually, those people that are in charge of that should be DevOps teams, or could be DevOps teams. But one of the things that I’ve seen happening a lot in organizations is that DevOps teams don’t have a clear set of metrics, and that’s why the line becomes blurry when platforms arrive, because you have like SREs, and DevOps trying to overlap in different tasks. But in my head, and in my experience, the goals are very different and very clear, but they complement each other. Like, if you ship software safer and faster, that’s going to make the system more robust, hopefully, and more available, and that’s going to allow developers to ship more things, which is going to ultimately move the business metrics. So everything is connected, right? And I guess the platform gives, as I said before, a frame to all this, but it’s not going to solve it, it’s not going to like magically merge teams into some magical product that is going to basically fix all your things.

Yeah. So we started high-level on purpose, just to paint a picture of how complicated this is. And everyone has a slightly different opinion, and also a slightly different experience. And you yourself had three separate experiences building platforms, or contributing to teams that build platforms, and each of them had very different outcomes. So I’d like us to start digging into that. We can start with the first one; that was, I think, about eight years ago… What was the context in which that platform was built?

[22:06] Yeah, that’s a very nice story. So I guess what I wanted to bring to this episode, as you mentioned, Gerhard, is that first of all, platforms have been here for a very long time, even before eight years ago. And the context that I had when that happened was that I was working for a very, very large eCommerce company in Latin America called Mercado Libre. Back at the time it was like a $100 billion company. Now it’s way less because of the market… But in any case, there were more than 1,000 engineers; we had like more than 2,000 applications back at the time. Probably now it’s way more; probably bigger than 5k, or something around those numbers. And cloud was still in its very early days; very few services were available, only like one or two players were there… And the company had pretty much all its infrastructure on-premise. So we were hosting our own data centers, networking stack, and pretty much everything. And we had recently adopted OpenStack, the project, which - you could argue that OpenStack was a platform as well, but it had way bigger objectives, which were basically to help with all the on-premise challenges.

But in any case, Docker wasn’t a thing. It was around 2013, 2014, and Docker was still in its very early days; it wasn’t production-ready. It was like a toy project. And then we basically needed to give developers - what I said before - more autonomy, provide them a golden path to things… Containers weren’t a thing, so we had to basically provision VMs to run dedicated workloads on each VM.

And then we came up with something that is called Meli Cloud. Meli is the short name of the company, the public ticker. It was basically a set of services built by what we called the architecture team. We had like the whole infrastructure department, we had different teams. One was managing the OpenStack deployment, the other one was managing the networking… And our team, which was called architecture - we were, back at the time, more than 15 people working on that. We were basically doing what you would call a platform team today, right? It’s funny, because we weren’t even called DevOps; we were software engineers working on, I would say, cloud services, internal cloud services. So we were basically providing that, right? So what we shipped back at the time was a very simple CLI, which was called the Meli CLI, where you could basically create what we called a pool. You had like a pool of VMs for your team, and then you could spin up multiple VMs, and then push your application, generate a bundle out of it, and then you could tag the bundle, and then deploy that bundle to production, everything through the CLI… Which was a very basic and opinionated golden path, but it solved a lot of headaches for people to actually do that simple task.

So what worked well with that platform? The things that you are most proud of, the things that were good in practice?

[25:50] So the things that I liked the most, and the things that I saw people actually enjoying, was the fact that it gave them a lot of autonomy when they had to manage their resources. So the typical flow back in the day was that you got into the company, you downloaded a CLI, and then you configured a set of credentials, which basically gave you some permissions to do specific tasks. But then after that, you could basically create an application that would give you a template with everything you needed to do things.

It’s funny, because it’s very similar to – if you see more advanced “platforms” today, it’s pretty much the same flow. You get a template, and then, of course, you get a Git repo out of it, and then when you push to that repo, there’s a series of hooks that get triggered… As I said before, we didn’t have containers, so there were some conventional Bash scripts that you needed to write in order for your application to install dependencies, build, and then be monitored; it was pretty much Bash back at the time. I wouldn’t be surprised if they’re still using that, to be honest… And then once you pushed that, there were some services - the ones that we basically built - that checked that all of that was in place, and then made sure that we deployed that thing to a specific pool of VMs. And then you could select between like a rolling deploy, or an A/B deploy… You know, pretty similar to what we have today. But we were basically using the OpenStack foundations for that, so we had to write a lot of code to make that happen, to orchestrate between different components of the OpenStack platform, and do all that work ourselves.

It’s interesting that it’s the principles that stay the same. Even when buildpacks came along, you had the different things that would run, and even when you were to implement your own buildpack, you would basically fill the template with whatever you needed for your buildpack. And I remember doing that a couple of times and thinking, “Wow, this is really simple. It’s still scripts everywhere…” Yes, there were scripts everywhere. And I’m seeing something similar now with GitHub Actions, where you have those little fragments, those little actions from the marketplace that you run, that you configure in your pipeline… And they can be anything; a lot of them are TypeScript, some are, again, Bash… I mean, that thing hasn’t gone away. I’m sure there’s a couple other examples. I’ve even seen - and those are my least favorite ones - the ones that have to build a container to run the action. That takes a long time. It can take many, many minutes. But the principle is the same. You have some script - for lack of a better word - that runs, you combine that script with a bunch of other scripts, and then you get a workflow. And the idea is the same - health checks, the same. I’m curious to ask you about the metrics, but I don’t think that’s relevant anymore. Many things have changed. But the basics have mostly stayed the same.

So in that world, I’m going to ask you, what could be better? And I’m going to also answer it. VMs, right? We all know. So let’s skip over that answer, because VMs have their own downside. What could have been better in that world, apart from VMs?

So we made a lot of mistakes while building the platform. I guess a lot of the mistakes that we made were also contextual to the infrastructure that we had, the decisions that we made… But one of the things that I really recall - I don’t know if regretting, or learning a lot of things the hard way - is that because we had an on-premise deployment, an on-premise thing, we wanted to build a lot of managed services for teams. For instance, if you had to deploy a memcache cluster, if you wanted a MySQL cluster, or those basic services, Elasticsearch as well - there were two ways that you could do it. And remember, I’m talking about sub-1000 engineer organizations here. So there’s a bunch of people that usually don’t communicate with each other, and they reinvent the wheel multiple times; that usually happens.

[30:20] So what we did is that we tried to – from the architecture team, we tried to come up with services similar to whatever you can find in any cloud today: RDS, ElastiCache, whatever… But we tried to build those services ourselves. And we tried to do it in a fast-paced kind of way, where we could show developers that we were shipping fast.

So I recall that the first thing that we did was called BQ, which was analogous to SQS, I would say; it was like a queue service where you could push a message and then consume it from somewhere else, and it was our initial approach to deliver an event-driven system, an event-driven storage system between applications, so you could get a notification when an entity changed, and then that got replicated all over the place, and you didn’t lose a message, whatever…

And the thing that I remember the most was, first of all, it is very difficult to build highly scalable, high-throughput distributed systems yourself, right? There are PhDs out there that are actually doing this - S3, SQS, whatever - and we were just a bunch of senior software engineers trying to tackle very challenging problems. So we found a lot of issues along the way. And even though we solved the problem for some time, eventually you had to basically re-adopt or redo all that work, because it wasn’t scaling. So we were basically trying to chew a little bit of a bigger bite than we could actually [unintelligible 00:32:11.16]

So yeah, we made some of those mistakes multiple times. So we tried to come up with this service, with [unintelligible 00:32:19.04], which - it worked pretty well for some time, but then it wasn’t scaling anymore. It was also built on top of Node.js, which was around version 0.4 at that time, very early days; we were adopting very edgy technologies as well… So I guess we fell into the trap of “Hey, we are at the very top of the wave, we’re riding the wave on the very, very edge, so let’s try to do crazy things. Let’s try to copy Amazon in what we’re trying to do”, and those weren’t really good decisions.

How hard can it be, right?

It’s extremely hard, yeah.

So if you’re building a platform today – I will try to bring that example to today. If you’re trying to build a platform today, whatever that platform is, and you need to build those services yourself, or – sometimes you’re not building the service, but you’re adding a lot of logic to an existing service, for whatever reason… Like, you try to make it highly available, or you try to do automated replication, or backups, or something to magically happen - be very mindful about that, because it is not an easy task. And it usually isn’t, not even if you have one or two very experienced engineers working on that. Try to be as simple as possible when designing those systems, because it’s not an easy thing to do.

[33:48] Yeah. I think it was 2014-2015, around that time, when I was involved with Pivotal Data Services. And I was on that team, so we were building a bunch of stateful services, we were managing a bunch of stateful services in the context of a platform. This was the Pivotal Cloud Foundry platform as a service; you could run it on-prem… Anyways. And that problem was really hard; really, really hard. Especially when you had production data in those systems. How do you do upgrades? And the distributed systems - I’m thinking specifically RabbitMQ, because I spent a long time in that world. Queueing, you mentioned, is very hard. And once you can put a dollar amount on every minute that this system is down - wow; you’re starting to see some serious issues, and you’re starting to see some serious consequences of something being down. And because the stack is very deep, you’re building things on top of things, you’re affecting things that you don’t even know exist. And then you start seeing weird failures in organizations because payroll is down. Why is payroll down? Because the upgrade is going through; there’s a lot of data to migrate, and it will be down for another couple of hours. Now, that is not the worst thing that can be down, but I think fast food orders are maybe top of my list. When people can’t order things from their phone, or I don’t know, cars can’t get unlocked, because there’s a service bus that makes use of this queueing system…

I still recall one last fun story around that - we also built a distributed caching system. It was similar to ElastiCache from Amazon, but the difference was that we needed it to be Redis, I believe, and ElastiCache was Memcached, I believe, initially… So we basically built an API that accepted writes and reads; it was both Redis and Memcached protocol-compatible, but under the hood it was only Redis… And I still recall that the day – of course, we did a bunch of tests on the service, and all that… The idea was that you said “Okay, I want a cache”, and then we automatically provisioned a multi-zone, full turnaround, replicated caching solution for you. So you didn’t need to deal with that yourself. And I still recall that the day we put that in production, or the week, it was the company’s end-of-year party; and we had built the service using Node.js, because we had a lot of experience in Node.js, and we were using an LRU cache inside the API so we could keep the hottest keys in memory, and it was a very famous LRU caching library. And when we were at the company party, we got a page saying that the service was basically being restarted for some reason, and then we investigated, and we saw a memory leak. Long story short, the caching library that we were using never evicted keys. So it was growing to infinity. And yeah, basically, then we had to patch the thing upstream, and it was a very difficult thing to do… So yeah, anyways; basically, delivering and working on these very sensitive, and supposedly infinitely scalable systems - it’s super, super-difficult.
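To make the failure mode concrete: an "LRU" cache that never evicts is just an unbounded map, which is exactly the memory leak described above. Below is a minimal, self-contained sketch of a size-bounded LRU in Go; it is purely illustrative and unrelated to the actual Node.js library they used.

```go
package main

import (
	"container/list"
	"fmt"
)

// lru is a size-bounded cache: once it holds maxEntries keys,
// adding a new key evicts the least recently used one.
type lru struct {
	maxEntries int
	order      *list.List               // front = most recently used
	items      map[string]*list.Element // key -> element in order
}

type entry struct {
	key   string
	value []byte
}

func newLRU(maxEntries int) *lru {
	return &lru{maxEntries: maxEntries, order: list.New(), items: map[string]*list.Element{}}
}

func (c *lru) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el) // mark as most recently used
	return el.Value.(*entry).value, true
}

func (c *lru) Set(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		el.Value.(*entry).value = value
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, value})
	if c.order.Len() > c.maxEntries {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key) // the eviction step that was missing
	}
}

func main() {
	cache := newLRU(2)
	cache.Set("a", []byte("1"))
	cache.Set("b", []byte("2"))
	cache.Set("c", []byte("3")) // evicts "a"
	_, ok := cache.Get("a")
	fmt.Println("a still cached?", ok) // false: bounded, not leaking
}
```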

So that was the first platform. We still have two more to cover. Let’s move on to the second one. What happened with the second platform that you were involved with?

So it’s also very interesting, because the second platform - and you’re gonna see that in these three different stories - the platform itself is a completely different outcome and product. And that’s what I think is the takeaway of this episode - “What is the platform, and how do I do it, or what do I do?”

[37:54] In the second case, after I left this [unintelligible 00:37:55.17] company, I went to bootstrap a startup with five friends, right? So we were only six people working there, and it was a machine learning startup. The people that were working there had little knowledge of cloud and distributed systems; they were mostly physicists working on AI. So they mostly knew VMs, and a lot of Python, and GPUs, and that’s pretty much it, right? And we basically had to build – we were trying to build some sort of as-a-service AI thing, and we had to build the whole pipelines to basically train the models, ship the models… Because before that, these physicists were just – they were giving names to the VMs… And you know where that comes from, right? When you name your VM after a ninja turtle, or a Pokémon, whatever… DaVinci, Michelangelo, and all that.

So yeah, we were basically only two engineers working on “the platform”, and we needed to come up with a workflow that allowed people to ship reliable code. It’s all about that, right? So during that time, we learned a lot about AI, and GPUs, and all that. And Docker was already an important thing in the industry. Docker Swarm wasn’t there yet, so the whole orchestration wars – I guess the only thing out there was Mesos, and as you probably remember, Mesos was initially aimed at very large organizations, so we basically used the bare essentials of Docker. And then we built a very simple “platform”, which in this case wasn’t even a CLI, or anything that you could run locally. It was mostly a very opinionated workflow on how to ship the code. And it basically worked in a way that you provided us a Dockerfile, and then you pushed code to your repo, and we took the responsibility of kicking off the CI/CD pipelines; everything was using Amazon back in the day… And then we were triggering the whole build cycle in the Amazon build services; I think it is still called [unintelligible 00:40:35.11] We were just packaging the AMI, and then we had an agent running on a VM or a set of VMs in the cloud, which was basically picking up the artifact, and then deploying the thing into an auto-scaling group, and that’s pretty much it. And then you also had the ability to fine-tune how you wanted the deployment to happen; like A/B, because you wanted to try something, or a rolling deploy as well… But it was a very minimal and simple thing; there were no steps involved that the developer had to do, other than creating a Dockerfile. And I believe that that was the magic of it. We actually managed to go very far with that simple approach. Of course, it was a completely different context, but our main contribution to the whole stack was making sure that the flow was simple enough to follow, and that developers - especially these people that came from the AI world, where they knew very little about services - only had to write the minimal amount of descriptors, which was a Dockerfile in this case, to basically be able to package and ship their thing.

[41:55] And that was pretty much it. We were very happy about the outcome, because even though we were a very small company - only six people, again, a very, very tiny startup - we managed to bring some hard opinions on how to accomplish very specific tasks, which these AI engineers were ultimately very happy about, because they didn’t know anything else. Before we shipped that, they usually created the VMs manually, they’d SSH into them, and then they uploaded the whole – they basically cloned the repository, and then started it there. And that’s it, right?

Remember that this was, again, six years ago, probably… So still, what we knew about platforms was at a very, very early stage, and they were very difficult to operate. This whole platform engineering, or PaaS, term is not something new; you could basically argue that it was mostly coined by Heroku maybe, or something around those dates… But people that have been in this space for quite some time - you and me, and probably other people in the audience - will remember projects like [unintelligible 00:43:12.08] as well… And it’s funny, because nobody – I mean, I haven’t seen people mentioning those projects right now, today. If you go, for example – the holy grail of platform engineering today is, I would argue, things in the CNCF, right? So if you google “platform engineering” or something, you’re probably going to land in a project that is somehow related to the CNCF. And if you go to the CNCF, all the platform things basically revolve around Kubernetes in some sort of way. But if you actually dig into the very depths of the early days of platform engineering, you’re still going to find projects that are active on GitHub - the ones that I mentioned before [unintelligible 00:44:02.25] - which are still a thing, and people still use them. And you could also argue that those are platforms, right?

So I have one question regarding the startup, and the second platform that you were involved with. How did you solve the stateful data problem? Because that’s the really hard part. Whatever platform you have, there will be state, and usually lots of it. The more state you have, the faster it’s changing, the harder the problem. You need to distribute it, you need – oh, there’s so many things. How did you solve it in your startup?

So the good thing is that since we were more connected to the cloud, we were just basically using Amazon, we relied on the services that Amazon provides to manage state; basically, all the things that needed to be transactional were basically in the RDS database… And you could create a pretty reasonable multi-zone database back in the day, so that was very nice.

The other challenge that we had was, of course, the state of the machine learning models; when you train a model, that basically generates an output, and then you need to ship that model to the VM where the model is running. There were a lot of funny things that we did with containers and AI, but that’s probably for a different episode and audience. But as part of this platform, we had to come up with a service that helped these AI engineers basically move the state from one place to the other in order for the applications to work. So when you needed, for example, to deploy a new version of a model, and you wanted to do some sort of A/B testing, which - that’s something very common in the AI world, where you need to keep the old model and the new model, so you can compare performance between the two… Back at the time – I know that today there are more evolved platforms, like this MLOps platform that is very popular, I can’t recall the name… But in any case, we had to build a service that took a snapshot of the EBS volume, created a new EBS volume out of that, then spawned the auto-scaling group, connected all the pieces, and all that, right? And I guess that’s what we usually call DevOps kind of work, right? We weren’t calling that platforms back in the day; we were just calling that our day-to-day job, which was basically building solutions for AI developers to be able to ship and manage the state, and be able to test different models.

What about the last platform? I think this is the one that you have many things to say about…

Oh, yeah. The last one - I would think this last platform, because it was one year ago, is probably going to resonate with the newcomers, and with the youngest audience, that is basically navigating the whole current platform engineering trends, and CNCF, and all that. So hopefully, that’s going to be a good takeaway for you. And afterwards, we could wrap up with some conclusions out of this whole story.

But anyways, the last platform that I worked on was at Wildlife Studios - that’s the company; it’s a Brazilian gaming company. If you play mobile games, and you’ve seen Tennis Clash, or Zooba, and Sniper 3D, this is the company behind them. And it’s an interesting –

Sorry, I have to interrupt you. We need a palate cleanser. You are the South American champion of something related to gaming.

Oh, yeah.

Tell us about that… Because I want to have it in the recording, for us to know what that is, so I can reference it back later.

Cool. Yeah, so back in my young days I was very fond of FPS games, and I happened to land on the Quake series, for whatever reason. And I basically started in the very early days of the internet multiplayer gaming thing. So I started with Quake 2, using dial-up… And then I moved of course to Quake 3 using cable, and basically, I started to like the game a lot, so I dedicated myself to playing one-on-one, or two-v-two in Quake 3. And one thing led to another, and then I became – I mean, for one year I basically won the Quake Pro League something in South America… So yeah, you could say that I was among the very best top players in Quake 3 in Latin America for – I still can’t recall the year; I would say 2004 and 2005, around those dates. So yeah, those were really good, good days.

I’m going to make an assumption now… It’s unlikely to be true, but hopefully it will be funny. Did all your involvement with platform engineering, and all the problems and all the frustrations that were building up during the day - did you take that to FPS, and use all of that frustration, and channel it into the game? Is that how it happened?

Well, maybe it was the other way around, right? I took all the anger and all those sentiments from the game and then I put it into platform engineering. Because platforms happened after Quake. But yeah, you could say that. It’s funny, because there’s – I don’t know if you saw it, but there’s a game which allows you to terminate Kubernetes pods by playing… I can’t recall if it is Doom, or another FPS game… But basically, you configure a Kubernetes cluster, and then you are inside a Doom map, and all the pods are enemies in that map. So once an enemy is killed, the pod dies. So…

Wow, I haven’t seen it. We have to link it in the show notes. Okay, so this was hopefully a pleasant segue for listeners, and now we’re going back to the third platform, the one that you started talking about, which was for a gaming company.

Yeah. So I guess the most important thing and also the takeaway from this is context, right? Context matters a lot when you’re building these types of solutions. And when I arrived at the company, they had a quite large - you could call it DevOps team… To be honest, I don’t like the term DevOps, because to me, everyone that basically produces software is a software engineer, right? You’re working on a different problem, which is infrastructure, or developer tooling, but you’re still a software engineer that does a job; the same job. I guess the term DevOps is easier to use to basically hire people, because you need certain sets of skills… But to me, everyone that writes software, or interacts with software in some sort of way, is a developer or software engineer.

But anyways, when I arrived there, the SRE/DevOps teams were 5 to 10 people, but the most critical thing, which is similar, was that developers were already exposed to some underlying concepts of “the platform”, because they already had a platform they were running, whether you like it or not. They were using Kubernetes, and were very early adopters of Kubernetes. They were following a GitOpsy approach, but not fully GitOps, because they didn’t have an operator in the clusters managing state; they were versioning their deployment descriptors - in this case Kubernetes manifests - in Git. But the way that those got applied was that they basically kicked off the CI pipeline, and the CI pipeline, when it finished, did a kubectl apply against the cluster, right? So it was imperative-ish, more than the declarative approach that GitOps basically tries to evangelize.

[52:18] So it was important to take note that developers already had a workflow. So they were already exposed to Kubernetes manifests, they were already exposed to kubectl… And to be honest, a lot of them were happy about it, because it was another tool in their tool belt; they could also find a lot of Kubernetes content out there to learn from, they could find courses, there are even books on Kubernetes… So developers are curious about things, and they like learning new stuff, so they actually liked a lot of the things that they used on a day-to-day basis. But of course, some people were frustrated, because they didn’t know Kubernetes, and they didn’t want to learn it. So that’s fine. And they also had to deal with a handful of Terraform stuff, because in order to get a database, what the team built before I joined the company was a very automated pipeline where you had to write your own Terraform resources, and then that got built and provisioned for you, but you still needed to make a lot of mistakes along the way to get it running.

So what happened - and please be careful if you see this happening in your organization - was that some new VP was hired… That usually happens in large organizations. We’re talking about - this wasn’t as big as the first company that I mentioned, but this company is big enough; we were around 200 to 400 engineers. So what happened is a new VP arrived, and said, “Back at the company that I was working at before, which I’m not gonna mention, we had a team build the platform, which is basically a centralized UI and control plane, where developers could jump in, and through a UI request a database, deploy applications, and all that.” So he basically came with his very constrained and opinionated way of building a platform, and basically told these SRE and DevOps teams, “You need to build this.”

And of course, I was actually there when that thing happened, and then we started researching, “Okay, what should we do?” Because embarking on a project of that magnitude, where you need to basically try to see how you’re going to leverage the current workflows that you have and try to make them more API-driven - it’s a very complex task. We also didn’t have any experienced person on the team to build the UI, which is a totally different set of skills… So we tried to look at what was out there that we could be using to basically deliver something like this. And then we came up with what a lot of people probably saw recently, which was the very early days of Backstage, which is the developer portal that Spotify basically open sourced… Which is claimed to be a tool that you can use to basically build a platform for developers.

It’s a very interesting project, I would say. It’s very complex as well. It has a complex architecture. But the most important thing is that Backstage brings its own set of opinions, right? It is designed in a way that it’s meant to be very extendable, and very pluggable, and I would argue that it was designed for a very specific type of organization and model of organization that we clearly didn’t have.

[56:14] So what happened, to make the story short, is that we adopted Backstage, we started making some changes to the core of Backstage, because it wasn’t designed for our organizational model and what we needed to build, and then that led to a multi-month, massive project where it wasn’t clear what everyone was doing, because this platform was supposed to serve the data team, the traditional application teams, the gaming teams… And basically, several months passed by, and because this Backstage thing also came with its own opinions, which didn’t involve developers having that much tooling locally to work with, we were basically taking away some of the tools that developers were used to, like kubectl and all that, because we wanted to simplify that process. But we didn’t realize that people actually liked being able to do some of those things, because they felt they had more control.

So in any case, we started building the thing, and then at some point we realized – we said, “Hey, it would have been way easier to iterate on the workflows, and on the pipelines, basically, and the golden paths that we currently had, by slightly changing the experience with little improvements, than to basically throw this big, massive thing at it to try to fix it.” And to summarize, basically… My feeling - and that’s why this is a hard truth or hard part of platforms - is that in my opinion, and again, this is a personal opinion, given my experience and all that… Trying to implement either an off-the-shelf platform, or an “Oh, let’s build this platform thing in the company”, is most of the time going to be very resource-intensive, and it’s going to generate a lot of noise within your developer teams, and within your SRE and DevOps teams.

So my suggestion for companies, either small or big, trying to tackle these challenges, is start small, and think mostly around the golden paths and the opinions that your company currently has, that your developers are actually asking about. Because if you bring in an already-existing platform, you might be lucky and it could work for you, but if you bring something that already has opinions, you are probably adopting someone else’s opinions from a different context, and that is very likely not going to work for your case. Even though everyone is trying to do the same thing, which is basically ship applications, the context and the past matter a lot, because people need to feel confident about it, and they need to reason about it and understand it in a way that makes sense for them. And I guess ultimately, it’s all about people and interactions, right? So the platform should be a project or a tool that makes developers feel more confident, and if you don’t take into account the past experience and the current workflows that people have, and are currently using, it’s going to be very difficult to generate adoption for it.

[59:52] I think most of us suspect that big bang rewrites are a bad idea. And what typically happens is that you’re trading problems that you know, and are familiar with, some intimately, for problems that you don’t even know exist. And you may not like what you’re getting. And it’s impossible to know what you’re getting, because everyone will tell you about all the positives. And to be honest, most people don’t even know what the negatives are, until they try it out. The bigger the change, the higher the risk it’s not going to work out as you imagine it. So how do you minimize the change? How do you improve what you have? How do you make those small, incremental daily/weekly changes and see, “Is this better?”, rather than the whole big bang, “Forget what we have, we know we can do this better, start to rewrite…” Or “We bought it, and it will solve the problem for us.” No, it won’t. It will bring other problems. And it will make some of the problems that you have maybe redundant, but you don’t know what you’re buying into. You don’t know what you’re getting yourself into. And I think this astonishment and disruption that you’re going to inflict on everyone should not be underestimated.

Exactly. I totally agree. And it usually helps a lot if you drive those conversations backed by data. One thing that I also believe we didn’t do in our last platform implementation was to basically present developers - and not only developers, but also upper management - with the raw data of our current processes. Okay, how much time does it take to onboard a developer, how much time does it take to deploy something, how many – you can take the DORA metrics if you want, but it’s important, once you get those metrics, to pick one and understand how you can optimize that one, and then really make a conscious decision whether you need to build something new to tackle that particular problem, or you can tweak and iterate on a current workflow that you have, to basically get that metric either up or down to whatever level you need, in order to keep moving forward.

And I would argue that in most of the cases, you can do little increments on your current processes or software cycles, that could help accomplish that without trying to adopt or find like a magical platform solution that fixes that.
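As a toy illustration of "driving the conversation backed by data": two of the DORA metrics mentioned above (deployment frequency and change failure rate) can be computed from nothing more than a list of deploy records pulled out of your CI/CD system. The record shape and the sample data below are entirely hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Deploy is a hypothetical record exported from your CI/CD system.
type Deploy struct {
	At     time.Time
	Failed bool // caused an incident or a rollback
}

// deploymentFrequency returns the average number of deploys per week
// over the given observation window.
func deploymentFrequency(deploys []Deploy, window time.Duration) float64 {
	weeks := window.Hours() / (24 * 7)
	return float64(len(deploys)) / weeks
}

// changeFailureRate returns the fraction of deploys that failed.
func changeFailureRate(deploys []Deploy) float64 {
	if len(deploys) == 0 {
		return 0
	}
	failed := 0
	for _, d := range deploys {
		if d.Failed {
			failed++
		}
	}
	return float64(failed) / float64(len(deploys))
}

func main() {
	now := time.Now()
	deploys := []Deploy{
		{At: now.AddDate(0, 0, -20)},
		{At: now.AddDate(0, 0, -12), Failed: true},
		{At: now.AddDate(0, 0, -3)},
	}
	fmt.Printf("deploys per week: %.2f\n", deploymentFrequency(deploys, 30*24*time.Hour))
	fmt.Printf("change failure rate: %.0f%%\n", changeFailureRate(deploys)*100)
}
```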

Yeah. And the other thing which would be worth mentioning is that when people go “We’re going to try this new thing, but we will only take a subset of the applications. We’ll start with one, or a few”, what typically happens is that a year or two later you end up having two platforms; half your workload’s in one, half your workload’s in the other one. That’s what usually happens. So be wary of that. Some will succeed in solving this problem, but most of you will end up in that world. And it’s also not good, because you now have twice the maintenance, twice the upgrades, twice the ways things go wrong in different ways… And - well, that’s a good challenge. And if you enjoy a challenge, go for it. But maybe there’s a better way. So if you were to build another platform today, where would you start?

I think I have a very concrete answer for that. So if I were to build something that could be called the platform, the most important thing for me to build is a centralized visualization place that people know to go to, to basically do something. Of course, you need to build the workflows that I mentioned before, and I’m probably going to build those based on my experience and what I know, because I’ve already been in this space for a long time, so I have a lot of opinions on things that I would like to adopt, and things that I would like to do. Of course, I would optimize for the simplest solution, so it’s understandable by anyone.

[01:04:04.19] But in my opinion, the most valuable short-term outcome you can give your developers, no matter how big the company is, is to have a place that people don’t need to think about, where they can naturally go and get an answer to the question that they have. For example, how do I create a new application? Go here, and then I’m going to give you a set of steps, I’m gonna give you an API, whatever that is; you can decide on the implementation based on your experience, whatever works better for you… But it’s nice to have a single place to go to and do that specific task.

“How do I now see the metrics of my application?” “Go to the centralized place, and then I’m going to take you to whatever monitoring system that we have implemented, and then you can go from there.”

So having like a central cockpit to see the state of your app, understand how to do things, see the metrics, and then get feedback, maybe ask a question to your teams through the messaging systems that you have, and potentially start building on top of that - in my opinion, it’s the best thing that you can do for any stage of the company, especially now that we are remote. Because that simplifies and augments communication by a huge, huge amount. So yeah, I would focus primarily on that.

Right. Is there something that exists today that you’d be tempted to try out as a first step? Something that you wouldn’t be building from scratch, something as a starting point that you would use. Or maybe a couple of things that you would use.

Hm… That’s an interesting question. To be honest, I don’t think there’s something I would use. What I would do until I can build that is - since it’s all about communication, as I said before - I would build the platform probably based on whatever communication system we have in the company; either Discord, or Slack… They are very well-integrated things that you can leverage and build a lot on top of. So I think what I would do is start building some sort of ChatOps flows, where you can go to these communication things and say, “Okay, what operations are possible? You can do this, this and this.” Initially, I would do them through input/output; I think we would call it prompt engineering now, after ChatGPT… So I would do some prompt engineering there, where you can get feedback rapidly through those channels. And then eventually, when I have time, I would go towards a UI-based thing, because there are some things that can’t be explained through the terminal, or through text. But I think that there’s a lot you can accomplish through text initially, right? So you can – if you want to say, “Hey, I want to see, for example, the metrics of this application”, you can redirect people to whatever monitoring system you have, as well as whatever logging system you have. So personally, I’m not currently looking at any particular product or service that could fit into that space. The only thing that I’m looking at, though not directly related to building a platform, is, of course, the new WASM trends that have been happening out there… Especially – I think Fermyon is the company, and there’s another one called Cosmic something… But I’m not looking at those as ideas on how to build a platform; I’m basically looking at the opinionated workflow that they present for developers to build and ship an app. Because eventually, in the future, if this new paradigm brings new, different ideas on how to simplify that approach, I’m probably going to take those ideas and build something within the context of my teams and my organization, with the hope of simplifying the process for my company. It’s going to be very difficult to adopt them as they are, and try to implement them for our teams.
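A minimal sketch of the ChatOps idea described above: a tiny HTTP handler that receives a Slack-style slash command (Slack delivers these as a form-encoded POST with a `text` field) and answers with the golden-path link for the requested task. The endpoint path, the task names, and the URLs are all hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// golden maps a task name to the place developers should go for it.
// Tasks and destinations are made up for illustration.
var golden = map[string]string{
	"new-app": "https://internal.example.com/templates/new-service",
	"metrics": "https://grafana.example.com/d/service-overview",
	"logs":    "https://logs.example.com/search",
}

// handle answers a slash command such as `/platform metrics`.
func handle(w http.ResponseWriter, r *http.Request) {
	task := r.FormValue("text") // the text after the slash command
	if url, ok := golden[task]; ok {
		fmt.Fprintf(w, "For %q, go to: %s", task, url)
		return
	}
	fmt.Fprint(w, "I know about: new-app, metrics, logs")
}

func main() {
	http.HandleFunc("/slash/platform", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```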

[01:08:24.20] Yeah. Can I assume that you would pick containers over VMs?

Yeah, you can assume that. That’s the right assumption. And I think the tough question is “Are we going to pick serverless over containers?” Or what’s gonna happen there, right?

Hm… Interesting. Which way would you go, serverless or containers?

To me, the serverless story is very appealing. Today, I don’t see a strong use case to move everything to serverless, because I currently see that there are some things which are not yet solved. The compute side is still a problem. The statefulness of serverless still has some quirks to it. For example, if you need to have a persistent database connection, that’s something that you cannot do, because instances could eventually scale to zero. Since I’m not working on very latency-sensitive systems, I don’t care that much about cold starts, so to me that’s not a problem, but I know that some people complain about that. So even though I agree that I would do a lot of things in serverless, I don’t think we are right on the spot to fully translate to that. But I do believe it’s going to be disruptive in a very short time, in how people think about shipping and building and packaging apps. And it makes a lot of sense to start thinking of apps as a combination of Lambda functions - basically different sets of logic that get connected to each other - instead of a shippable binary that you build into a thing, and then have to deploy.
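As a rough illustration of the persistent-connection quirk: with scale-to-zero you cannot keep a long-lived connection around, so the common workaround is to initialize it lazily and reuse it only while an instance stays warm, paying the connection cost again on every cold start. Below is a minimal sketch in Go, assuming an AWS Lambda-style function and a Postgres database; the library choices, environment variable, and query are illustrative, and any FaaS platform has the same shape of problem.

```go
// Sketch of the "no persistent database connection" problem in serverless:
// the *sql.DB is created lazily and reused across warm invocations, but a
// cold start (or a scale-to-zero) means reconnecting from scratch.
// The DSN, query, and table name are illustrative.
package main

import (
	"context"
	"database/sql"
	"os"

	"github.com/aws/aws-lambda-go/lambda"
	_ "github.com/lib/pq" // Postgres driver, assumed for this example
)

var db *sql.DB

// getDB opens the pool the first time this instance handles a request.
// Once the platform scales the function to zero, the next request starts
// a fresh instance and pays the connection cost again.
func getDB() (*sql.DB, error) {
	if db != nil {
		return db, nil
	}
	d, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		return nil, err
	}
	db = d
	return db, nil
}

func handler(ctx context.Context) (int, error) {
	conn, err := getDB()
	if err != nil {
		return 0, err
	}
	var users int
	err = conn.QueryRowContext(ctx, "SELECT count(*) FROM users").Scan(&users)
	return users, err
}

func main() {
	lambda.Start(handler)
}
```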

Interesting. Would Kubernetes make your cut?

The way it is designed today? No. I think that’s one of the - I mean, that’s one of the biggest challenges in platforms: when the foundational basis changes, it’s very difficult to adapt. OpenStack, even though it was designed to be on-premise, was designed to be VM-based. Kubernetes has been designed from the ground up to be container-based, and all the components around it and all the concepts around it are around pods, and nodes, and containers, and sidecars, and all that. Even though I agree that Kubernetes is a very flexible platform, I wouldn’t call it an orchestrator. I think people are more accurate in calling it an OS, because it really provides a foundation. You could eventually come up with serverless resources. I mean, you currently have projects around it, like Knative… It helps with that. It creates the abstractions to handle serverless workloads.

I don’t know, to me the whole system is designed to be so container-aware that it’s going to require quite some effort to try to abstract it and make it serverless-native; I don’t think it’s going to cut it, but I might be wrong. I guess we’ll see what happens.

[01:11:56.03] So it’s very hard for me to not start recording the second episode with you right now… Because this is something really interesting, and I’m sure we could dig into it. Unfortunately, we do have to stop on this occasion. And the last thing which I’d like us to cover is what are you most looking forward to in 2023? …since this is one of the first few episodes of the year.

As personal, general, technology platforms…? Is there any context that you would like me to scope the question to?

I think it would be technology. I think that’s what the listeners will most resonate with. But if you want to share a personal one that you’re most looking forward to, go for it.

So I envision a 2023 where you can safely run CI/CD pipelines locally and in the cloud. That’s what I’m currently working on to help developers with. That’s one thing that I would like to see happening. I would also like to see more WASM and serverless adoption, and use cases of companies using it for production workloads, where you can see that it actually - I wouldn’t say scales, but where you could actually see it as a viable paradigm to build an app on top of. I haven’t seen too many serverless-based products so far.

Of course, I guess we can’t deny the AI effect, right? So there’s gonna probably be a lot of tooling around that. For the platform space I’m not sure where the biggest advantages are going to be. I don’t know if it’s going to be cloud costs, or some sort of Copilot-based thing… You can probably say to ChatGPT today, “Hey, I want to provision a database in AWS using Pulumi and TypeScript”, and it’s probably going to do it. So there’s gonna be a lot of that as well, like prompt engineering.

And I don’t know, the last thing that I would like to see is - one thing that I miss a lot, that I would actually like to come back, is some sort of meetup thing, right? I believe that, given the past two years, a lot of human contact has been lost in communities, in people sharing knowledge. We all became a little bit salty on things, especially on social media and all that… So I would love to see something that brings people together. I don’t know, I really miss the early days of the excitement about a new technology, about something that is going to make people’s lives easier. So it would be nice to see something come up to help with that problem as well.

Okay. Those are all great things to look forward to.

Yeah, we’ll see what happens.

Speaking about people, and speaking about coming together, today is exactly one year since I joined Dagger. So I’m very glad that on the 4th of January 2023, when we are recording this, I get to share it with you. Thank you, Marcos. I really enjoyed it. Thank you very much.

Happy new year, Gerhard, and it’s a pleasure to work with you every day. We are building some really cool stuff together, and the most important part is that we are learning and we are sharing. It’s all about that, right?

Yeah. Likewise. Same here. I can hardly wait for the next time that we get together in-person. It has only happened once, and that was cut short. And yes, it was COVID, unfortunately… But this year, it’s gonna happen again. Not the COVID part, just the getting together part. I would much rather not have that again… But we never know. So once again, Marcos, thank you very much…

I promise that you’re going to be able to cycle across the Golden Gate Bridge. It is going to happen.

Yes. That is on my list. Still on my list.

Take it for granted.

Alright, Marcos, see you in the next episode.

Always a pleasure. Take care.
