Ship It! – Episode #44

Fundamentals

with Kelsey Hightower


Today’s conversation with Kelsey Hightower showed Gerhard what he was missing in his quest for automation and Kubernetes. The fundamentals that Kelsey shares will most certainly help you level up your game.

This is a follow-up to the last 45 seconds of the Kubernetes documentary.

Oh, and we finally cleared up where we should run our changelog.com PostgreSQL database 🙂

Featuring

Sponsors

MongoDB – An integrated suite of cloud database and services — They have a FREE forever tier, so you can prove to yourself and to your team that they have everything you need. Check it out today at mongodb.com/changelog

OpenZiti by NetFoundry – Programmable network overlay and associated edge components for application-embedded, zero-trust networking. Check it out at netfoundry.io/changelog

Raygun – Never miss another mission-critical issue again — Raygun Alerting is now available for Crash Reporting and Real User Monitoring, to make sure you are quickly notified of the errors, crashes, and front-end performance issues that matter most to you and your business. Set thresholds for your alert based on an increase in error count, a spike in load time, or new issues introduced in the latest deployment. Start your free 14-day trial at Raygun.com

Rewatch – Rewatch gives product and engineering teams async superpowers and helps them move faster with greater clarity. Imagine all of your team’s videos, all in one place. Record, organize, and share the videos that your team needs to ship great work. Get started for free with a 14-day trial at rewatch.com.

Notes & Links


Transcript


The last event that we were both at in person was KubeCon North America 2019. That was a crazy good event, and we almost had a recorded conversation, but timing was off, so that never happened. You have no idea how glad it makes me, Kelsey, to welcome you to Ship It today.

Awesome. I’m happy to be here, looking forward to the conversation.

So I wanted to talk to you about this for years - literally, years - about the Changelog.com database running on Kubernetes. Do you still think it’s a bad idea? Because I remember you saying you shouldn’t run databases or any stateful services on Kubernetes; just use a managed service and you will save yourself a lot of pain. Has that changed over the years?

So let’s think about context… I used to work in large enterprises, financial services that have petabytes of data in the database. They can’t play games. They can’t have it go down for 2 or 3 minutes because of a DaemonSet or a Kubernetes upgrade. They can’t do that. This is people’s money; the cost is very high, and the return on value is very low. Typically, when you think about a database with a custom file system, you need all the bandwidth you can get, so we use custom network cards; you’re trying to do everything you can to get the maximum amount of performance. You’re not optimizing for convenience, you’re optimizing for performance. A lot of people have never run a database like Oracle or Db2 in an environment where there weren’t redundant RAID controllers, redundant network interfaces. You can’t make any compromise. It’s not even about saving money. You will pay 3x to make sure that the database is always performing well.

[04:13] So if you think about moving that to Kubernetes, what value are you going to get? You could say “I could now put my database in a container image.” That probably wasn’t your number one problem, at all. You could say “I could use a Kubernetes manifest to deploy it in case the machine goes down”, but the truth is most companies can only afford two or three of those machines that are qualified to even run this kind of database. So there’s never going to be a world where you’re gonna have a hundred of these machines around, and Kubernetes can just pick one.

So in that kind of world, you know which two or three machines it’s going to run on. Just pre-provision. You don’t need all of these failover capabilities, because there’s nowhere to fail over to. So in those worlds, if you spend the next 3-4 months trying to make a Kubernetes operator for Oracle, now you’re really wasting time, because the thing you can’t afford is one misstep, one miscalculation – the pod restarts and it gets a new IP. Now what? You have to go update the replica config. And I know what you’re thinking. “Okay, now I can just go build a tool that when the IP changes, it will go and update the config.” What are you doing? These are diminishing returns.

Now, if you’re talking about Postgres for a WordPress website, or you have 100 gigs or so of data - look, if you go down for a little while, you’re probably going to be fine. Maybe you don’t have a big operations team, and you can benefit from some of the automation that Kubernetes introduces to that stack. Fine. Do it. Maybe one day the industry will figure this out, but I think for some use cases we just need to be honest and realistic that you’re probably gonna have to turn off so many Kubernetes features just to make sure that that database runs properly. So that was my advice to people that have serious, serious workloads, that cannot take these kinds of risks.

It’s so easy to read a tweet, maybe do a bit of research, spend like maybe an hour, and realize “Okay, I know it. The conclusion is don’t do it.” And then you miss out on so much of this nuance which is there; it changes, it depends. And even that “Depends” - it changes on so many factors. So whenever I tell one person “It depends” - well, okay, they had like maybe 90% of that, but the 10% is different, and that “Depends” again changes. So there’s so much there to be said about taking anything you read and saying “Okay, I know. I understand this.” Until you talk, like have a conversation, it’s so contextual, and I was completely missing a lot of the stuff that you mentioned. So Twitter, can we have (I don’t know) 10,000 characters, please? …as if that’s going to fix it.

I think you’re exactly right, and this is why I like to default to the safe path that requires you to do research to prove me wrong… Because if I say “Oh yeah, Kubernetes is ready for primetime. Go run your databases”, then people default to that, instead of the other thing. So yes, that context is important.

So when we started, I was thinking “Okay, I read and understand what Kelsey is saying.” I thought so, obviously. I didn’t; I was still missing some details. “Let me try this anyways.” Two years later - “Okay, no. This is not going to work.” It’s like, there’s no operator… Okay, there’s PostgreSQL, it’s simple… Still, we hit so many issues. We had three downtimes for two separate reasons. There was a Kubernetes operator, there was replication, there was block storage and all that… So everything was there, and it was kind of working, but when it didn’t, when we needed it the most, everything just blew up in our face. And it’s like a fairly simple thing. It wasn’t Oracle or Db2, which is a fairly complex beast.

So what we did is we just went simple. You’re right, we don’t have more than even like – I think at this point maybe it’s two gigs of data. It’s not that much data. So… Single replica, stateful set. The easiest thing is if there’s a problem, restore from back-up. It takes maybe a minute to get all the data from back-up, load it up and run it. If there’s some data loss, it will be maybe 30 minutes, maybe 40 minutes… Which is okay. We don’t have that many writes. It’s not a problem.

[08:17] And the simplest thing is almost like – we’re using it almost like SQLite. It hasn’t failed in a year. It’s fine. It’s fine. It is PostgreSQL, but still - simple, single instance, all that. So that was my learning from that - keep it simple.
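The single-instance setup Gerhard describes could look roughly like the following minimal StatefulSet. This is an illustrative sketch, not the show's actual config: the names, image tag, and volume size are assumptions.

```yaml
# Illustrative single-replica PostgreSQL StatefulSet.
# Recovery strategy is restore-from-backup, so one replica is enough.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 1                # single instance, "SQLite-style" simplicity
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: postgres
          image: postgres:14   # placeholder version
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi      # small dataset; size is a guess
```

With a couple of gigs of data, restoring this from a backup after a failure takes about as long as the pod restart itself, which is the trade-off being described.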

In that use case you should probably be using – I’m assuming that you’re in the cloud – given that limited amount of data, that few writes, you are the perfect use case for a managed SQL offering.

I get you might wanna earn your stripes in the open source community by running it yourself, but - come on… This is why managed services exist, this is why we don’t run our own DNS server, this is why we don’t run our own email server, unless you have some really serious requirements. But what you’ve just described, which is the case for actually probably the majority of the world - what are you doing? Spend $15, $20, and don’t even think about this particular problem. You’re still in the open source world, but is it really worth all this mental overhead to make sure that it runs in Kubernetes? And that’s the other essence of that particular comment, is that - if that is not the number one thing of value for you, just use the managed service… Because Kubernetes is really great, in its own right, for what it’s really great at, but there’s nothing wrong with combining two services together and focusing on the protocol.

I hear what you’re saying, and I was thinking about it, “Why am I not doing this?” And the thing that used to come up - it’s not as relevant now, but it used to be in the past, like 2-3 years ago… It was “Everything I have, I’m trying to keep in Kubernetes, so that I have this API through which I can declare everything, and everything is there.” The IaaS that we’re using for our managed Kubernetes didn’t have a managed PostgreSQL. So then we have to figure out how to interact with this managed database via Kubernetes, so that we have the same API, we can declare everything, rather than having this Terraform run, or Chef, or Puppet this, or Puppet that… It doesn’t really matter what; the point is, keep everything self-contained. So how do you keep everything self-contained, rather than starting to spread things, like DNS you do some clicks here, and for your CDN you configure it this way, and then before you know it, your whole setup is spread across everything. Crossplane is helping with that. And not just that. That’s just one example. Tooling has evolved over the years, but when we started, it wasn’t there. And I think it’s getting much better now, and I think the proposition is better. But do you have a specific database in mind that you’d recommend? The one that maybe you use, that you had a good experience with, like a managed one that you can declare and manage from within Kubernetes, but it’s still managed?

So this is the danger… It doesn’t have to be managed by Kubernetes for me, because that hammer/nail approach is no good. Because whenever we limit ourselves, what’s the point? If Kubernetes were to say “We should do everything via the Docker API” - remember, that’s what Docker Swarm tried to do. “Let’s do everything via Docker. Let’s not think about anything else”, so therefore they couldn’t build this system like Kubernetes. Kubernetes is here, it is really great for what it’s good at, but it doesn’t mean that no other system should exist, there should be no other thought in computing.

These other APIs, when you think about provisioning, databases will soon be like CDNs. You’re gonna want multi-regional, highly-replicated things. You’re not gonna wanna be tuning a single server. That’s gonna go away. But you’re gonna care about the protocol. So if you want the Postgres protocol, you’re gonna find a service that can deliver the Postgres protocol with the SLA, and other attributes like performance, that you want.

Think about CDNs - do you install a CDN inside a Kubernetes? Are you really serving media files from Kubernetes? There’s a cheaper way, and you do that via a CDN today. Now, if you really like the Kubernetes style configuration language - yes, Crossplane is one way to say “I want to use a Kubernetes style API to configure these other resources.” But even then, I would probably argue, “Do you want those to be one and the same? Do you want the compute cluster to also be hosting a configuration management tool?” Remember, this is an implementation detail that is leaking.
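For reference, the "Kubernetes-style API for other resources" idea Kelsey mentions looks roughly like a Crossplane claim. The group and kind below follow Crossplane's getting-started composition example; the exact names depend on which provider and composition you install, so treat this as a sketch:

```yaml
# Hypothetical Crossplane-style claim: a Kubernetes-shaped object that asks
# a provider controller to create and manage a database *outside* the
# cluster (e.g. a cloud-managed PostgreSQL instance).
apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: changelog-db
spec:
  parameters:
    storageGB: 20          # illustrative parameter names
    version: "14"
  writeConnectionSecretToRef:
    name: changelog-db-conn  # credentials arrive as an ordinary Secret
```

The database itself never runs in the cluster; only the declaration and the resulting connection Secret live there, which is exactly the "configuration vs. runtime" separation being debated.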

Ideally, what you want is an API server and a place to run a control loop, and let that be your Terraform-like thing that happens to be based on Kubernetes. There’s no reason why we need to stuff all of these control planes in a single cluster, because what happens when the cluster API goes away? Not only is your compute down, your configuration engine is down, and then people are gonna be like “Why would you do that?”

[12:32] So that would be my advice - let’s really focus on protocols, and pick the right protocol for your business. If you need a CDN, pick a CDN. If you need great networking, maybe your cloud provider has something. Or maybe you prefer Cloudflare. Make all those decisions and then step back and say “Okay, now that I have the perfect set of tools, how do I configure them?”

I think as an industry we’re moving towards “Wouldn’t it be nice if we had a declarative API for all of these things?” Because when I started in tech, there was no great API for most of my tools, so something like Terraform was never going to be an option. Even when I worked at Puppet Labs, we used to try to put a Puppet agent on things like switches in order to give it a programmable API at some of the layers that were not available before. And so now I think where we are in 2022, most of these things have great APIs. Are they Kubernetes APIs? No. But tools like Crossplane, or even Terraform can take care of this universal configuration for us, while we focus on using the right tools for the job.

That’s very interesting. That basically makes me feel good about some of the decisions in the back of my mind… And it makes me believe that “Okay, I’m going in the right direction.” You’re right, combining everything; makefiles maybe is not it… I had a period that I still have, where - you know, like, that’s like the starting point. You have a makefile, it gets your kubectl locally, it pulls all these credentials from different places, whether it’s 1Password, whether it’s KMS, whatever it is… And it’s the glue that holds everything together.

And then you have something like serverless. Why aren’t we using serverless? Well, I don’t know… Let me just figure out my database first, and then I can start thinking about serverless. This is all very interesting, because it comes back to the first reason why I reached out to you, the initial reason why I reached out to you recently, and that was the Kubernetes documentary. At the very end of part two, the last minute, actually, you end up saying – you said something that really resonated with me, and you mentioned how first of all there was no zero-sum game between the containers, and the container orchestrators, and so on and so forth… And how the best ideas consolidated into Kubernetes, which is just a checkpoint. But there’s going to be something that will replace Kubernetes. I’m not sure how far away we are there, how far along we are, but it really caught my imagination, and I would love to trade what I imagine, with what you imagine. So what do you imagine?

Well, you know, there’s this phrase, “It’s easy to predict the future when you’re working on it.” A lot of us at Google Cloud are working on that replacement. When I first joined Google Cloud from CoreOS, lots of people were running their own Kubernetes clusters across a bunch of virtual machines. This was the norm. And then we had a managed service, Google Kubernetes Engine; you can now click a button, and then we would automate all the cluster provisioning, and the upgrades, and so forth. But the nodes were still there. You still saw the machines, and it looked like the cluster you would manage yourself.

And then we came out with tools like Cloud Run, that was based on the Knative spec. And for those that are unfamiliar with Knative, there’s this idea that if you looked at just running a very simple application inside of Kubernetes, you need a deployment object, you need some secrets, some configs, you need a load balancer, ingress service, pod disruption budget, a horizontal pod autoscaler just in case you need to scale, and you still didn’t have things like scale to zero. So we looked at that and said “Okay, what is the pattern that the average person running applications in Kubernetes needs?” So things like Knative were born.

Now, if you look at Knative and say “Well, if you think that this is the right abstraction, then you may not need the rest of Kubernetes to run applications in that style.” So now what we can do is have a managed service that’s a serverless platform, Cloud Run; we take the same containers, we can take the same Knative spec, and just make Kubernetes disappear. There’s this myth that we are running Knative or Cloud Run on Kubernetes. We are not. There’s no Kubernetes cluster involved, at all. All we do is we allow you to give us the specification at the API layer, and we translate that into our native API internally, so there’s no need for a Kubernetes cluster.

And even the people that need maybe more of the Kubernetes API – so if you just think about APIs and not implementation, then things like DaemonSet, and jobs, and all these other things, you can get that via GKE Autopilot. Again, we make the cluster disappear, but we give you more of the complete Kubernetes control plane.

If you fast-forward this over the next ten years, a lot of these patterns will be available in other platforms. For example, Cloud Run has in beta right now (or alpha) the ability to run Kubernetes jobs.

So I think that people have proven that Kubernetes has been a great prototyping ground for workload definitions and APIs. And if you really believe in promise theory, then we know that those APIs and specs are just the promise that you’re making, and any system should be able to converge or keep them. So this is what I mean by something in the future will replace Kubernetes, because we are actually building the ability to.

Okay, that sounds really fascinating. I will have a follow-up question, but I promise to trade. So this is what I’m thinking… Whenever you get a Kubernetes, you think it’s the same, and it mostly is, but not always. There’s differences between IaaSes. In some cases, you would use, for example, a persistent volume claim. But that persistent volume claim will be so different between IaaSes that maybe it is the wrong thing to pick. Initially, in our case, we couldn’t pick SSDs for our persistent volume claims, which made anything that was using them really slow, depending on reads and writes. Networking - that keeps coming up a lot. Networks are so different between IaaSes, and it becomes very obvious when you have the same Kubernetes version, but the behavior is so different. Everything is the same. So while it behaves the same most of the time, there are certain differences which are really difficult to work with in failure scenarios or in bursty scenarios, or stuff like that… So then what are you doing? You’re just feeling good about what you’re using, until stuff just breaks, and you say “Ah, I wish I had chosen something else. What am I doing…?”

But it’s not so black and white. It’s not like you can use it or not use it. There’s good things and there’s bad things in it. So what I’m thinking is, as you mentioned - is it just like the hammer, and we just go with the hammer and nothing else? What if we use Kubernetes as something else? What if, for example, we use a platform? Because if you have a CDN in front of your app - and this is like a monolithic app - could you be deploying to Kubernetes and to a platform? You know, sometimes the platform works, and it’s good enough. But then how do you manage certain things which in a platform are more difficult to configure, like for example - I keep coming back to mixing runtime concerns and configuration concerns for your infrastructure… But there is something to be said about having everything in one place, and being able to see it, and understanding how all the pieces fit together. Cognitively, there’s less overload.

So could you have this world where you mix and match different runtimes, and everything just works, and it’s more homogenous? It doesn’t really matter where it runs, and almost like the first one to serve wins? So if Kubernetes is fast, that’s the first request that wins, and that’s it. If you have another origin in your CDN, which is much faster, then that’s the one that wins. And that’s the end of it. But I think that sounds very complicated to get it working. How do you do application updates? How do you manage upgrades? Now you have 2-3 things to upgrade, maybe, if they’re not managed, and then you still have the database problem. You have to move it somewhere, and then you have to connect to that… So I’m not sure whether that’s better, but I’d be curious to find out what does that look like, and work like.

[20:31] Yeah, I think computing is still very immature. We’re coming from a world where you write apps, you put them on a server, you make a bunch of system calls, and you write some scripts to point them at different servers to make them do things. We’re 30 years still doing that. At some point, you’re gonna get a different type of machine, and that machine is going to not be concerned so much with you doing all of these custom things to articulate what you want.

Think about the iPhone, for example. You don’t mess around. Here’s the SDK, here’s the distribution channel, and then everything else is kind of built around the machine. So the iPhone itself feels like this fully integrated thing that is very wise about what to do, and when security needs to be implemented across all the apps, there’s a common security framework in many ways built into the machine. I have never installed a dedicated security agent on an iPhone…

Yeah, that’s right.

…because it’s a different machine. It’s a better machine than the type of ones that we use today, that are very generic, unwilling to make too many assumptions, because they’re general-purpose.

I think the next set of platforms that will show up will do some of those things. I think about Cloud Run… Cloud Run has multiple runtimes. There’s one that is a little bit more security-focused, and it’s backed by gVisor, which emulates the kernel, but limits the system calls. We also have the V2 backend, which is more like a VM-based type of architecture, but really lightweight. If you’ve ever heard of Firecracker from Amazon – similar technology at play here, but it allows you to do things like NFS, and possibly mount GPUs, and other things in the future… But we decide which one you need based on how you wanna use the application; we give the user the power to choose. But it’s the same API.

So I do think in the future, one day you will be able to – and actually, we already see this. There’s a company called Vercel, and they are the company behind things like Next.js. But the marriage between their CDN and compute platform in the framework itself, when you’re defining your logic, there’s parts of things you can build. I think they call them something like durable functions, or something… And if you define one of these functions, they detect that and they will turn that function; instead of it running in the browser, they’ll move it into their compute layer, and then wire everything up on the frontend to make sure that it communicates with this backend. So that’s a great example of the platform being aware of the application needs, and making adjustments at deploy time. So I think we already have hints of this future today.

But does that mean that you need to pick platforms based on features, and you go to Vercel because they have this? Or you pick something else and then you miss on some other features that for example Cloud Run has? Is that what this means?

Yeah. That’s the price of innovation. You could wait ten years until everyone has it, and it’s standard, and spend all your time kind of building these features yourself… Because think about what really happens. Even in the world of Kubernetes, most people are just building these features themselves. And so you become the future lock-in. You have all of these scripts and tools you build, and the team relies on these tools that you’ve built, so when a new platform comes out, the team is like “Hey, we really can’t use that, because we’ve built all of this, we don’t wanna give it up. We’ve tuned it for ourselves.” So guess what - you might end up ten years from now stuck on that platform. And then someone else will be doing a podcast like this, talking about “How do people get out of these custom-built things that are no longer adding as much value as they used to?” So we always have to pay attention to how the world is evolving, and how we’re evolving in our own view of the world.

[24:09] Let’s imagine that you have a tech business, and you have to figure out where to run it, where to run all the tech stack. What would you pick?

These days I’m probably going to default to something that is container-based. And that doesn’t rule out things like Amazon Lambda, because the way I look at things like Lambda, which is a functions-as-a-service kind of event-driven platform - when I really look at it, especially now that they support container images for their packaging, I look at Lambda like I look at Ruby on Rails. Lambda is a framework for building applications, and if you use it, it is no different than using JBoss or WebLogic, where they say “Hey, we have an opinionated way of routing data to your application.” Some of those frameworks allow you to write just simple handlers, and ignore the rest…

So now as a technology company that’s going to be building services, number one, I definitely want something container-based, because we’ve proven - and I even have a GitHub repo - that shows you how to take a Lambda function and run it on Cloud Run. Because Amazon has a shim that would allow it to run as a normal HTTP server by just changing one of the entry points in your container image. So we’re gonna go with containers by default, and that opens the world up to platforms, or even low-level Docker on a laptop.

Next I’m going to think about my language framework. Do I want gRPC? Do I want something that’s REST-based, or do I want something like Lambda? But either way, I’m gonna be thinking about portability. I have business logic, and I’m gonna probably write that the same, regardless of the extremes.

But then, when we talk about protocol, I’m probably gonna have to support HTTP and REST. I do like the value of gRPC; it turns out you can do both of those in the same server, and they both call the same business logic. So that’s the way I’m gonna be thinking, because we have so much experience from the past that there’s no reason to go all-in on a single protocol. If you’re smart, you can just have an adapter pattern and support Pub/Sub, REST, WebSockets and gRPC, and not necessarily have to rewrite everything.
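The adapter pattern Kelsey describes — one piece of business logic, several protocol front-doors — can be sketched in a few lines. This is an illustrative example; the function and field names are made up, and the two adapters stand in for a Lambda-style event handler and a plain HTTP/REST (WSGI) server:

```python
import json


# Business logic written once, with no knowledge of any wire protocol.
def get_profile(user_id: str) -> dict:
    # Hypothetical lookup; a real service would query a datastore.
    return {"id": user_id, "name": f"user-{user_id}"}


# Adapter 1: an event-style handler, the shape a Lambda-like
# functions-as-a-service platform invokes.
def lambda_handler(event: dict, context=None) -> dict:
    body = get_profile(event["pathParameters"]["id"])
    return {"statusCode": 200, "body": json.dumps(body)}


# Adapter 2: a WSGI callable, the shape any plain HTTP server
# (wsgiref, gunicorn, etc.) can host. Same logic underneath.
def wsgi_app(environ, start_response):
    user_id = environ["PATH_INFO"].rstrip("/").split("/")[-1]
    payload = json.dumps(get_profile(user_id)).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [payload]
```

Adding Pub/Sub or gRPC would mean one more thin adapter calling `get_profile`, not a rewrite — which is the portability argument being made.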

So that would be kind of my mindset going forward, and then I’ll just pick the most convenient platform. Maybe today Kubernetes is going to give me everything I need in terms of GPU access, custom hardware, but I won’t be afraid of platforms like Cloud Run when I need to just run some APIs that’s talking to a database.

The database will be managed, we’ve already established that, right? That’s for sure. No two ways about it. Will you be using a CDN?

Yeah, I think your customers expect you to use a CDN at this point. No one is going to accept slow media. That world is over. Unless you have a very niche audience, and you have some very exclusive content that they’re willing to wait however long… But I think CDNs now are so cheap, so prevalent, so easy to integrate and use, that I think everyone in the world just assumes that they’re gonna get a fast experience from the frontend, no matter where you are in the world… And I just think, honestly, the only way to really meet that bar is to just use one.

Okay. So if you’re using serverless, does a CDN still make sense, or would you go with something like Edge Functions? What are you thinking there?

Honestly, we have to think about the – I always like to just start from the end user. What is the end user interacting with? And you would say “A web browser.” What loads first is typically the web page, or some framing of the web page. And thanks to things like Next.js and React Native, we know now that we build modular frontends that can then have components that talk to some backend. This is a really nice pattern. We see this pattern in mobile devices as well. The presentation layer is always close to the user, and so it’s the presentation layer that we wanna get to the edge as fast as possible. So we split it up.

So now we ask ourselves, “What is the logic necessary to support the experience?” If everything can be loaded from the CDN, then there’s nothing to do on the backend. I mean, we’re done here. But that’s typically not the case. People need to log in, there’s gonna be profile pages… And so now we wanna scope down to the smaller logic, meaning “Where am I loading that profile page from?”

[28:13] Now, to answer your question, you have all these edge functions… To answer the question “Is an edge function the right use case when I need to load a profile page?”, the answer is going to be “Well, does the edge function have access to the database?” Where is the database? If the database is on another continent, having an edge function isn’t buying me very much, so I might as well have - even though I’m spread across the world, I might do things like cache to multiple geographies, to make sure that read-only data is very fast… And maybe I route write semantics to one region where the data is close, and I can ensure that it’s going to be replicated to a failover site.

So when I think about the architecture, I think what most people will find is that serverless, or even something in Kubernetes, is probably great for the heavyweight business logic that probably needs to run more than a few microseconds or seconds. But then edge functions might do simple things like, say, a request comes in and you wanna evaluate the header, and depending on what happens, you wanna redirect that traffic to another region. Great.

So I think about edge functions as almost an extension or a plugin system to the networking stack that we’ve never had before. So that’s the way I think I see a lot of people using these edge functions today.

Break

[29:30]

With this architecture that we have so far, all the distribution, the database, the CDN, maybe some functions at the edge, maybe, but not too many, and maybe there’s a Kubernetes somewhere - how do you manage updates to all those things? You don’t have a big team, operationally maybe you’re ten people; not that many. How do you, first of all, understand what you have running, where, and know when you need to do an update here or an update there, how it all fits together? Even code changes. We can start with code changes, because you no longer have a single app. You have like a distribution of things, and then maybe a change here also triggers a change there. How do you centralize and make it simple for everyone to understand what is happening where?

Yeah, I think we always have this desire as humans to aggregate things for the sake of simplicity. If we put everything in the same configuration management tool, then it will be simple. But typically, when you look at that configuration management tool, there’s all kinds of if statements and weird logic and extensions that only the person who made that understands, and everyone just says “Hey, don’t touch it. Just use it as it is”, and now it’s very brittle.

I think the thing that does make things simple are things like “Here is the frontend. The frontend has these components, and it retrieves this data from these endpoints. It doesn’t retrieve this data from functions, it doesn’t retrieve its data from Kubernetes. It retrieves this data from endpoints.”

[32:16] So now if you’re a frontend developer, you can have endpoints be anything for your testing purposes. I could just run endpoints on my laptop and serve mock data, and the site will behave as designed. So now I say my deployment target for my frontend could be my local browser, or it could be a CDN that will then serve it to billions of web browsers. That’s simple to understand. So now you say “What tool will I use to complete that?” Well, it could be as simple as my IDE, where I hit Save, and it just goes to staging automatically, because I know what the contract will be. When it gets there, it’s going to use a certain set of endpoints, and those endpoints are required to have data.

So now let’s move to the endpoints. Any other team can create endpoints. You make a promise: endpoint.hightowerlabs.com is supposed to serve this kind of data, in this structure. And if you do that, then my frontend app, my mobile app, my watch, my smart washing machine - all of them will know what to do with that data. And even though that sounds more complex, it’s easier for everyone else to understand that there’s going to be data returned from this endpoint. None of them need to know about Kubernetes, they just need to know what my contract is. You need these credentials to get this data, and I promise to keep it up 99.99% of the time. So now that we have these contracts, I can go focus on my area and just assume you have your area.
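The contract idea above can be sketched as a schema check: consumers only verify that the promised fields come back, never who or what served them. The field names are invented for illustration:

```python
# A minimal sketch of an endpoint "contract". Whether Kubernetes, Cloud Run,
# or a mock server on a laptop produced the response is irrelevant to the
# consumer -- only the promised shape matters.

CONTRACT = {"id": int, "title": str, "published": bool}

def honors_contract(response: dict) -> bool:
    """Check that a response contains every promised field with the right type."""
    return all(
        key in response and isinstance(response[key], expected)
        for key, expected in CONTRACT.items()
    )
```

A frontend test suite could run this against locally served mock data, which is exactly what makes the laptop a valid deployment target.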

So now let’s talk about deploy targets. Now, that API - we know we need DNS, and I could decide to point the DNS at Cloud Run, I could point the DNS to Kubernetes, independently. So I think these logical boundaries, these very clear contracts allow people to move very fast, because now we’ve reduced the amount of coordination that we have to do in order to figure out what to do next. So as a developer, without talking to anyone on the endpoint team, I can just call Curl and see what data comes back, and I know what to do going forward. So now if I wanna use different tooling for the Kubernetes component - and think about this… You don’t necessarily need to provision the cluster yourself. You could decide “I’m just gonna use GKE Autopilot, which will automatically scale the nodes across the regions based on what my application needs.” So my Terraform provisioning is “Enable Autopilot.” That’s it. No flags, no cluster size, no node pools.

And then maybe another tool takes over to manage things like deployments, deploying to that particular target. To me, that is easy to understand and explain to the relevant parties. And you’re right, if one person had to think about all of these things, we’d think they would have to actually learn a lot, but I don’t think that is the case. When I context-switch, when I go to work, I work on a Chromebook. When I’m doing personal projects, I work on a Mac. If I wanna play a certain video game, I do that on Windows.

Humans have no problem switching context when the rules are clear, once I make the switch. It’s only when we start to muddy the waters by mixing too many concerns. If the frontend team is using Terraform in a way that doesn’t actually make sense, then they get confused and now they’re spending so much time battling Terraform, instead of just saying “Save in the IDE, run some tests, if they pass, deploy the new set of static files to the CDN”, and be finished.

Okay. And where do you capture the configuration which knows – do you centralize it somewhere, or does every team have its own context which says how this thing goes out? I’m trying to figure out the layer which makes things happen. As a developer, I only need to know that this thing will happen. But what is that layer that actually makes it happen?

[36:08] So to me, I’m really big about workflow. So let’s think about developers that don’t work in our company. They need documentation. So if I go to the GitHub API docs, it will say kelseyapi.github.com, you need to log in this way, and here’s the data that you will get from these endpoints. So now I see that this is the contract. If I’m using Puppet, Chef or Ansible, I can copy that URL and say “When you deploy my app, tell it to use this endpoint.” Okay. But I need the documentation first, in all use cases.

Now, depending on the tool, let’s say we’re doing something simple like CI/CD. Then the CI/CD system needs to know this configuration detail if we need to be able to have a dynamic value depending on the target environment. So then I think the question you’re getting at is: where is the source of truth? Because we know the CI/CD system isn’t the source of truth, because the CI/CD system isn’t the owner of that particular endpoint, so it cannot be the source of truth.

So in that context, if we’re at the same company, I might decide to put all of this in some configuration store like Consul or Etcd. Some people just use a spreadsheet or a wiki to say “If you’re in dev, this is the URL you should use.” And I might go to Ansible and say “Hey, Ansible, if environment = dev, then use this URL.”

So I think for humans, it’s all about two things - source of truth, and then how do we tell the tools that need to use it about the source of truth. If you wanna be 100% programmatic, you still need the documentation that says “This is the value, for these reasons”, so I can understand.

Why is this endpoint the right one for staging? Because as the team that manages this endpoint, we have chosen this URL, for these reasons. Great. That is the actual authority or source of truth. Then we go and say “Alright, if you’re using Ansible, here’s where you put it in your inventory file or your playbook. If you’re using Kubernetes, put it in a config map, and here’s how you reference it between environment variables or command line flags.”
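The environment-to-URL mapping described here is simple enough to sketch directly. In practice it might live in Consul, Etcd, or even a wiki; the URLs below are placeholders:

```python
# Sketch: the documented source of truth is just a mapping from environment
# to endpoint URL. This is what an Ansible inventory, a Kubernetes ConfigMap,
# or a CI/CD variable would ultimately be fed from.

SOURCE_OF_TRUTH = {
    "dev": "https://dev.endpoint.hightowerlabs.com",
    "staging": "https://staging.endpoint.hightowerlabs.com",
    "production": "https://endpoint.hightowerlabs.com",
}

def endpoint_for(environment: str) -> str:
    """Resolve the authoritative endpoint for a given environment."""
    try:
        return SOURCE_OF_TRUTH[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment}") from None
```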

So I think that’s the way I think about it. So to me, it’s not really about a universal place, because I used to work at Puppet Labs and we had this vision that everything will be in Puppet. It didn’t make any sense, because it turns out that Puppet is not the source of truth. People are. As weird as that sounds. If the Oracle team changes the database password, it doesn’t matter what Puppet says. Everything is now broken.

Yeah, for sure.

So that’s the way I think about it.

And when it comes to people capturing the source of truth, where would they capture it? Would they just write documentation? Is that it?

I think if we’re being very professional about it, we should write the documentation. Then I think when it comes to things like secrets, things that we don’t wanna put in plain text… You know, I work at a place like Google, where we do have these central secrets storage things that allow you, based on your credentials, to fetch secrets that you need. But more importantly, we also teach the platform itself how to fetch the secrets that it needs. So role-based access control to a configuration database is probably a good next step. This is why things like Zookeeper and Etcd and Consul have become very popular.

In the Kubernetes world, this is where people again start to mix concerns, and I see them get into trouble. If you have a single Kubernetes cluster, you say “Okay, we have these URLs, and we have these passwords”, so you put it in a secret, and you put it in a config map. In your mind, everything works great. Then someone says “Hey, we should have a production cluster and we should have a dev cluster.” So now where do you put the credentials? You have to decide. We know we can’t put the exact same values in both environments, because that would be wrong. We don’t want dev pointing to the production database. So mentally, we all in our minds say “Okay, these values go in production, and these values go into dev.” Great. So you just say kubectl apply and you put them there.

[40:00] So how do we know that those are the right values? You have no idea. Kubernetes is just doing what you told it to do. It still isn’t the source of truth. So to me, in that case some people decide - and I’ve seen people do this, and I actually like this pattern - “Let’s have a configuration database that’s managed by RBAC controls.” In my keyspace - let’s just say you’re using something like Etcd, and it has key-value pairs - I can say “This path is for production key-values. This path is for dev key-values.” The keys may look the same, but the values may be different. So I can put all the right attributes there. I can now build tooling to do things like diff between the environments. If you wanna see values in dev, try your credentials. If you have access, you can read the values.
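The per-environment keyspace pattern, plus the diff tooling it enables, can be sketched with a flat dictionary standing in for something like Etcd. The keys and values are illustrative:

```python
# One key-value store, with per-environment path prefixes. The keys look the
# same across environments, but the values may differ -- which is exactly
# what a diff tool can surface.

STORE = {
    "/production/db_url": "postgres://prod.db.internal",
    "/production/log_level": "warn",
    "/dev/db_url": "postgres://dev.db.internal",
    "/dev/log_level": "debug",
}

def keyspace(env: str) -> dict:
    """All keys under one environment's prefix, with the prefix stripped."""
    prefix = f"/{env}/"
    return {k[len(prefix):]: v for k, v in STORE.items() if k.startswith(prefix)}

def diff_envs(a: str, b: str) -> dict:
    """Keys whose values differ between two environments."""
    ka, kb = keyspace(a), keyspace(b)
    return {
        k: (ka.get(k), kb.get(k))
        for k in ka.keys() | kb.keys()
        if ka.get(k) != kb.get(k)
    }
```

Access control (who may read the dev vs. production prefix) would sit in front of `keyspace` in a real store.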

I think Vault is another use case, even though Vault is optimized for secrets. You might wanna use it for your universal store, because it has some nice management properties around creating dynamic secrets and rotation. But either way, I think we get the point - having a database where you can centralize these authoritative configs for all environments. Once you have the data there, now you can teach the tooling to synchronize.

So imagine a world now where you can have a cron job that says “Every one minute…” - this is foundational promise theory. “Every one minute I want you to go to the authoritative source of truth and make sure Kubernetes agrees.” So even if someone types the wrong value in Kubernetes – maybe you don’t even give people access, ever, to do write operations in Kubernetes for application secrets and credentials and passwords and configuration. You say “Do not touch that. That’s read-only. It’s a read-only cache. The source of truth is here.” Then people say, “Well, Kelsey, how do you get the secret into Kubernetes?” I say “Okay, this is why we have a background process that reconciles the source of truth and puts it in all the right places.” Some stuff will go into CI/CD, some stuff will go into Kubernetes, some stuff may go to your local laptop, because you wanna troubleshoot a production bug, and so you actually need to be pointing to a production endpoint. Maybe not a database, but an endpoint.

So imagine a tool that says “Hey, make my environment look like production. Grab the right credentials and put them in my local config file.” That’s the way to think about it fundamentally. And then at that point, Kubernetes is no longer such an important component; it’s just a place to store authoritative values.
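The reconcile loop described above, with the deployment target treated as a read-only cache of the source of truth, reduces to a few lines. Both stores are plain dictionaries here, standing in for Etcd/Vault on one side and Kubernetes secrets on the other:

```python
# A minimal sketch of promise-theory reconciliation: a background process
# (e.g. run from cron every minute) makes the cache agree with the
# authoritative store, undoing any hand edits or drift.

def reconcile(source_of_truth: dict, cache: dict) -> list:
    """Overwrite drifted or missing cache entries; return the changed keys."""
    changed = []
    for key, value in source_of_truth.items():
        if cache.get(key) != value:  # someone hand-edited the cache, or it's new
            cache[key] = value
            changed.append(key)
    for key in list(cache):          # drop keys that no longer exist upstream
        if key not in source_of_truth:
            del cache[key]
            changed.append(key)
    return changed
```

Run on a schedule, this is the mechanism that lets you take write access to the cache away from humans entirely.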

I think I’m starting to get it, in that it’s less about the specific tooling and it’s more about the principles, the fundamental principles, that are more or less the same, even with the tooling changing. And while the principles are great, I still need to know where to store my secrets. Like, what is the tool, the source of truth where I can put my secrets, so that everything else, all the other tools, can get to them? Do you have something that you use for that?

I mean, lots of people like Vault. So Vault has a trade-off… Vault is gonna do a lot to optimize for secrets management, things like rotating secrets based on time, creating one-off credentials… So maybe you want a real-time database password generated based on the request at hand… So it’s optimized for an extreme use case of secrets management. And if you want dynamic secrets, Vault is great. You could probably even stuff regular configs in. I’ve done so. Once you have a secret as part of a config object, the whole thing now is secret. It’s been tainted.

You can get into a different discussion - some people believe that you should just have low-level key-value pairs at the very lowest level, and then it’s up to tools like Puppet, Chef and Ansible to take those values and put them in a template, a.k.a. a config file, and distribute those at deploy time.

So if someone told me I had to make a singular choice, then maybe I’m picking something like Etcd, that is kind of focused on key-value storage; I can encrypt data at rest, and I can encrypt secrets within Etcd itself. And then you have to have the key to decrypt.

[43:59] So that’s it. This is my authoritative place to put every type of configuration, for every system that we wanna use. Now I need tooling to tell other systems where it is. So if I’m using Puppet, we know Puppet has tools like Hiera, that can actually plug into other backends. I wrote a tool called confd, designed around these principles. Confd can say “Is the data in Vault? Is the data in Etcd?” Actually, you can mix and match. I had functions that said “This value comes from Vault, this value comes from Etcd.” And then based on that, I will assemble the template and then distribute it to the application. But this works because I have a clear source of truth.
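The confd-style mix-and-match can be sketched as follows. The two backends are plain dictionaries standing in for Vault and Etcd clients, and the `backend:key` reference syntax is invented for illustration, not confd's actual template language:

```python
# Sketch: each value declares which backend it comes from, and the tool
# assembles a config file from a template before distributing it.

VAULT = {"db_password": "s3cret"}                      # secrets backend
ETCD = {"db_host": "db.internal", "db_port": "5432"}   # plain config backend
BACKENDS = {"vault": VAULT, "etcd": ETCD}

def lookup(ref: str) -> str:
    """Resolve a 'backend:key' reference, e.g. 'vault:db_password'."""
    backend, key = ref.split(":", 1)
    return BACKENDS[backend][key]

def render(template: str, refs: dict) -> str:
    """Fill a config template with values pulled from mixed backends."""
    return template.format(**{name: lookup(ref) for name, ref in refs.items()})

config = render(
    "host={host} port={port} password={password}",
    {"host": "etcd:db_host", "port": "etcd:db_port", "password": "vault:db_password"},
)
```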

And if confd is not your favorite thing anymore, Kubernetes comes out, “Guess what - you don’t have to change the source of truth.” You now say hey, when you provision a cluster, you might even have an operator that just synchronizes from Vault, and just sits there. You put in a – and I actually had a GitHub repository years ago to show people how to think about this. I had like a Vault operator, and what you did was you said “Here’s a list of keys that I would like to have replicated from Vault into this cluster.” And so my operator had credentials to Vault, would see these Kubernetes objects and say “Oh, you’re telling me that you would like these vault secrets synchronized into these Kubernetes secrets, so that the other apps could just work. Okay, I have you.” And then it just did this real-time synchronization.

I like it. I want it. Where do I get Vault to run?

So actually I have this project where I’m running Vault on Cloud Run, and I use that as an experiment to make sure that things like always-on CPU were working. Because Vault scaling to zero makes no sense; Vault needs to be able to time out passwords and delete them from the system, or invalidate credentials so they no longer work.

So when you think about running Vault, this is where I also like to think about control planes versus data planes. Vault to me is a control plane. It has an API, and if you’re storing all your configs or secrets there - well, those secrets are universal truths that should remain the same regardless of deployment target. So if I move Vault out of Kubernetes, into Cloud Run, then I can tell all the Kubernetes clusters that this is where you get your secrets.

So to me, I like to run my control planes outside of the thing that I’m controlling.

It makes sense. And then that thing which runs your control planes - it’s self-managing, self-updating, in the sense that you don’t have to worry about the thing; it’s just a managed thing. All you have to worry about is maybe your control planes, maybe you need to update Vault, like upgrade the version, and then you do that yourself…

So do you have like some files committed somewhere in version control that you update and you see how they work? How does that mechanics work?

Yeah, so typically the Cloud Run config - I actually have it on GitHub. It’s called “serverless Vault on Cloud Run.” If you look at the commands, it’s just like this gcloud command that says “Use these parameters and this version of Vault.” Now, ideally, if you run that command again, Cloud Run takes care of the whole rolling update thing, and all those semantics. You could turn that into Terraform; Terraform supports the Cloud Run deployment API. You could probably also use Crossplane if you really wanted a Kubernetes API. But either way, we know that there’s an API to allow you to update this container image and have Cloud Run do what it needs to do to rotate it out. But what we’re really talking about here is the lack of a managed service, because at some point you should be able to get a Vault endpoint and protocol from any number of providers. So the reason why I was doing this in Cloud Run is to hint at the situation. But really, what you want is to say “I want Vault version X.” That’s it. Keep it running.

[47:56] Pretty much. That makes sense. I think this changes a few things… Not too much, but it does, and I definitely have to go back and listen again to what you’ve just said, because I have so many thoughts, parallels, like fractals; it goes in so many directions.

Could I drop another good example?

Letsencrypt. Before Letsencrypt, when I started in tech, early 2000s, when you wanted a certificate, you had to do this song and dance - get the domain name, fill out a form, pay two to four hundred dollars…

I remember that.

They would tell you your certificate is ready, you download it, and you’d better watch, because it’s going to expire, and then you can do the dance again. So that was something that everyone thought that they would have to do forever. And then one day we get a protocol that says “Don’t worry about the provider, don’t worry about where it’s running. Now you can use the ACME protocol, and this thing will now automatically refresh, automate DNS provisioning, everything.” And now it’s run in over – I think they said half the internet is using Letsencrypt certificates, because of the protocol and the automation, and it’s a managed service, run by a non-profit. That’s a good example of taking a very common protocol, this idea of minting a certificate from a certificate authority, going through all the checks to ensure that you’re the proper owner of the domain, and turning that into a new protocol, the ACME protocol.

And even though it’s an open source project, you can go download the server yourself. I’ve used that server a lot in my tutorials. The majority of the world just says “Nope. I’m just gonna go use the hosted, managed version of that protocol, because I always have the option of running it myself when the time comes.” So this is where I think even – especially for open source tools, we already know we will have the option to run it ourselves, so the fear of lock-in should be reduced. And given that these things are open source, there should be a number of providers, from DigitalOcean to Google Cloud, to even a colo-provider. Why not have a hosted Vault instance that I can use, backed by maybe something like HSM to give me ultimate security… But this is where I’m thinking we’re gonna have to go in the future.

That makes sense. I’m still thinking that, even with Letsencrypt, we still have to somehow distribute that certificate to our CDN, because we’re using it in multiple places. Wherever you want to use that certificate, you either get multiple certificates, but there are certain limitations in how different providers ask you to have certain DNS entries, and then sometimes they clash, because they’re CNAMEs versus TXT records… Anyways. It sometimes gets complicated along those lines. But I know what you mean about the certificates and how everything is so much better these days than it used to be in the past, for sure.

Let’s talk about that for a second. So if you’re thinking the previous way - yes, you’re gonna go try to get the certificates, download them and store them somewhere and distribute them. But as of even four or five years ago, Cloudflare just says “Don’t worry about that. You want Letsencrypt? Just click here.” Google Cloud - “You want Letsencrypt? Just click here. We will do everything. We have the domain, we know how the protocol works. We will update the domain, we will do all the things, we will call the API, we will get the certificates and we will actually put them in a load balancer.” There’s no configuration. So now the configuration turns into a checkbox. Letsencrypt = true or not. No “Go get it, put it here, do this, move this over here…” That’s over.

So this is what I mean by if we focus on the protocol and we get universal agreement on the protocol, we can reduce the amount of configuration management and automation that we have to do as custom one-offs.

Break

[51:30]

I’m still trying to figure out in the back of my head what you mentioned about not having a universal tool that combines them all, because it’s a fool’s errand, and how we should focus on the documentation side, on the contract side, on what we expect one another to implement, and how to use and how to work. Do you have an example of an architecture or a stack that works like this? Someone that’s maybe public, that shows how they do this… Because I would love to see how they document those things, and how that’s being used. A real world example. Is there such a thing that you know of?

Azure, AWS, Google Cloud. You go there. API documentation and GCloud command line tool, and Terraform support. So they have all the layers. They say “Listen, we don’t know if you’re gonna write your own tool or not.” Maybe you work at HashiCorp and you need to see all the API calls and all the return responses, and the rules about authentication. There’s all the docs. Now, maybe you are a developer. So - okay, we support Golang, we support gRPC, we support all of these things. So here’s the SDK to make that even easier for you to interact with the API. And then maybe you don’t care about any of that, and now you could just use Terraform, and we have a module that lets you represent those things as just something native, in Google Cloud in particular, whereas Amazon has CloudFormation. We even have tools that just let you directly declare what you want, and then we will hold the promises on the config. We tend to have all of these things because we have no idea what your entry point will be, or if you need custom tooling.

[56:04] So I’m thinking from the perspective of the end user. As the end user, who is making use of all these tools, and who has to figure out how to combine them, how to use them. But the end user in this case is - I have an app, and this app has a couple of dependencies; it doesn’t matter whether it’s a bunch of serverless functions, or it’s like a monolithic app… It has a couple of dependencies on a CDN, on a database, on this, that and the other. Do you document that, so that someone can write the deployment for it? Because I’m thinking on the implementation side - I have this app, I have to get it running, it has all these dependencies… How do I make that happen? How do I capture what it means to run the app, so that the developer just git pushes, and anything that needs to happen behind the scenes does? The developer doesn’t need to care about that. Which CDN we use, or which database we use, and so on and so forth.

So if you think about Atlassian - they make JIRA, they make Confluence, they make these very popular tools people use. If you go look at the JIRA documentation - I used to run JIRA for a very long time, and I’ve seen it evolve over the years. They have documentation that says “This is JIRA. We need this version of Java, and we need to connect to Oracle, or Postgres. And you need to think about this Java JAR to connect to Postgres, because the default one doesn’t work well. Here’s everything you need to do.” And so with those raw instructions, I even have a GitHub repo that says “Here’s how you take those instructions and articulate them into Kubernetes.” When I was at Puppet, we had a JIRA Puppet module.

So that’s why companies like Atlassian have to start with the documentation, because they have no idea what your tool preference is. But we give you everything you need to automate everything.

Now, on the extreme, they have a managed service that says “Forget deploying JIRA. Click this button and we will give you a working JIRA that has a database, it has everything. You don’t need to do anything.” So everything in between that is you picking the tool of your choice to deploy to. I’m pretty sure there’s some Terraform JIRA modules somewhere, maybe even maintained by Atlassian. I’m sure there’s some Kubernetes configs floating around the internet that will try to provision all the things that you need. I’m pretty sure there’s a Helm chart that is an opinionated thing… Those things are in the middle, but the thing you asked about is “Where’s the thing that just says “Give me JIRA” and it automates everything?” Well, that’s what we called a managed service; that’s what we call a SaaS. And I think JIRA might even let you pick your region for sovereignty issues. “I only want mine in Europe.” “Okay. So click this button. You now have an endpoint in Europe that is 100% ready to go. Just focus on using JIRA. Oh, and if you ever need to back up your configs, click this button. And if you wanna run your own JIRA instance, import this data into your JIRA instance and you can continue on.” But those are the two extremes.

I see. Let me try again with giving Changelog.com as the example, because that’s what I have in mind, and maybe that is the missing piece. This is the last try. So Changelog.com is a Phoenix app, which is a bit like Ruby on Rails. It runs on Elixir, with the Erlang VM as the runtime. It is a monolithic app, it does everything, single repo, monorepo, it has a PostgreSQL database, and it has the CDN integration, the DNS integration, certificates… Just a bunch of things; all that stuff. S3 for media storage… Whatever. Should you capture all of that in a way that is fully automated, easy to update, easy to iterate on in a way that is self-contained, so that the developers that work on the app need to know as little about it as possible? Or do you document all of that so that people who run it know how to configure their version of Changelog? Do you see where I’m going with this?

I see exactly where you’re going. My entire career, my role has always been “Document the manual process first. Always.”

[59:54] Because if you go and do everything in Puppet, now I’ve gotta read Puppet code to see what you’re doing. How can I suggest anything better? So if you write it down manually, and you say “First get a VM, install Changelog, then take this load balancer, put the certificate here, then get this credential, put it in this file, then connect to Postgres this version, with these extensions.” So now I can see the entire thing that you’re doing, and then the next thing I do is say “Okay, now that we understand all the things that are required to run this app, I wanna see the manual steps that you’re doing. All of them.” We build the app using this makefile; we create a binary. We take the binary and we put it where? You’re not storing the binaries anywhere? “Oh no, we’re just making this assumption that we can just push the binary to the target environment.” You need to fix that, that’s a bad assumption. You need to take the binary and preserve it, so that we can troubleshoot later in different environments, and we can use it to propagate. “Oh, okay Kelsey. Good idea.”

So we’re just gonna fix the manual process until it looks the best we can do for what we know at the time. Now, once we have that, I’m gonna give that process a version. This is 1.0 of everything; we’ve cleaned some things up, we saw some bad security practices, we’ve cleaned up the app, so now go automate that. But while that is automated, we’re gonna go work on version 2.

It turns out that all this stuff that we’re doing at the app level we should move to the load balancer level. Rate limiting, certificates… All of that. We should just move all of that there, and take it out of the app to simplify the app. We’re also thinking that Prometheus might be a better thing than what we’re currently doing, so now we’re going to add Prometheus to the mix. And then we’re gonna test everything manually, and the upgrade process.

So once you really understand it, okay, go automate that. So automation to me is not the source of truth. Automation is a by-product of understanding. Automation should be a serialization of understanding. So if you try to automate something you don’t understand, you’re just gonna end up with a mess. “Hey, why is this Puppet code written this way?” “I don’t know, man. It just works.” “Well, how will we fix it?” “I don’t know, man. I don’t think you should touch that.” Because no one knows how it’s supposed to work. So when we need to switch to a new tool, what do you do? You end up porting this mess to the new tool, because you think you need it all.

Every time I’ve done this, if I have clean documentation when Kubernetes comes around, I don’t need to look at the Puppet code. The Puppet code is of no use to me. So I look at the documentation and say “Okay, Changelog, guess what we need to do. We need to actually create a container image for this manual part of the step. Once we have a container image, I can delete everything below that, because now it starts to be kubectl apply.” That’s v2 of the doc.

The reason why I love that process is because now I have a way of testing to be sure that I haven’t missed anything… Because you know what happens in Puppet. There’s some issue in production, you make a patch to the Puppet code, it’s working now, and the only people who know about that are the few people who reviewed and made the change. Because no one’s gonna go and look at those very low levels to say “Oh, we didn’t know that this directory needed a chmod before you write this file, and then you’ve gotta chmod it again.” Something’s wrong with that process, and just because you use automation as a band-aid doesn’t mean the process is any good.

[01:03:08.23] So documenting the process is the first step before you automate it.

Yeah, you want a blueprint before you build the house.

Right. I see. So you can be an actual engineer, not just a developer, right? Engineers without blueprints - what would they be…? Developers, maybe. I don’t know. That is some serious food for thought for me right there. I think I have some homework to do there. Thank you. That was big, meaningful. It was definitely worth it for me to keep asking you until I got it. Hopefully, it wasn’t too annoying for you, but for me it was great.

As we’re preparing to wrap up, what do you think is the most important take-away for our listeners from our conversation?

I think the focus on those fundamentals… The reason we find ourselves as practitioners, as an industry as a whole, constantly migrating between different platforms, journeys and digital transformations is that I don’t think we ever understood the fundamentals. The fundamentals are very clear. If you’re thinking about application delivery, for example, we know what the fundamentals are. Ideally, we’re versioning our software, and we need to build that software in a reproducible way.

15 years ago maybe you were just creating WAR files or ZIP files. Maybe you were sophisticated and you were creating RPMs or DEB files so you can use a package manager. Those have always been good ideas. And so we know packaging is a by-product or the artifact of assembling code and its dependencies and getting it ready to run. If you did that 15 years ago, adopting something like Docker is not hard to do; you say “Okay, we’re gonna take the RPM, I’m gonna put the RPM in a Dockerfile, and then package it up.” Just another packaging step. And if you decouple the packaging from the deployment, then you get the ability to change just the last mile.

So even if you’re just using RPMs and then you add container packaging, and even if you’re using something like Puppet to deploy those RPMs, in parallel now you can actually just swap out the Puppet step for the Docker step, and it works. But you have to understand the fundamentals and the boundaries between these concepts.

I think as an industry we’ve been pushing “Automate. Automate. Automate”, and we haven’t been saying “Understand. Understand. Understand.” Because if you understand what you’re doing, you can automate if you want to. And sometimes, I’ve seen teams where maybe you don’t need to automate as much anymore. Because if you really have a clean process that says “Okay, we’ve automated the production of a Docker container. And look, we don’t really have more than two environments.” So QA says “Docker run this container image. It works. We go to production. Docker run the same container image. It works.” And maybe that’s all your team needs to do, and - look, that’s okay. But I think understanding allows us to make that decision, and then we can decide what tool is the best for the job. So that would be hopefully the best takeaway today, is that these fundamentals can be applied by different tools. And the tools are not the fundamentals. It’s the ideas and concepts that we have been talking about today.

Kelsey, this has been truly eye-opening to me. I wasn’t expecting this. I wasn’t expecting it to be as good as it was. Thank you very much for your time. It’s been an absolute pleasure having you, and I’m very much looking forward to next time. Thank you.

Awesome. Thanks for having me.

Changelog

Our transcripts are open source on GitHub. Improvements are welcome. 💚
