Go Time – Episode #299

All about Kafka

with Matthew Boyle from Cloudflare

In this episode Matt joins Kris & Jon to discuss Kafka. During their discussion they cover topics like what problems Kafka helps solve, when a company should start considering Kafka, how throwing tech like Kafka at a problem won’t fix everything if there are underlying issues, complexities of using Kafka, managing payload schemas, and more.

Chapters

1 00:00 It's Go Time! 00:57
2 01:34 Matt's qualifications 01:27
3 03:01 What is Kafka? 09:23
4 12:24 Sponsor: Changelog News 01:32
5 13:57 Barriers to entry 21:44
6 35:41 Incident.io 00:57
7 36:38 Chasing the next language 13:07
8 49:45 CRDT 05:04
9 54:49 Kafka and Go 10:22
10 1:05:11 Protobuffs 03:27
11 1:08:38 Unpopular opinions! 00:23
12 1:09:02 Matt's unpop 05:33
13 1:14:34 Kris' unpop 09:28
14 1:24:02 Outro 01:38

Transcript

Hello everyone, and welcome to Go Time. Today I am joined by fellow co-host Kris Brandow. Kris, how are you doing?

I’m doing well. Happy New Year, since this is our first episode of the year.

I guess this is our first episode of the year. I completely zoned on that. I guess we let the ball drop and didn’t have one out last week…

It happens.

Alright. We’re also joined today by Matt Boyle. Matt, how are you?

Yeah, good. Thank you. Happy new year. It feels weird to be saying that at this point in January. I thought I’d done all my Happy New Years, but Happy New Year, folks.

Alright, so today, Matt is joining us to talk about Kafka. So I guess we could just start off with Matt - would you like to tell us about your experience with Kafka, so people know why you are authorized to talk about it?

Authorized is a strong word, but… Yeah, I’ve been interacting with Kafka in some capacity every day for the last three or four years. I’m an engineering manager at Cloudflare, and part of my team’s responsibility is to provide tooling to make it easier for other teams to interact with Kafka, and provide sensible defaults around various different configurations, and how you work with it etc. And nearly all of that is written in Go.

I also recall that you wrote – you’ve written a couple blog posts about Kafka, haven’t you?

Yeah, I seem to have accidentally ended up in a niche where I write about Kafka quite a lot, so… I wrote one for the Cloudflare blog, which is basically just telling the history of Kafka usage at Cloudflare. I’ve also written a few bits and pieces and done a couple of conference talks around dealing with failure with Kafka, too. When things go wrong, how you can think about it, how you can reason about it, and how you can build patterns to help it self-heal, if you will, make your applications carry on running even when things aren’t behaving how you’d like them to.

That’s always good to know, because as we all know, computers don’t always work the way we want them to. Kris had lots of fun with that today… [laughter]

Yeah… Computers rarely work the way we want them to.

Okay, so I wanted to start off thinking about this more from a beginner perspective, of talking about what Kafka is, and leading into why somebody might start introducing it into the tech stack. And then we can dive into more complex subjects as we get there… But rather than just jumping right into the deep stuff, I wanted to talk about just what is Kafka at a sort of beginner level.

Yeah, for sure. So one thing I would recommend as well is after this episode go and listen to the Event-Driven Architecture episode of Go Time, from - I think it was last week… Because they actually dive into a lot of the reasons you might use Kafka, in a lot of detail, and even talk about Kafka a tiny bit at the end of the episode. So it complements this episode really nicely. I was listening to it in preparation for this, and just nodding my head all the way through it.

But Kafka is essentially a – it can be used for a couple of things, but effectively, it’s a distributed data source. So it’s optimized for ingesting, processing and streaming data. And by streaming, I mean it could be coming from thousands of data sources at once, landing in and going through Kafka before either going to a different system, or maybe even sitting there.

One really wild reading of Kafka is it could be used as a database. It’s an append-only log. So there’s some really interesting use cases that come because of that… Kafka can process data sequentially or incrementally; it depends what you want to use it for… But it mostly gets used for data pipelines, moving stuff from one place to another… Or how we use it at Cloudflare often is using it as a message broker, where it sits in the middle of two applications, and can help you build an event-driven architecture. It means that you can integrate with other systems without knowing too much of the details about how those systems are going to communicate. You just emit messages and events of things that are happening around your company, and other things that are interested can listen and pick those up. So data streaming, and sort of that mediator between lots of different applications are really popular use cases of Kafka, typically.

So when you’re talking about these circumstances where you’re emitting messages and not really understanding – you know, you don’t have to know too much about the other services… What circumstances lead to a company needing to do that? Because a lot of people when they get started, they build a monolithic application that everything’s sort of self-contained in one thing. So how do you get to the point where something like this is necessary?

Yeah… So it gets down into the microservices conversation a little bit. And there’s tons of people, and there’s tons of great writing about why you should never use microservices. I’m a big fan of them. I think they help you manage failure and keep it isolated. So once you start to get to a certain scale, where you’ve got huge feature sets, you’ve got huge products, and maybe you have a lot of diversity in your products, where it can make sense for them to scale or fail independently, it might be a good time to start thinking about whether an event-driven architecture would be beneficial to you… And you may choose to use Kafka as sort of the mediator for some of that conversation, if you will.

Cloudflare is a great example. At Cloudflare we have hundreds of products, and we have a very clear separation of what some of those products do. We have things like CDN, which obviously can help serve up assets that are commonly used from the edge… But we also do things like alerting, which we mainly do on post-processing; things that happen after the fact. Those two things are not coupled at all, and they can survive and offer value to our customers independently. So having them as separate services makes a ton of sense. And the service that serves our CDN will shoot messages out, that will get read by other services at Cloudflare, but they will be completely separate.

[00:06:15.03] So when you’re talking about messaging brokers like Kafka, and using it that way, do you happen to know what are companies typically doing before they get to that point where they add a messaging broker of some sort, or use a messaging service of some sort in the middle? And why does that technique or whatever they’re using tend to start to fail, essentially?

Yeah, that’s a great question. And I think – I’ll start by saying that I think you can get an awful long way without using Kafka. And it might be that you never need it. It does come with a bunch of complexity, which we can dig into in a little bit… But, it’s definitely not a given that every company will end up needing or wanting something like Kafka. I think you described the initial situation that most folks start from, which is the monolithic application, or maybe a couple of microservices that talk to each other synchronously. And then maybe you’ll start to think about other use cases.

So maybe you started with a couple of services that were talking to each other directly, but you wanted to build a data warehouse, so you wanted to have like data pipelines of processes that happen after – let’s say after a user is created. The user is created, and instead of just creating an account, you want to store that somewhere else. So maybe it kicks off some marketing campaign, or maybe you have to report it to some government body, because maybe you’re working in banking, or something like that. There might be like 5, 6, 7, 8, 9 things that happen because a user was created.

And for a while, you can just write more code, and you have this stack of calls that you’re going to make… But if one of those fails, what happens to the four that happened before, and what happens to the four things that were meant to happen after it? And when you start getting into these situations where you have a bunch of things that are kind of independent, but at the moment they’re being treated as a stack, it could be a good opportunity to start thinking about “Well, which of these things need to happen in the–” what I call the user flow. Which of these things will need to wait for the user? Does the user need to wait for it to complete before I can give them back a 200 response code? And most of it is just that we’ve accepted your application, or that we’ve created a user; that’s the only thing you need to know about at this time, right? So making you wait until we’ve submitted something to our data team, or that we’ve submitted something to a third party that is part of our internal processes, but not part of actually creating a user account is probably a good time to start thinking about splitting some of those things out.

So as part of that conversation, you might look at a few different message brokers. And there’s tons of options out there. You can do stuff in code… I used to write – I don’t know if I’m allowed to say this in Go Time, but I used to write PHP a long time ago… And Laravel has some really nice stuff built into it, that allows you to do sort of internal message communication, even within the application itself. You can build almost like a message bus within the application. And that can be a really great place to start. And if you interface that well, it’s very easy to swap that out to be Redis instead. Redis is very, very easy to spin up as a message – maybe message broker is the wrong term, but it can do like publish subscriber stuff, where you can write things to Redis and other things can listen, and they can start taking actions on there.
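As a rough illustration of the "message bus within the application" idea, here is a minimal Go sketch. Everything in it (the `Event` type, the `Publisher` interface, the `Bus` type, and the `user.created` event name) is hypothetical and chosen only for illustration. The point is the interface: callers depend on `Publisher`, so the in-process bus could later be replaced by something backed by Redis or Kafka without changing them.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical domain event; the name and fields are illustrative.
type Event struct {
	Name    string
	Payload string
}

// Publisher is the seam that lets the transport change later
// (in-process today, Redis or Kafka tomorrow) without touching callers.
type Publisher interface {
	Publish(e Event)
}

// Bus is a minimal in-process publish/subscribe implementation.
type Bus struct {
	mu       sync.RWMutex
	handlers map[string][]func(Event)
}

// NewBus returns an empty bus ready for subscriptions.
func NewBus() *Bus {
	return &Bus{handlers: make(map[string][]func(Event))}
}

// Subscribe registers a handler for events with the given name.
func (b *Bus) Subscribe(name string, h func(Event)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.handlers[name] = append(b.handlers[name], h)
}

// Publish calls every handler registered for the event's name.
// Handlers run synchronously here to keep the sketch simple.
func (b *Bus) Publish(e Event) {
	b.mu.RLock()
	hs := append([]func(Event){}, b.handlers[e.Name]...)
	b.mu.RUnlock()
	for _, h := range hs {
		h(e)
	}
}

// CreateUser depends only on the Publisher interface, not on Bus.
func CreateUser(p Publisher, email string) {
	// ... persist the user here ...
	p.Publish(Event{Name: "user.created", Payload: email})
}

func main() {
	bus := NewBus()
	bus.Subscribe("user.created", func(e Event) { fmt.Println("marketing saw", e.Payload) })
	bus.Subscribe("user.created", func(e Event) { fmt.Println("reporting saw", e.Payload) })
	CreateUser(bus, "ada@example.com")
}
```

A real broker would deliver asynchronously and survive process restarts; this sketch only shows the decoupling between the code that emits an event and the code that reacts to it.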

So there’s these very small steps you can take to move away from this very monolithic, everything-happens-at-once thing, to “Okay, now we’ve got a few things happening off of a single event happening.” And as you start to uncover more and more complexity, you might start to feel that the choice you made, the Laravel internal message broker using Redis doesn’t quite match your resilience use case, or your latency use case anymore. And at those points, you might need to look around again and start thinking about “Oh, maybe I need to look at RabbitMQ. Maybe I need to look at Kafka. Maybe I need to look at Pulsar”, which is another one that’s getting popular. I haven’t used it too much yet, but Pulsar seems to be another one that’s sort of rising in popularity.

It seems like you probably want to start using something like Kafka, or really any of these types of message brokers or things like that once you have a decoupling between the teams and their dependencies. So it’s like, you might be producing data that somebody who you don’t even interact with wants to consume, and so you want to set something up like this so that it’s possible for them to do without you having to be like “Oh, I actually have to like call your service, or hand you this data directly.” It’s like, you can just go get it from over there. And it seems like you probably want something like Kafka when you scale up to the level where you have a lot of that happening in your organization… Versus if it’s small, you’ll probably do that with just like a shared database, or with Redis, as you mentioned, or with maybe something internal to your monolith. But as you scale up, those options become untenable, especially when you want to deal with data migrations and things like that.

[00:10:28.24] Yeah, that’s a great way to think about it. And I think – you hit on a really good point there, which is one of the huge benefits… And this happens all the time at Cloudflare. We emit domain events - account created, things like that, is the obvious and easy one… And there’ll be teams listening to those, that – so you might own the service that emits the account created event, and there might be teams consuming it, that you actually have never interacted with. Because you’ve emitted something, it has a contract, they read it and they do whatever they need to do given that information. And they’ve built a powerful system off of an event that you emitted, and never had to communicate with your team… And that’s incredibly powerful as you start to scale your engineering team.

I think the episode you had mentioned that was on event-driven architecture, that was one or two weeks ago, or whatever it was, the one example they had on that one, that I thought was really good, was that webhooks are a way of thinking about this where a service can emit events of some sort that your application can consume, and they don’t really need to know what your application does with it… So if I was talking to a beginner who was like “What might I use it for?”, I think I’d point them to like “If you’re using an API that has webhooks and you’re using those, think about in your system; if you get to a point where that type of setup could be useful, where you want to be able to emit those events and people can do whatever they want with them.” Whatever type of information that is, that makes it really handy at that point. And I think it’s easier to understand at that point, because a lot of people end up using webhooks before they use Kafka… At least I think in my experience that’s what ends up happening.

I think that’s true, because I think there’s some really great products out there now that have made webhooks really accessible. I always remember the first interaction I had with webhooks was GitHub, and being able to – they had a really great API, where you could listen to commit messages being made, and stuff like that… And then I think it was – I’ll probably pronounce this horribly, but Zapier… It has made it so it’s like webhooks as a service; you can just click things together and build really powerful workflows without even having to really understand how webhooks work. So they’ve almost brought this event-driven model front and center, even if people don’t realize that’s what they’re interacting with.

So let’s say we get to the point where we’re ready to try using Kafka… Are there any, I guess barriers to entry, or things that people need to worry about when they’re trying to try it out?

Yeah, it’s a good question. To be honest, it depends why people want to use Kafka. I would a lot of times probably try and turn people away from it, honestly… Unless they’ve got a really good sort of vision in their head about why they’re using it, or it’s very clear their company is going to get to a scale where it makes sense. It does come with a lot of complexity that can be avoided by using some other tools.

I think this is trending downwards these days, as there’s more and more companies and managed services available that are making it even easier to get started with Kafka than ever before. I tweeted about one recently; I’ve been playing around with Memphis.dev, which is really cool. It’s providing abstractions over Kafka, that makes it possible to get started if you don’t have a bunch of experience or expertise.

But to answer the question directly around sort of what those barriers to entry are, is there’s some language and things you need to learn before you can even start using Kafka. For example, there’s a concept of a topic where you have to emit messages. And then topics are split down further into partitions. And there’s some real gotchas, like ordering is only guaranteed within a partition, not within a topic. And so you could start using Kafka, and if you didn’t know that, and you made a mistake, you could cause chaos within your application.
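To make the partition ordering gotcha concrete, here is a small Go sketch under stated assumptions. The `partitionFor` helper and the event data are hypothetical, and FNV-1a is used only for illustration; real Kafka clients choose partitions with their own hash functions and configuration. What it shows is the general idea: messages that share a key land in the same partition and keep their relative order there, while nothing orders messages across different partitions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a message key to one of n partitions.
// FNV-1a is an assumption for this sketch, not the hash Kafka clients use.
func partitionFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

func main() {
	const partitions = 4

	// Hypothetical events for two accounts, in the order they were produced.
	events := []struct{ Key, Body string }{
		{"account-1", "created"},
		{"account-2", "created"},
		{"account-1", "upgraded"},
		{"account-2", "deleted"},
	}

	// Each partition is its own ordered log.
	logs := make([][]string, partitions)
	for _, e := range events {
		p := partitionFor(e.Key, partitions)
		logs[p] = append(logs[p], e.Key+":"+e.Body)
	}

	// Within one partition the order of a key's events is preserved;
	// there is no ordering guarantee between partitions.
	for p, log := range logs {
		fmt.Printf("partition %d: %v\n", p, log)
	}
}
```

If the events for one account were spread across partitions (for example by producing without a key), a consumer could see "upgraded" before "created", which is the kind of chaos described above.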

So there’s this learning curve on both the person emitting the message side, the person consuming the message side, and also, actually running the infrastructure is pretty complicated, too. At Cloudflare we run it ourselves. I wouldn’t recommend doing that unless you’ve got like a pretty sizable engineering team. And for most people, the managed services available are probably the way to go.

In some cases, using Kafka is going to be a complete rearchitecture of your application. We’ve talked a little bit about the monolith to the event-driven architecture sort of style, and maybe data streaming, stuff like that… It’s not something you want to do unless you’ve identified that you need to do it. Because I think moving to event-driven architectures in some environments is like a sexy project, it’s a cool thing to do, but they are incredibly difficult to do, and to do well, and in a lot of cases you end up with the worst of both worlds. You have this half event-driven architecture and this half monolithic thing… And I’m seeing everyone smiling, because I think we’ve all worked on those systems. And you need the time to commit to do those things well.

I think there’s also a level of you need to have good communication systems within your company. If you have to directly integrate to get data from somebody, and they have to call your service or hand you some data, then there’s a sit-down meeting you have, and there’s a point of synchronization that occurs. But if you just have streams and feeds of data that you’re pulling off of from Kafka, a change you make as a team might have very wide effects on other people in the organization, like deprecating a field, or deprecating an event you emit, or changing how you emit an event. And if you don’t already have that communication structure set up to properly broadcast these things out to everybody, you can get yourself into a lot of trouble. I’ve definitely experienced that in the past. And you have to be prepared for what life with an append-only log essentially looks like, where it’s like your data migrations don’t look the same… Rewriting history is not really something that you can do, so if you want to get more than just the simple message bus, and you want to actually be able to use the history of things you have in Kafka, you have to actually start thinking about your data in different ways as well, and your organization and your engineers have to be prepared for that kind of reality.

That’s a really great thing to call out as well, because it’s something I think gets taken for granted now. But one thing that Kafka doesn’t give you out of the box is like strong schemas on your events. So the same way when you build an API, you build an API contract, and we all kind of know that you shouldn’t rename all the fields, and delete fields and stuff, because that can be consequential, and there’s patterns for doing that… Maybe you’ll version the endpoint, or if you use gRPC, you do some other things… Out of the box Kafka doesn’t do any of that for you, so you need to start thinking about how you do want to manage the schema of your messages. And there’s a few different ways to do this. There’s a really popular tool called Apache Avro, which is like a way to do some schema stuff… But at Cloudflare we use Protobuf. So all of our message schemas, we have a central repository that’s got a bunch of checks on it for breaking change detection, and stuff like that, for exactly the reason that Kris mentioned there; if someone tries to remove a field on a schema, that could cause havoc if we don’t handle it well… So we just straight up just reject those changes. If you try and make a pull request, and you’re going to make a breaking change to a schema, we just don’t allow that to happen.
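The breaking-change check described here can be sketched in a few lines of Go. This is not Cloudflare's tooling, which the episode does not describe in detail; `Field`, `Schema`, and `BreakingChanges` are hypothetical names, and the schema is reduced to a map from field number to name and type, loosely modelled on Protobuf field numbers. The sketch treats removed fields and changed types as breaking, and added fields as safe.

```go
package main

import (
	"fmt"
	"sort"
)

// Field is a simplified stand-in for a schema field (hypothetical type).
type Field struct {
	Name string
	Type string
}

// Schema maps a field number to its definition.
type Schema map[int]Field

// BreakingChanges lists changes from old to next that could break existing
// consumers: removed fields and changed types. Added fields are treated as safe.
func BreakingChanges(old, next Schema) []string {
	var problems []string
	for num, f := range old {
		nf, ok := next[num]
		switch {
		case !ok:
			problems = append(problems, fmt.Sprintf("field %d (%s) removed", num, f.Name))
		case nf.Type != f.Type:
			problems = append(problems, fmt.Sprintf("field %d (%s) changed type %s -> %s", num, f.Name, f.Type, nf.Type))
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(problems)
	return problems
}

func main() {
	v1 := Schema{
		1: {Name: "id", Type: "string"},
		2: {Name: "email", Type: "string"},
	}
	// v2 drops field 2 and adds field 3.
	v2 := Schema{
		1: {Name: "id", Type: "string"},
		3: {Name: "plan", Type: "string"},
	}
	for _, p := range BreakingChanges(v1, v2) {
		fmt.Println("rejected:", p) // prints "rejected: field 2 (email) removed"
	}
}
```

Run as a check on every pull request to a schema repository, a function like this would fail the build whenever the returned list is non-empty.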

[00:18:20.22] The slightly controversial decision we made, that I think has been a good one for Kafka adoption, is we made it so you can only emit one event type to one topic. So again, Kafka doesn’t stop you from doing this. You could have a user-created event, you could have account-created event, you could have account [unintelligible 00:18:37.23] they could all go to the same topic, and a service may read all of them and care about all of them. What we did is we said user-created events can only go to a topic called User Created Event. And I think we versioned them as well, and it’s called v1 and v2. If you try and emit any other type of message to that topic, we’ll reject it. And that was, again, for some of the reasons that Kris mentioned there, is people can expect what pops out on either side of every topic, and that makes it very, very easy for things to be predictable. And I must say, we haven’t had any real issues with schema, or sort of changes to messages since we made that enforcement… So I think that was a really useful and powerful thing we did to help drive adoption and to stop some of these footguns that can happen if you don’t do these things.
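The "one event type per topic" rule could be enforced in a thin producer wrapper. The Go sketch below is an assumption about how such a guard might look, not a description of Cloudflare's implementation: `Message`, `Producer`, `Emit`, and the topic name `user-created-v1` are all hypothetical, and the actual send is a plain function so that no particular Kafka client library is implied.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical envelope; Type names the event schema it carries.
type Message struct {
	Type string
	Body []byte
}

var (
	// ErrUnknownTopic is returned for topics with no registered event type.
	ErrUnknownTopic = errors.New("unknown topic")
	// ErrWrongType is returned when a message does not match its topic's type.
	ErrWrongType = errors.New("message type not allowed on topic")
)

// Producer wraps a send function and enforces one event type per topic.
type Producer struct {
	allowed map[string]string // topic -> the only message type it accepts
	send    func(topic string, m Message) error
}

// Emit rejects the message unless its type is the one registered for the topic.
func (p *Producer) Emit(topic string, m Message) error {
	want, ok := p.allowed[topic]
	if !ok {
		return fmt.Errorf("%w: %q", ErrUnknownTopic, topic)
	}
	if m.Type != want {
		return fmt.Errorf("%w: topic %q accepts %q, got %q", ErrWrongType, topic, want, m.Type)
	}
	return p.send(topic, m)
}

func main() {
	p := &Producer{
		allowed: map[string]string{"user-created-v1": "UserCreated"},
		// Stand-in for a real client call.
		send: func(topic string, m Message) error {
			fmt.Printf("sent %s to %s\n", m.Type, topic)
			return nil
		},
	}
	for _, m := range []Message{{Type: "UserCreated"}, {Type: "AccountCreated"}} {
		if err := p.Emit("user-created-v1", m); err != nil {
			fmt.Println("rejected:", err)
		}
	}
}
```

Because consumers can rely on every message in a topic having one shape, they need no type switch on the reading side.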

Another one I’d bring up as well is that the security model around Kafka is, I guess, a bit different than if you’re doing direct communication between services… Because it’s like, there’s something in this topic; where did that thing come from? And you have to do – either you have to lock down who can publish to that topic, to specific services, or have some sort of authentication on the data itself, which is like a sort of complex thing that you need to think about, that I think a lot of people just like – it doesn’t pop into their head when they think about this type of architecture. Like, “Oh, all of these events. How nice!” But it’s like, you don’t have that provenance of “Where did this event come from?” All you know is that it got in the queue, and it’s in the queue, or gotten in the log, it’s in the log, and you can process it. But that is a potential attack point for people that get into your network, and it also makes Kafka a very nice target for people that want to do nefarious things to your products, your infrastructure, all of that, which means you have to be especially careful around permissioning and how you integrate things with Kafka. Even just taking it for one rogue application, if anything can write to any topic, that could cause lots of problems. There’s a lot of those types of security vulnerabilities that you need to keep in mind when you start using - not even just this technology, not even Kafka specifically, but any type of technology like this.
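One way to get "authentication on the data itself" is for the producer to attach a message authentication code that consumers verify before acting. The Go sketch below uses HMAC-SHA256 from the standard library. The `Sign` and `Verify` helpers, the demo key, and the payload are hypothetical, and key distribution and rotation (the hard parts in practice) are out of scope. Broker-level access control on who may publish to a topic is the other option mentioned above and is not shown here.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Sign returns a hex-encoded HMAC-SHA256 tag over the payload.
func Sign(key, payload []byte) string {
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify reports whether tag is a valid signature of payload under key.
// hmac.Equal compares in constant time to avoid leaking timing information.
func Verify(key, payload []byte, tag string) bool {
	got, err := hex.DecodeString(tag)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	key := []byte("demo-key-for-one-producer") // hypothetical shared secret
	payload := []byte(`{"event":"account.created","id":"42"}`)

	// The producer would send the tag alongside the payload, e.g. in a header.
	tag := Sign(key, payload)

	fmt.Println("genuine:", Verify(key, payload, tag)) // prints "genuine: true"

	tampered := []byte(`{"event":"account.deleted","id":"42"}`)
	fmt.Println("tampered:", Verify(key, tampered, tag)) // prints "tampered: false"
}
```

A consumer that drops any message failing `Verify` can at least tell that the event came from a holder of the key, which restores some of the provenance that a shared log otherwise lacks.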

Yeah, it’s a really good point, because also, message brokers and Kafka are natural amplification vectors, aren’t they? If I emit one message, nine services might do something, or 10 services, or 15 services. So if you do want to cause chaos, emitting one message to a topic might lead to 15 services taking action, and so it’s a natural amplification that happens in the system… So getting that permissioning and that security model right is super-important. Because even teams that don’t mean to - say someone emits a message in good faith, or starts changing things, or emitting a bunch of test messages or something like that, it could cause chaos. And they didn’t mean to do that; they were maybe just playing around.

So having good protections around these things… Basically, all the things you think about when you think about building an API - they all apply here, just in a slightly different way, that you need to think about upfront. We’ve talked about versioning, we’ve talked about almost like having rate limits on your cluster, and sort of having permissioning to make sure you know who’s talking to you, and that you trust them is a really important thing to think about, that often can be things you don’t think about until the end, but they really do need to be thought about upfront here.

Yeah. I think that’s probably a good point of suggestion as well, is if you’re struggling with using APIs right now, and getting the API contracts, and the design of APIs and security model around APIs right, you probably shouldn’t move to one of these architectures, because you have to absolutely get it right at the front end. It’s much harder to fix these things later on, after you’ve already deployed a system and things are working. I have tried, it’s like very frustrating trying to fix things afterward.

[00:22:11.11] I think sometimes people run to these solutions as like “Oh, everything’s such a mess with our APIs. We just need something better”, and then some will come and be like “Event-driven architecture. Or message brokers. That will solve all of our problems.” It’s like, no, that will most likely make your problems even worse.

So if you’re having struggles with things like APIs, the fundamentals are what’s important. Make sure your schemas are good, make sure you have that communication system set up, make sure you have thought about security, and you do threat modeling, and all of that sort of stuff before you shift into this new world.

I think another part of that makes it challenging is that – I’m thinking, if somebody wants to learn web development, and they go learn Ruby on Rails, there’s a kind of clear set way of organizing code, and connecting to a database. There’s a lot of standardized ways to do things that help you; if you follow all these rules, your application should be relatively secure. But as you get into more complex technologies like Kafka, and event-driven architecture, it’s almost like there isn’t a single standard way of doing stuff. Every company sort of figured out the way that works well for them. Because when you start to get to that scale, I feel like it’s a lot more of it depends on what suits their needs, and their use case… So it’s a lot harder to teach somebody “Here are the ways you do it securely and correctly”, because there’s not one set way of doing it. And I think that makes it really challenging, because you can’t just clone what somebody else is doing and saying “Oh, Cloudflare does this, so it’s got to be secure”, because that setup might not work for their company.

Yeah, I think you’re 100% right on that. And I think the other piece is there is a barrier to entry to learning this stuff. I think it’s very easy – there’s tons of websites, and tools, and courses around how to build a basic API, or even to do some slightly more advanced things… But there’s very few sort of end-to-end courses on how to build an event-driven architecture from scratch, and how to use Kafka from scratch in a meaningful way, beyond just connecting to it and publishing Hello World and reading off the other side. And there’s a reason for that. To your point, Jon, there isn’t a right answer for a lot of these things. There’s good patterns and good practices that you could teach, but it would be incredibly hard to get that to a place where it was consumable without assuming a bunch of prerequisite knowledge of the person learning.

So I think there’s definitely an opportunity there for like a really good course to exist to teach people these things… But I think one thing that we’re very fortunate is, as time goes on, the barrier to entry for these things is getting lower and lower all the time. I think I already mentioned Memphis.dev, and there’s also Encore.dev. I think you’ve had them on Go Time before. They allow you to create sort of like serverless event-driven architectures very, very quickly and easily, and you can get started without like paying anything.

So I always encourage people to go and poke around on those, and build something; just see how it all works and how it clicks together. It’s one of the few times when - you know, as a learner, there’s kind of two schools of thought. If you’re trying to build a project that’s useful to you, you absolutely should not overcomplicate it. Keep it simple, make sure you ship something; shipping is really valuable, and it’s a skill to learn. Once you’ve got a few projects under your belt, and if you are learning for the sake of learning, you go ahead and make your project incredibly complicated for no reason. Build a crazy event-driven architecture that gets deployed to Kubernetes, just to see how it all clicks together and how it works. There’s so many ways you can do that for free, or very, very cheap these days, that it’s a really great time to start trying to play around in this space. Whereas even four, five, six years ago, when I was learning this stuff, you had to sign up to AWS, and you had to put your credit card details in, and maybe you had to spend $50 to learn it, and that’s just not accessible to everybody, right? So it is getting better.

I think that’s an area too where it’s like we as an industry have really encouraged people to get started fast, and “Here’s all the easy on-ramps.” But I think we do a bad job at reinforcing that you do have to go back and relearn, or learn all of those underlying things.

[00:25:57.08] So if you do start with something that does package everything together, like it sounds like what Encore does, or what Memphis.dev does, or any of these other things, where it’s like “Hey, it gives you a nice little packaged thing”, you still need to go back and learn all of the stuff that they’re doing for you. Because if you don’t, and their product shifts, then you’re screwed if they shift in a way that doesn’t meet what you need.

I’ve run into this a whole bunch of times, where people want to use some technology that is popular - protocol buffers are a great example of this - without really understanding the complexities of actually managing them, and pairing them with things like microservices… I literally had a job where we had this nightmare situation, this nightmare confluence of Go modules and protocol buffers, and like dozens of repos, and it just never fit together quite well… And even though each of those decisions individually made a lot of sense, when you put them together it just didn’t fit well. And a lot of that was because people had figured out that these technologies were good, and they were important, and they would create a solution for us, but they hadn’t actually learned all of the intricacies of how they work under the hood, and the mechanics and the communication I was talking about earlier. And that just led to disaster, and led to “Oh, we might need to get rid of this whole platform and try rebuilding it again.” So as nice as it feels to get started really quickly with these things, it’s super-important to go back and learn the underlying things, so you can actually understand how all of it is working.

I think there’s a lot to be said about that mindset of “It’s okay to want to learn overall how something works very quickly.” However you’re learning, I think it’s good to want to sort of get an overall picture of how everything works, so that later when you go back through and learn details, you sort of understand how everything interacts with each other, and then those details make a lot more sense. But I agree that if you just learn the overall picture, and then don’t ever understand the details, that all of a sudden you have that problem of – you’re kind of, like you said, locked into something. And I think you came from Ruby on Rails too, Kris, and I think that was something that happened a lot with people there, was that they’d kind of learn roughly how Ruby on Rails worked, but they wouldn’t really understand a lot of the behind the scenes things, and then the minute you needed to do something slightly custom, it was a weird situation where some people understood it, but it was also fighting with a framework to make it happen. And I think that’s why a lot of people liked moving to Go, is that I felt like you got a lot more control over that stuff… But the downside was you really had to understand how all of that stuff sort of interacted with one another. And I don’t think it’s any different when you get to more complex scenarios, like using Encore, or Kafka, or anything else. You kind of have to have a rough idea of how everything works, I guess, like you said, at the overall level, and make sure you understand the details that you need to know.

Yeah. I actually came from the wonderful world of PHP, and Drupal… But it was a similar thing, where it’s like Drupal is this weird piece of technology that can do anything that you want it to do, but I remember the feeling, when I did leave that world and was excited about Go, it was a lot of just like “Oh, how do you even build a very simple web application with Go?” And it’s like all of these things that I’m so used to having are just not here, and I have to go learn all of those fundamental things, like “This is how a templating engine works, and this is how an HTTP server works, and this is how a muxer works”, and this is how, you know, all of these individual things assemble together into what I was used to using… Which I think is a thing that happens with a lot of people.

And I’d also say that I think it’s easy in our minds when you’re learning one thing to be like “Oh, I’ll just learn this thing quickly, and then I’ll come back around and learn the underlying stuff.” But there’s always something new to learn. There’s always something new that you’re going to want to add. So if in your mind you don’t know – like, you can’t say when you’re actually going to go back and learn the fundamentals, you should probably take the hard road from the beginning… Because it’s very difficult to make the time to go back and learn if you’re the type of person who’s just going to keep going, and just keep learning all of the surface-level stuff… Because it feels like you’re learning a lot, it feels like you’re doing a lot, but you’re not. It’s like that kind of empty calories sort of thing, where it’s like “Oh, I know how these technologies work”, and then you actually have someone ask you questions about them and you’re like “Oh, I have no idea how these technologies work.” You hit a problem, and you’re like “Look at this giant trail of things that I now have to figure out how to actually fix.”

[00:30:24.08] And that’s a situation that I’ve personally wound up in a few too many times for my own liking, but I’ve seen many organizations and companies wind up in, of just like “How did we even get here?” It’s like “Oh, no one stopped to really push back and say ‘No, no, we actually have to learn how everything works.’” And it’s much more painful to do that later than it is to do that right now; just to say “Hey, guys, we’ve got to slow down just a little bit. Just to simplify our architecture, just to learn how these technologies work, so we don’t get into that bad situation.”

It’s tangential, but there’s a really popular book that I’m sure you all have heard of called “Learn Kubernetes the hard way”, that Kelsey Hightower wrote… And it’s the perfect framing for exactly what you’re describing, Kris. It’s a book that teaches you a bunch, but you can be practical as you go through it. Like, you’re learning a bunch about how to set up Kubernetes and how it runs, but you do it alongside practically setting it up. And by the title of the book, it is very, very difficult. It took me a couple of weeks to go through it and to try and get with it. And I still would not consider myself a Kubernetes expert, but previously to that I’d done exactly what you described; I followed a course or something, and I spun up a cluster really quickly, and [unintelligible 00:31:30.26] working, and I thought I understood it, and it wasn’t until I did that piece that I was like “Actually, I understand this way more than I did before that.” I haven’t produced as much in terms of output, but… I’ve basically just got to the start of having a cluster available, but now I understand what went into doing that. And it’s six years on now I think since I did that, and I’m still learning every day about Kubernetes… So it’s definitely a beast, but doing that the hard way approach is a really nice thing, and I’d definitely love to see more books and courses in that realm, like “Learn Kafka the hard way”, or “Learn Go the hard way”, perhaps. I don’t know.

I feel like that’s a natural shortcoming when it comes to learning tech for a lot of people… And it’s almost introduced right when people start learning how to program, because you see so many people who go try four different languages, instead of just sticking with one language, despite it feeling hard or challenging. And I think part of that is that people think “Oh, this language isn’t for me”, or “I’m not smart enough”, or whatever it is, but in reality, it’s just that you need to stick it out and put more effort in, and it just takes a little bit of time and practice to really learn that deeply enough to understand it well enough to do the more advanced things.

It probably sounds like a really stupid thing to say, but one thing I didn’t realize at the start was that pretty much any language can be used for any of these things. So when I was trying to learn and figure out which language I wanted to learn, right at the start of my career just coming out of college, I was like “Well, I know Java can do this, because I did that at school. But what about JavaScript? I hear that’s used on the web.” And it wasn’t until maybe a year or so into my career I realized “Whoa, you can use that on the backend, you can use it for all these things.” It took me a lot longer to realize actually all these languages can be used for pretty much anything, and if I stick with one and just become great at it, I’ll be able to do pretty much anything I want to do. I wonder how many other people don’t get that realization until maybe a little bit later on, that sticking with one language probably would have yielded the outcome they wanted if they just stuck with it.

I feel like that’s the kind of underlying current for a lot of the discussions we have. Even this one, we’re talking about Kafka and event-driven systems - it’s like, pretty much anything you can do with an event-driven system you can do with a monolith, for most things that most engineers are going to write. Like, the decision between a monolith and microservices, or a different service-oriented architecture is really one about the non-tech stuff, about how do you communicate as an organization, how do you want to set things up? How do you want to scale? And I think that a lot of the rhetoric we have winds up being “Okay, monoliths are completely unscalable, and you can’t make them work at all.” And then you see these giant companies, that have giant monoliths in them, and everything’s working fine. Or you see – there’s a whole idea that mainframes are kind of this old, dead technology. And it’s like “No, no, mainframes power the entire world.” Like, if mainframes disappeared, everything would crumble. You wouldn’t be able to buy anything, you wouldn’t be able to travel… There’s so many things that would not work without that technology.

[00:34:20.21] And we just look at it as like “Oh, that’s old. That’s bad. That’s not the good way of doing things.” And I think things like monoliths come with that… Or even programming languages; people are like “Well, that’s an old language. You don’t need to bother with that anymore.” And I think that’s one of the things that we have to figure out how to fix as an industry… Because we are extremely hype-driven, and I feel like Kafka winds up being at the center, or was for a while at the center of a lot of hype, and I feel like it still is, of just like “No, no, sprinkle a little Kafka on whatever you’re doing, and it’ll fix your problems.” [laughter] And it’s just like, “No, no, sprinkle a little bit and you’ll have weeds; you’ll have lots of problems you need to fix.”

So yeah, I think we as an industry just need to really focus on just getting back into that – you can use any of these things to do whatever you want. What matters is how deeply you understand them, and how deeply the people around you understand them. And what programming language you should use is about you, but it’s also about who’s around you. Do you want to work at companies that write that programming language? If you want to work on cloud-native software, and you don’t write Go - well, you’re probably not going to be able to go work at a company that writes cloud-native software, because that’s all written in Go. Whereas if you want to work on firmware, Go is probably not the best language for you to go work on firmware. There’s some companies that are doing it, but it’s not the most prevalent thing.

Yeah. I’d like to give a shout-out to incident.io here, because their engineering blog is excellent. And they’re kind of doing some of this. They’re fighting the battle, and just like making clear there is a pathway you can build a scalable, important company without using microservices. And they wrote this particularly great blog post, I think it was called “Split the workload, not the monolith”, I think it was called. I can share a link to it. Basically, it’s about exactly what you described. It’s like, we’re building an incident management platform. Uptime is incredibly important to us, but we’re actually not going to do microservices; we’re going to do it as a big monolith, and split it down in different ways. It was all written in Go, and I think they received some pushback on Hacker News when it was shared, but… If you just read it, take the hype out of it and read it, it’s very sensible. And I really like that they’re putting content out there that goes against, as you say, the hype scale of it, and offers an alternative… Because I think they’re the best tech blogs; they just offer a slightly different perspective.

It kind of makes me wonder if we’ll ever get to a point where people aren’t chasing the next programming language… Because all of us are happy using Go, for all sorts of things… But I’ve already now – and Go is not even that old, but I’ve had people tell me, who are prominent in the Go community, that they feel like Go is dying, and that some other language is the new thing. And I think they told me that just - for various reasons, but I’ve heard it, and it’s like weird to hear, because I’m like “Go is not that old.” And the amount of things that PHP and Python are still powering is outright – and people still build stuff in those languages. So I’m like “Why would Go just randomly die all of a sudden?”

But I think part of that is that hype thing, where depending on the industry you’re in – as an example, if you create courses like I do, or if you do corporate training, or if you do anything along those lines, where you’re helping companies come up to speed with something new, I feel like you’re almost enticed and encouraged to follow that hype train and help produce more hype about something new, even if you know it’s not better for the people that you’re actually trying to help.

Yeah. I think this really does come back to us as an industry being so extremely hype-focused. I mean, it’s literally everywhere. We’re going through the AI hypecycle right now, where everyone’s like “AI is gonna change how everything we do works.” And it’s like, yeah, there’ll be change, but it’s not going to revolutionize things in that way. I think microservices was the same way, and event-driven architecture was the same way, probably agile was the same way back 20 years ago, where it’s just kind of like “Oh, this will be the one thing that fixes our problems.” But I think at the end of the day, our problem is that we’re very bad at teaching people deep things as an industry.

[00:38:19.27] My own journey, trying to just learn about distributed systems so I can teach other people about distributed systems has been extremely challenging, because so much of the literature is very old; so that’s an immediate bias that we have right now, because I think a lot of people are like “Oh, this book was published in 2015. It’s too irrelevant. We shouldn’t be reading it.” And some of the stuff I’m reading is from like 1980. But I think it’s also that it’s sometimes tough to actually really dig down and understand the reasoning behind why you want to learn something, or what relevance it really has.

I think the underlying problem is not that most people are doing this. I think it’s fine if most people are chasing the hype cycle. But where I think the problem comes in is when the people that want to create the next foundational technology only understand surface-level things. And I think that happens a lot. An example I’m dealing with around this right now was like with REST, where I’m actually trying to like learn REST, and I’m reading Roy Fielding’s dissertation, and it’s like a) infuriating, because the things he mentioned, I’m like “Hey, you could have written this yesterday, and it’d be equally applicable to our entire state of the world.” So that feels kind of bad, since it was written like a quarter century ago at this point… But also, just how brilliant of a design the system that underlies the web actually is… And then you contrast that with how quickly people want to throw out that design now, because they have particular gripes with it, not understanding that that gripe is actually a feature; it’s not a bug, it’s a feature of the system, and it enables all of this other stuff. And I think that those small features, that small bit of nuance often gets papered over when people want to just be part of the hype cycle, or it just kind of gets thrown out… And so people are like “Oh, this thing, we need to get rid of this problem.” So the next set of people that are building things are just like “Oh, okay, well, that’s the problem we have to solve”, even though that could be an unsolvable problem, or not a problem that we should be focusing on.

And I guess one of the reasons I brought this up is because when I look at Kafka, or these event-driven architecture things, I see what could have been with the web, or with REST in general, like this could have been built as a RESTful system, and how powerful that would have been to have these systems within our organizations… But we just don’t have that. And the reason we don’t have that is because people don’t understand what REST is, or what these older technologies are. And learning them requires you to go back and read a 160-page long dissertation from 2000, which is not what the majority of people in our industry are going to do. So I think that’s the point of contention that we have at the moment, is that we have to figure out how do we actually teach people these deep things in a way that doesn’t feel like you’re reading a textbook.

You don’t want to break that dissertation up into 10 tweets? [laughter]

You know, people have tried…

People know this, because they get on Twitter, and you lose all sorts of nuance. So they know that, but at the same time it is hard to sit down and read, do deep thinking and think about something really complex and read about it.

Yeah. And part of it too is that – I think it’s like when generations shift. Because if you started learning programming in the early 2010s, there’s a set of technologies that you learned, and you’re like “Okay, these technologies make sense to me.” That was the era of Ajax, and “Oh, all of this stuff, and SOAP, and XML, and all of this stuff is super-bad, and gross, and we shouldn’t have it…” But because everybody says all of that stuff, you don’t go and learn it. So you don’t really know what that thing was. You just know that that’s the bad thing. And then you wind up, ironically, reinventing that, which I think is what we’ve done at this point.

[00:42:15.25] If you look at what Swagger/Open API has become and what these things are, it’s like “Oh, this looks remarkably like that thing that we built before.” So you’ve now built the bad thing that you were trying to avoid having. And I think that’s like the same thing that’s happened with like monoliths and microservices, where people are like “Monolith is bad, microservice - good. We should have event architectures.” And then you go look at it, and you’re like “Actually, you just rebuilt that same bad thing, again.”

So you’ve got to go learn the things… And I think maybe that’s the thing I think we should do, is like go learn the stuff that you don’t like, or you think is bad, and understand why it’s bad, or why it’s that way, so you can make informed decisions moving forward.

I remember I sat down to try to learn GraphQL at some point, and I tried reading a book, and it was trying to go over the history of REST APIs, and why REST APIs don’t work… And it was just wrong. The history was just completely inaccurate. And I’m like “How are you at the level of writing a book about this thing if you don’t even understand the history of the other things that came before it?”

I think some of those, like you said, they are definitely challenging if you don’t understand the history. Because even like the – the GraphQL one makes me think… I went through a very similar experience, and I remember reading a bunch of stuff. And the whole time, all I thought was “I’ve interacted with APIs that are REST APIs, that solve all of these problems that GraphQL is there to solve.” And it’s not that it’s impossible, it’s just that yes, a lot of APIs implement this poorly; you’re not wrong there. But I think throwing another technology on top of that and expecting it to magically solve a problem is, like we’ve said with Kafka and everything, you can’t just throw some technology on top of it and be like “Voila! This fixes the problem.” And I think people saw that with GraphQL, where if you have arbitrary queries going into your database, that’s not necessarily a good thing. So you have to find ways to limit things. And then eventually, you get back to the point where you’re very close to the REST API system you had before. So I don’t know if that’s true with Kafka, necessarily… Like, if you can do the same things, I guess.

With Kafka, one of the things I remember from the early days of when it was being developed is the idea was like “Oh, you’ve taken the replication log out of your database, and you’ve exposed it out to the rest of the world.” Which was a really interesting and novel idea. I think, Matt, you alluded to this earlier, using Kafka as a database. And I feel like for a lot of things that got kind of pushed under, it got pushed away. It’s like “No, we’re gonna use Kafka instead as this message broker.” Which I think it’s an okay-ish, a decent message broker, but I think it’s much less interesting of a technology from a historical perspective as a message broker, and I think it’s much more interesting as what it is, a distributed log, and that part of your database that you’ve pulled out and made accessible for everybody to use, and what really that could mean for how we manage data as a whole, and what our databases in the future look like. But I feel like that all got lost under the wash of “Oh, this is better than RabbitMQ” for reasons that I still don’t necessarily agree with… But I think that was like a big push of the hype, was “Oh, instead of using RabbitMQ, or instead of using Redis, you can use Kafka, and it’ll give you these nice properties. And you should just replace RabbitMQ with Kafka.” And I think that’s, to a large degree, what people have done with it. But not understanding that history of it, I think we lost an even more interesting thing we could have done with Kafka, we could have built around Kafka. And still can. It’s not like that ship has sailed.

Yeah. And thinking of it as a log enables some really powerful things that may be a benefit of why you do want to consider it. So one of the really nice things about it being an append-only log is you’ve got history, you can go back in time and see what order events happened within your system, and you can replay them, too. And you can do that without having to change anything about your infrastructure.

[00:46:12.14] So what Kafka enables you to do is for what we call consumers, things that read from Kafka, you can pick a partition within a topic to read from. And usually, you’ll just say the latest, or maybe you’ll say the oldest, but you might pick a specific point in time as well. And when you do that, you can replay messages, and you can basically replay history. So say you did have a service that needs to know what happened since the beginning of time. I’ll put a big asterisk there and talk about it in a minute, but let’s say from the beginning of time. Then you could potentially spin up a new service and replay every message that ever happened in your company, and it would get up to date immediately with everything that happened. And when you start thinking like that, that’s awesome. That’s really powerful. I think that there’s probably tons of other message queues that offer you that sort of durability that you could use, but Kafka really brought that to the forefront and made it a feature, which I really, really like.
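To make the replay idea concrete, here is a minimal sketch of an append-only log in which a reader chooses its starting offset. It is a toy, in-memory stand-in for a single partition, not Kafka or any Kafka client; the `Log` type and its `Append`, `ReadFrom`, `Oldest` and `Newest` methods are names invented for this illustration.

```go
package main

import "fmt"

// Log is a toy, in-memory stand-in for one partition: an append-only
// slice in which a message's offset is simply its index.
// (Hypothetical type for illustration; this is not a Kafka client.)
type Log struct {
	messages []string
}

// Append adds a message to the end of the log and returns its offset.
func (l *Log) Append(msg string) int {
	l.messages = append(l.messages, msg)
	return len(l.messages) - 1
}

// ReadFrom returns a copy of every message at or after the given offset.
// Offsets outside the log are clamped rather than treated as errors.
func (l *Log) ReadFrom(offset int) []string {
	if offset < 0 {
		offset = 0
	}
	if offset > len(l.messages) {
		offset = len(l.messages)
	}
	out := make([]string, len(l.messages)-offset)
	copy(out, l.messages[offset:])
	return out
}

// Oldest and Newest are the two usual starting points for a reader.
func (l *Log) Oldest() int { return 0 }
func (l *Log) Newest() int { return len(l.messages) }

func main() {
	events := &Log{}
	events.Append("user-created")
	events.Append("user-renamed")
	events.Append("user-deleted")

	// A brand-new service replays history from the oldest offset.
	for offset, msg := range events.ReadFrom(events.Oldest()) {
		fmt.Println(offset, msg)
	}

	// A reader that starts at "newest" only sees what arrives afterwards.
	fmt.Println("messages visible from newest:", len(events.ReadFrom(events.Newest())))
}
```

Nothing is ever rewritten in place, so a new reader can start from any earlier offset and see the same events in the same order.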

I just want to loop back very quickly on the asterisk side of the… The reason I added that is Kafka does have retention, which is something actually catches [unintelligible 00:47:07.10] When I went back to sort of languages you need to learn and things you need to think about before you adopt Kafka, is technically Kafka can retain all of the messages on it forever. But that obviously comes with a bunch of cost and storage. So what tends to happen is you keep a very small amount of data retained in Kafka. And I think in Cloudflare it’s somewhere between like three and seven days, up to two weeks, depending on the topic. Then after that, you wipe history and have to start again. So it means that – say you are having an outage or something like that, and you can’t consume events for a period of time, and the amount of unprocessed events is going up… Depending on what you’ve set your retention time to be, you’ve got a finite amount of time to fix your system and be able to work through all those events. So you’ve got to always make sure that the system reading from Kafka can read faster than the rate of events entering Kafka, if that makes sense.

So if you’ve got messages entering Kafka at one message a second, and you can read at one message a second, what happens is that all of a sudden, when you’ve got a 1000-message backlog and messages still coming in at one message a second, you’ve basically got no levers to pull. You’re reading through them very, very slowly. Kafka doesn’t support – you can’t scale horizontally with Kafka without foresight; you can’t just do it on the fly, you have to think about it ahead of time. So you kind of enter in a situation where you can see an incident, and you can’t necessarily fix it very, very quickly… And so that can be quite scary. And you know that if you don’t fix it within three hours, three days, whatever your retention is, you’re actually going to lose all those messages. So… It’s a huge benefit of it, but it’s also, you can see it as a ticking time bomb in front of you if you can’t fix it quickly.

And I think too that’s where perhaps we need more things that come after Kafka to deal with some of these trade-offs. As someone that studies distributed systems, I’ve studied a lot of CRDTs, and the interesting aspects of how they work, but also the ability to compact them down small enough to make a feasible, retained forever log of things… But it’s gonna be tough to retrofit that into something like Kafka. So I think we shouldn’t necessarily just like stay with the technologies we have either. We should try and advance them, but for specific reasons. Kafka sounds great for the vast majority of things, and if you can solve that problem where you make sure all of your consumers are operating at some multiple of your producers, then you’re good. But if that’s a thing where maybe you’re not going to be able to do that, maybe we need another type of technology that will allow us to have slower consumers, that have a way of catching up in the future, can do sharding more easily, or partitioning more easily than having to think about that beforehand, as you mentioned.

You mentioned – you used an acronym there, CRDT. Do you mind talking us through that?

[00:49:50.01] Ah, yes. So CRDT stands for Conflict-free Replicated Data Type, and they are essentially ways of designing your data such that when you want to converge – okay, let me wind back a little bit. So CRDTs exist in the world of eventually consistent systems, or eventual convergence, where you might have multiple processes creating data, that can’t necessarily talk to each other immediately, so that data cannot be converged immediately. So you might have different states of your system at the same time. And what CRDTs do, or what they kind of constrain you to do is design your data such that no matter how your data comes together when you merge it, you will always be able to merge it without having any conflicts that you need to manually resolve.

A very simple example of it is what we call an increment-only counter, where it’s like “Oh, I have a counter and I want to count things across multiple different partitions of my servers.” You give each server an ID, then it has a value associated with it, and then all of the servers exchange IDs all the time, and you always take the maximum ID from every other server, and then you add all that together, and now you have the total value of the counter. And since each server has its own ID, and it only increments its own ID, that means that even if at a particular time you don’t have the values from the server, you have values from everybody else, once you get the values from that other server, you can just merge it in, and then you’ll have the updated value. So there’s naturally no conflicts in this system, unlike if you just had like a single counter, you’re trying to increment across machines. So in some ways it’s an alternative to things like Raft and Paxos, which are consensus algorithms that sit at the heart of things like Kubernetes, that ensure that you’re always making lockstep decisions that won’t be rolled back in any way. So it makes sure you don’t wind up with conflict at all. So it’s not really a CRDT, but in a way, it’s another way of resolving conflicts.
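The increment-only counter described here can be sketched as follows. This reads the description as "keep the largest count seen for each server ID, then add the counts together", which is the usual grow-only counter design; the `GCounter` type and its method names are chosen for this illustration.

```go
package main

import "fmt"

// GCounter is a grow-only counter: one slot per server ID, and each
// server only ever increments its own slot.
type GCounter struct {
	id     string
	counts map[string]uint64
}

// NewGCounter creates an empty counter owned by the given server ID.
func NewGCounter(id string) *GCounter {
	return &GCounter{id: id, counts: map[string]uint64{}}
}

// Increment bumps this server's own slot and nothing else.
func (c *GCounter) Increment() { c.counts[c.id]++ }

// Merge folds in another replica's state by keeping the larger count
// for every server ID. Merging in any order, or more than once, gives
// the same result, which is why no conflict can arise.
func (c *GCounter) Merge(other *GCounter) {
	for id, n := range other.counts {
		if n > c.counts[id] {
			c.counts[id] = n
		}
	}
}

// Value is the sum of every server's slot.
func (c *GCounter) Value() uint64 {
	var total uint64
	for _, n := range c.counts {
		total += n
	}
	return total
}

func main() {
	a, b := NewGCounter("server-a"), NewGCounter("server-b")
	a.Increment()
	a.Increment()
	b.Increment()

	// Before exchanging state, each replica only knows its own count.
	fmt.Println(a.Value(), b.Value())

	// After exchanging state in both directions the replicas agree.
	a.Merge(b)
	b.Merge(a)
	fmt.Println(a.Value(), b.Value())
}
```

Because a slot only ever grows and only its owner writes to it, taking the maximum per slot can never discard an increment.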

The world of CRDTs is much broader than I think people think it is. At a fundamental level, it’s just like a constraint on things, saying “Yeah, no matter how you merge your data together it must always merge, and it must always resolve.” That could be [unintelligible 00:52:10.17] so you just say “I don’t know the last thing I have; for whatever the definition of last is, that’s the value of the thing”, or much more advanced things that you can do. But there’s a whole bunch of interesting research around this by people like Martin Kleppmann, who wrote a fantastic book called Designing Data-Intensive Applications, and he’s done things like make a Google Docs-like editor, where he wrote an entire research paper with a colleague, and they saved every single keystroke, every single cursor adjustment, all of it, the entire history for this document, and it’s only about I think like one and a quarter times the size of what just the plain text is.

So you can save all of this historical data in an extremely compact space if you take the time to design everything. They obviously have like a custom format they’re using to store all of that, and a whole bunch of other stuff to get it down to that compression ratio. But it kind of yields us to a world where you no longer have to get rid of any of your data. So you no longer have to necessarily say “Oh, well, my Kafka topics are getting too large, so I’ll have to drop some data off eventually.” It’s like, well, maybe you can just retain all this data forever, depending on what the type of data it is. And maybe you do get the benefits of having that long history that you can replay for your company or organization.

That’s really interesting. I’ll have to take a look into that. Thank you for sharing. I think something you mentioned though I just want to pick up on is that book, Designing Data-Intensive Applications - especially if you’re interested in event-driven architecture, Kafka and stuff like that, it’s a must read. It’s an incredible book. I highly recommend it.

Yeah, Martin Kleppmann - that book is absolutely phenomenal. It’s a nice O’Reilly book. I think it has a hog on the front of it, or a swine of some sort.

I’ll make sure we get a note.

[00:54:00.28] It is a long read. It’s like 400 pages, so just get ready to read a bit. But yeah, if you’re interested in this sort of stuff, if you’re listening to this episode and you haven’t read that book, you should go read it.

It’s a book you need to study. I sat at my desk with a highlighter and a notebook, and kind of – I only read a few pages every day, and it was a slog. It took me over a month to get through it, but honestly, it was an incredible read. It’s not something you’d read in bed, I don’t think, unless you’re the type of person who can retain incredible amounts of information, but… To sit and study - like, there aren’t many better books out there, I don’t think.

Yeah. Especially if you want to design a system that’s gonna use something like Kafka it’s really good, because it goes over things like Protobuf versus Avro, versus other formats that are out there that are going to be integral to you using Kafka, and all of the other stuff you have to think about.

So bringing this back to Kafka in Go, since this is a Go podcast… Matt, can you share your experience with Kafka in Go specifically? Has there been anything that stands out as either good, or bad, or anything like that?

So I think [unintelligible 00:55:00.06] good or bad, but I guess just kind of a brief story of how we’ve been using it at Cloudflare, which has been interesting… So firstly, I did mention that we’ve been using Protobuf for our schemas. The support for Protobuf and gRPC in Go is excellent. It’s first-class. So that was a good fit and a good choice, and I would make that choice again. So for schema management, Protobuf’s definitely worth looking at, especially if you are a predominantly Go place.
Something else we did is we created a Go library that we call Message Bus. Very creative. And effectively, what we do in this is we use – it used to be run by Shopify; it’s a Go library called Sarama. Sarama is a Go library that was created by Shopify, that basically allows you to do pretty much everything with Kafka. And I actually credit that library with an awful lot of the adoption of Kafka, both at Cloudflare and elsewhere where Go was involved, because it enabled you to basically do everything that the Java libraries were doing.

One really hard thing about Kafka, as we talked about, is configuring it. So one good choice that we made as a company, I think, is we made it an opinionated library, that kind of set up a very good set of default settings and constraints for how we think you should interact with Kafka at Cloudflare, and made it as easy as possible for you to do it. It’s got a bunch of power user settings, if you will, where you can override what we deem to be the best settings, but that was a pretty good choice, I think. And we added a bunch of Prometheus metrics within that library as well, so it means that everybody who pulls in our library gets this dashboard for free of how their Kafka service is performing… Which was very, very helpful, and again, is another thing I’d recommend doing. It’s not Go-specific, you can do it in any language, but we were able to do it with Sarama.

Slight tangent, but Sarama actually got picked up by IBM. So IBM is now responsible for the maintenance of Sarama, because it turns out that Shopify aren’t using it too much anymore. So IBM have taken over stewardship of it. So that was a really cool thing to do… I haven’t checked in on the project in a while, to see how it’s progressing, but it was excellent that they put their hand up to carry on stewardship.

And then the final thing that we have been using Go in Kafka for is we built this thing called – we call them connectors, and they’re built on… There’s a framework called Kafka Connectors, which effectively allows you to plug some code into your database, [unintelligible 00:57:09.24] into Kafka, and then it just like moves the data between the two. So when people are trying to take things out of a database and push it to Kafka, Connectors are a pretty common way to do that.

We built our own framework that we also call Connectors; it’s all written in Go, and effectively with a very small configuration file you write in YAML, we allow you to specify a reader, some transformations to apply, and then a writer. And so what this means is teams can deploy very simple code that reads from a database, applies some transformation to a Protobuf format, and writes it to a Kafka topic. They can do it without actually writing any code; they just create some environment variables and deploy it. And same thing - we’ve got Prometheus metrics, you get a dashboard for free, and you get some alerts around it for free, and stuff.
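The reader, transformations, writer shape described here can be sketched as three small Go abstractions. This is a minimal sketch, assuming in-memory stand-ins for the database and the Kafka topic; `Record`, `Reader`, `Transform`, `Writer`, `Run`, and the helper types are hypothetical names, not the actual Connectors framework.

```go
package main

import (
	"fmt"
	"strings"
)

// Record is a hypothetical row. A real connector would more likely carry
// a Protobuf message than a map of strings.
type Record map[string]string

// Reader produces records from a source such as a database.
type Reader interface {
	Read() ([]Record, error)
}

// Transform turns one record into another.
type Transform func(Record) (Record, error)

// Writer delivers a record to a sink such as a Kafka topic.
type Writer interface {
	Write(Record) error
}

// Run reads every record, applies the transforms in order, and writes the result.
func Run(r Reader, transforms []Transform, w Writer) error {
	records, err := r.Read()
	if err != nil {
		return fmt.Errorf("read: %w", err)
	}
	for _, rec := range records {
		for _, t := range transforms {
			rec, err = t(rec)
			if err != nil {
				return fmt.Errorf("transform: %w", err)
			}
		}
		if err := w.Write(rec); err != nil {
			return fmt.Errorf("write: %w", err)
		}
	}
	return nil
}

// sliceReader stands in for a database reader.
type sliceReader []Record

func (s sliceReader) Read() ([]Record, error) { return s, nil }

// memoryWriter stands in for a Kafka writer and keeps what it was given.
type memoryWriter struct{ Written []Record }

func (m *memoryWriter) Write(r Record) error {
	m.Written = append(m.Written, r)
	return nil
}

// upperCase returns a transform that upper-cases one field on a copy of the record.
func upperCase(field string) Transform {
	return func(r Record) (Record, error) {
		out := Record{}
		for k, v := range r {
			out[k] = v
		}
		out[field] = strings.ToUpper(out[field])
		return out, nil
	}
}

func main() {
	src := sliceReader{{"id": "1", "country": "gb"}, {"id": "2", "country": "us"}}
	sink := &memoryWriter{}
	if err := Run(src, []Transform{upperCase("country")}, sink); err != nil {
		fmt.Println("pipeline failed:", err)
		return
	}
	fmt.Println(sink.Written)
}
```

In a configuration-driven version, the reader, the list of transforms, and the writer would be chosen from a config file or environment variables rather than wired up in `main`.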

[00:57:49.17] So all of these things have really helped with Kafka adoption. And I think if you’ve got the resource to deploy Kafka at your company, I would really consider having a team like mine, a platform team that provides tools and services that makes it easy for other teams to do the right thing, and to teach them, too. I think a huge part of our team’s job is just teaching as well, and just making sure people are following the right patterns when using some of these things. And it can help overcome some of these barriers to entry, but obviously, it’s a large cost investment.

One of the reasons Cloudflare picked Go in the first place, and we continue to use it, is it just scales so well. We’ve had a couple of issues with Kafka consumers not being able to keep up with the amount of messages that are being passed through, but after some small tweaks that you would have to make in any language, we’ve been very easily able to scale a bunch of our services to tens of thousands of messages being read a second, without too much heartache. We haven’t had to do anything clever, we haven’t had to write any sort of crazy code to do so. It’s just kind of standard optimizations that – you know, a linter will probably help you with most of them, if I’m honest. So Go has been fantastic for that… And even people who join Cloudflare who haven’t got experience with Kafka and haven’t got experience with Go, we’re usually able to get them productive and writing decent Go in a week or two just because 1) how easy Go is to learn, 2) how easy it is to read, which I think is really important and often overlooked with Go. Being able to read it… I can pick up pretty much any Go service and I can follow it through, and I can roughly figure out what it’s trying to do. I can’t promise you the same thing for Java or PHP, where there’s lots of like auto-wiring and magic, and you have to understand the framework a little bit.

So that’s been really, really powerful and useful for us in terms of adoption, too. And generally, the performance of the Go app. We are a cloud, so we pay a lot of attention to resource utilization, and containerized Go services… I think they are so tiny in comparison to some of the other things that we have to run. And even quite complicated applications that we have running, that are processing a lot of data. Their footprint is tiny. So if we were to do this all again, maybe [unintelligible 00:59:44.25] some other languages, and to be completely clear, especially in the context of the conversation we’ve been having, there has been more and more adoption of Rust at Cloudflare. More teams are definitely starting to dip their toe in and figure out if that’s a good fit for them. And TypeScript, too. There has been a lot of TypeScript usage, especially in Cloudflare Workers, because it’s natively supported and it’s fantastic. But Go isn’t going anywhere. We see new Go services deployed every day, because it just does what we need to do incredibly well.

I think one of the huge drawbacks of picking Go and taking this approach that we had - and it’s something that we’re sort of still reevaluating - is, as you can probably infer from what I’ve said, we’ve invested a lot of time in tooling for teams that write Go. If you don’t like Go, we haven’t got actually a whole bunch to help you right now. We haven’t rolled the same libraries in Python, we haven’t rolled them in Rust. So we’re actually kind of making it harder for teams to blaze a trail and potentially do what’s right for them, because using Go is easy for them… Which is kind of by design. We want them to stick with Go until maybe it doesn’t make sense for them anymore, because we’ve got all this great tooling that’s got production-battled experience and it works.

But one thing we’d love to support in the future is the same sort of patterns, ideas for other languages. And so we’ve been exploring some interesting things like could we put gRPC in front of Kafka, and therefore generate bindings for further languages, and therefore we could benefit from the same tooling which would sit just behind our gRPC server, but the teams who interact with Kafka need to know even less about Kafka, because we’ll handle the hard configuration for them? And then we can support these other languages, too. And the thing that keeps causing me to pause is exactly what Kris was talking about, is if we do this, we’re going to remove the need for teams to understand Kafka at all, if we do it well. And that sounds like a great thing, and it maybe is in the short term, but I just feel like it will bite us a lot in the long term if we don’t – this fundamental piece of infrastructure, if it becomes one team, like mine, who knows how everything’s configured and connected to it, and other teams kind of passing through it without ever really truly understanding it, I don’t know if that’s actually optimal in the long term. So we’re still trying to battle that and figure that out, but… As of right now, it’s the only path I can see to scale all this great tooling [unintelligible 01:01:50.17] to a way that we can support all these other languages that are starting to appear in the Cloudflare ecosystem.

[01:01:56.27] I feel like you’ve already mentioned several things that potentially could be challenging in that sense. You’d mentioned you have to be able to consume your messages faster than they’re being written, but if you don’t really understand much about Kafka, and you’re just using this gRPC service, you might not really think about that or realize it. And then you combine that with the fact that even small things like messages being not necessarily in a specific order, depending on how things are set up… But knowing that that’s the case, if somebody’s not really expecting that, that might throw them off as well. So I could see where a lot of those little things, you almost want to force them to understand at least a little bit of it.
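The ordering point can be made concrete. Kafka preserves order within a partition, not across a whole topic, so producers usually pick the partition by hashing a message key; events that share a key then stay in order relative to each other. The sketch below is a simplified stand-in for a hash partitioner, assuming an FNV-1a hash and a hypothetical `partitionFor` helper; it is not the exact code of any Kafka client.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a message key to a partition by hashing the key.
// numPartitions must be greater than zero. This is an illustrative
// partitioner, not the implementation used by any particular client.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	events := []struct{ key, payload string }{
		{"account-1", "created"},
		{"account-2", "created"},
		{"account-1", "updated"},
		{"account-1", "deleted"},
	}
	// Every event for account-1 lands on the same partition, so a consumer
	// sees created, updated, deleted in that order. Nothing is promised about
	// how account-2's events interleave with them if they land elsewhere.
	for _, e := range events {
		fmt.Printf("key=%s partition=%d payload=%s\n", e.key, partitionFor(e.key, 4), e.payload)
	}
}
```

A team that only sees a gRPC front end would still need to know which key its messages are grouped by, which is one of the "little things" that is easy to miss.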

Yeah, I was gonna say, what popped in my head there is this interesting thing – so as I mentioned, I’ve been reading the REST dissertation, and what’s fascinating about it is like a) obviously, there’s nothing about actual programming languages, or actual code… In fact, that’s kind of the point of it, it’s about architecture and behavior… But one of the really interesting things that comes out of REST, and the web as a whole, is that because there’s such a focus on the behavior of the data, and the connectors, and the components, which are the things doing the processing, you wind up with this ability to kind of slot things into places, and have this dynamic infrastructure that just kind of works across everything. It’s actually one of the things that makes Cloudflare be able to be a company at all, the fact that caching is this first-class citizen of REST, and this thing where it’s like yeah, you can have Cloudflare as a cache, you can have Akamai as a cache, your browser has cache in it, the servers have caches on them, and they’re all doing the exact same thing, and it’s kind of this uniform piece of functionality. But the way that happened is because the behavior of the thing was defined, and it says “This is how this component behaves. This is what this component does.” And I think that could be a pathway to saying “Well, for Kafka - yeah, there’s all of this stuff that it does, but we’re going to shrink its interface; we’re going to shrink the behavior of this component down to this specific thing, and then put a uniform interface in front of it, whether that’s gRPC, or HTTP or whatever.” And then you have to worry less about like the intricacies of Kafka and more about “Are we adhering to whatever this interface is?” So that enterprise might say “Well, we have strict ordering for all of our messages”, and then you have to figure out how that component does that. And for what Kafka can do, use Kafka, but if you need to do it with something else, then you swap it out with another technology that can meet that same behavioral constraint.

And I think that level of thinking, which I think is kind of where you’re starting to get to, Matt, is one of the things I’ve found to be missing in a lot of companies, where it’s like, they do focus a little bit too much on the technology. So as important as I said it is to learn the technologies, I think that’s also why it’s important to learn the underlying concepts and theories of things, so you can say “Oh, this is the behavior that we care about. Let’s make something that always embodies that behavior. And if the technology we’re using no longer suits our needs for that behavior we need, then we know we’ve got to go find something else to give us that behavior we need”, which can also help you understand when you maybe need to move away from something like Kafka, or maybe when you need to move to something like Kafka.

Yeah, I think you’ve hit the nail on the head, honestly.

I did have another question about Protobufs, and how are Protobufs managed at Cloudflare? This is always the thing about protocol buffers, is people are like “These things are amazing”, until you actually have to manage how they get developed, and where they’re stored, and how they get distributed to everybody, and all of that stuff.

Yeah… So my number one tip if you’re gonna get started with Protobuf is, even if you’re a small company, centralize them. Put them in one repository, put them in one place, and treat them like code. So we have a – we’re not very good at naming things at Cloudflare… We have a repo called Proto Schema. It does what it says on the tin. And so all of our Protobufs live there. And anyone can make a pull request, but our team notionally owns Protobuf at Cloudflare, so no one gets to merge a Protobuf without approvals from our teams. And I think that’s important. That means we have control and power over stewardship, ownership, breaking change detection… So we’re the first pass; we’ll look through your code and make sure you follow some guidelines and rules.

[01:06:06.21] We also have linters in place. We use a mixture of two things now. So there used to be a tool that was from Uber called Prototool. I think the people who wrote Prototool at Uber left, and created Buf, which is like an open source tool around Protobuf management, effectively, and we use that, too. And what it does is every time someone makes a pull request, it checks a bunch of things for us. It checks: have they followed our naming conventions, have they generated code for all the languages we support? [unintelligible 01:06:28.26] but it does breaking change detection as well.

And so effectively, before you even – you might be making a one-line change to a Protobuf, but already we’ve done like a manual check, we’ve done naming checks, we’ve done breaking change check, and we’ve done some sort of other lints as well. Until you have all those approvals in place and you’ve merged it, you can’t make a change in a Protobuf.
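To give a feel for what a breaking change check looks at, here is a deliberately narrow sketch in Go. It compares two versions of a message keyed by field number and flags removed or retyped fields while allowing new ones. `Field`, `Schema`, and `BreakingChanges` are hypothetical names for this illustration; real tools such as Buf apply a much larger rule set, and this is not how they are implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// Field describes one field of a message schema.
type Field struct {
	Name string
	Type string
}

// Schema maps Protobuf-style field numbers to field definitions.
type Schema map[int]Field

// BreakingChanges reports fields that existed in prev but were removed or
// changed type in next. Adding a new field number is treated as safe.
func BreakingChanges(prev, next Schema) []string {
	var problems []string
	for num, before := range prev {
		after, ok := next[num]
		switch {
		case !ok:
			problems = append(problems,
				fmt.Sprintf("field %d (%s) was removed", num, before.Name))
		case after.Type != before.Type:
			problems = append(problems,
				fmt.Sprintf("field %d (%s) changed type from %s to %s",
					num, before.Name, before.Type, after.Type))
		}
	}
	sort.Strings(problems) // map iteration order is random, so sort for stable output
	return problems
}

func main() {
	v1 := Schema{
		1: {Name: "id", Type: "string"},
		2: {Name: "amount", Type: "int64"},
	}
	v2 := Schema{
		1: {Name: "id", Type: "string"},
		2: {Name: "amount", Type: "string"},   // retyped: flagged
		3: {Name: "currency", Type: "string"}, // added: allowed
	}
	for _, p := range BreakingChanges(v1, v2) {
		fmt.Println(p)
	}
}
```

Running a check like this on every pull request is what lets a one-line schema change be rejected before anyone depends on it.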

That might seem heavy-handed, but from my experience, all of those things are necessary, and I really recommend doing that if you’re going to start treating these as schemas. You should treat them like database schemas; that’s the way I like to think about it. You wouldn’t allow someone to make a migration in your database without having a bunch of checks or approvals, and you should treat this the same way, because they’re effectively going to be used in a very similar way, where people depend on a contract being there.

So we essentially manage them like that, and then when you merge them, it generates all the code for you, depending if it’s going to be used for gRPC or for Kafka. For Kafka, it just generates some Go stuff; just some very light Go code to enforce some of those things I discussed, around you can only emit one message to one topic, and stuff like that. If it’s gRPC, we generate for Python, PHP, Go, I think Rust too now, and then they become – effectively, we then tag those as a later version, using semver, and then anyone can pull those in. We’re very, very keen on all changes being forward and backwards-compatible. It’s super-important.
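One way to read the "one message to one topic" rule is that each topic is bound to exactly one message type, and the compiler rejects anything else. The sketch below shows that idea with Go generics (Go 1.18 or later). `Topic`, `Emit`, and the message types are hypothetical; the generated code described here may look quite different, and `Emit` only formats a string instead of talking to a broker.

```go
package main

import "fmt"

// Topic binds a topic name to exactly one message type T.
type Topic[T any] struct {
	Name string
}

// AccountCreated and InvoicePaid are hypothetical message types; in a
// real setup they would be generated from Protobuf definitions.
type AccountCreated struct{ AccountID string }
type InvoicePaid struct{ InvoiceID string }

// AccountCreatedTopic only ever carries AccountCreated messages.
var AccountCreatedTopic = Topic[AccountCreated]{Name: "accounts.created"}

// Emit only compiles when msg has the type the topic was declared with.
// It returns a description of the send instead of contacting a broker.
func Emit[T any](t Topic[T], msg T) string {
	return fmt.Sprintf("topic=%s payload=%+v", t.Name, msg)
}

func main() {
	fmt.Println(Emit(AccountCreatedTopic, AccountCreated{AccountID: "acc-1"}))

	// The next line would not compile, because the topic is bound to
	// AccountCreated and not InvoicePaid:
	// Emit(AccountCreatedTopic, InvoicePaid{InvoiceID: "inv-1"})
}
```

Pushing the rule into the type system means a mistake shows up at build time rather than as an unreadable message on a shared topic.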

And then one tool we’re going to look at in the not so distant future is once those tags are created, it will then go and create pull requests against the services using it to let them know there’s a new version released. We just try and encourage them to update their – particularly the servers; the servers are the most important ones to update. Because that’s one of the things, you’ll hear people say that Protobuf is forward and backwards-compatible often… Which is true, it is. But if the server receiving the message hasn’t been updated to understand that there’s a newer version of the contract, there could still be some funky behavior, especially if you’ve a field that it’s never seen before and it doesn’t understand. So there’s still some [unintelligible 01:08:18.28] and it’s not perfect by any means, but for any schema that you’re working with, I really do recommend centralizing them and assigning an owner, and I think that will set you up for success.

Okay, I guess that means it’s time to move into unpopular opinions, unless anybody has anything else they need to add…

Nope, don’t think so.

Alright, Matt, I believe you have an unpopular opinion you’d like to share with us.

Yeah, I think this would have been a popular opinion until maybe a year or two ago… But my unpopular opinion today is that I think Twitter, or X, or whatever you want to call it - I’ll never stop calling it Twitter - is actually a really great place to hang around, specifically as a Go developer. And maybe you don’t have the best view of Elon Musk, or maybe you’ve heard some horror stories about other subcultures on Twitter, but one thing I will say is the Go community there is excellent, and it’s really friendly. I’ve had nothing but good conversations and good support there.

Twitter added support for these things called communities, and I made a Go one; it’s got like 3,500 members; it’s super-friendly. So my unpopular opinion is Twitter is still pretty great, and if you’ve been avoiding it because you think that it’s toxic and falling to pieces, as a Go developer I think it’s still quite a nice place to hang out, once you’ve curated your feed a little bit and made sure you’re following the right folks.

I feel like you could have made your unpopular opinion even more unpopular by saying X was the better name…

[laughs]

I’m actually curious how that would do in a poll now… Because I feel like almost everybody just wants to stick with Twitter, because nobody wants change of any sort…

There’s a few things about changing to X that just do not work. Like, the concept of a retweet is a good one, and they’ve changed it to Reshare. And obviously, even writing a tweet - apparently, now they’re called an X. That doesn’t make any sense. That doesn’t work. So yeah… I’d like to say it’s a better name, and I wish that was my unpopular opinion, but it just doesn’t work for me, I’m afraid.

Yeah, I think it’s less that you can’t change the name of something that has this kind of cultural significance, I think it’s that you have to actually change it in a way that creates something with equal cultural significance, and probably isn’t trying to just stomp all over the ground that already is there. Calling it X is weird, because X means so many things already… There’s just not good space there.

I will say, the one thing that always confused me was that building a service where people actually call the act of using your service by its name, like tweeting, or googling, or doing something like that is incredibly challenging… And the goal of a lot of companies is to essentially get to that point, where they are the standard for that. So I’m going to tweet you, I’m going to google that, whatever it is. And then to rebrand that to something else when you have that seems extremely odd to me. You’d never see Google being like “We’re going to change the name of the search engine to something else.” Like, they might change the name of the company and do all this other stuff, but they’re not going to change the search engine. Like, they still want people to google things.

Yeah. You know, sometimes people have other reasons for doing things, or not the best ideas… But I don’t know, maybe one day X will catch on, but I doubt it. I don’t think it will be as ubiquitous as Twitter and tweeting… But I also think that’s because of what Twitter meant when it was kind of surging to popularity, and how it was different from so many other social media platforms, that it’s not inherent in the name, it’s just that it had a useful name, so people were like “Oh, okay.” And quirky features that – you know, subtweeting is a thing, but that’s because of the way you could produce posts that were talking about somebody without actually talking about them, or like @-ing them. Or even just @-ing; that’s a whole thing that Twitter developed because they started using the @ sign so heavily in usernames, that other people now copied. But…

I can’t really – as far as, Matt, your opinion goes, I have very mixed feelings about Twitter, so I’m on the fence there. I find it good for the Go community, like you said, but I definitely can understand that – or I definitely see how other parts of it are exceptionally toxic. So it’s very much a use it for limited things.

But it does a good job of trying to stop you from doing that too, right? It’s got like the For You tab, where it curates stuff that sometimes feels like it’s there to try and annoy you a little bit, and to bait you… So I’m interested to hear where’s the divide for you? When will it be too negative that you’re like “Actually, this isn’t worth it. This isn’t worth the fact that I do get some value out of it. I’m gonna start looking elsewhere.” Because I think a lot of people felt like they passed that and then came back. I think a lot of people felt that they passed it and really didn’t come back. They did find other homes on the internet. So where’s yours at for that?

I know for me a lot of it comes down to notifications, and things like that. I know by default the Twitter app gives you all sorts of notifications for “We think you’d be interested in this tweet”, and things like that. And I had to shut all of that off, because like you said, it was kind of baity, or it was things that I didn’t care about… And when I narrowed it down to just the people I want to follow in the Go community and things like that, it’s really helped.

I think the minute it gets to the point where I can’t just get on and in a couple of minutes see Go-related stuff, I think is when I have an issue. Because if I have to go through a bunch of weeds to get to the actual thing I want, I’m basically just going to stop using it; it’s not necessarily going to upset me, it’s just going to be like “This isn’t worth getting on anymore.”

[01:13:54.02] Yeah, I think for me it’s like – I don’t use Twitter as much as I used to, for sure, but I don’t know if that’s because of the whole change to X and all of that, and the trashing of the algorithm, or if it’s just because I am less interested in consuming social media than I have been in the past… Now that I’m kind of on this whole new adventure, and I’m living in papers from a quarter century ago, or further back - I read one from 1986 the other day - I think I personally have shifted so that social media and the hype cycle that is usually around it is far less interesting to me than it was before.

Yeah, that makes sense.

Okay, Kris, do you have an unpopular opinion you’d like to share?

Yes. I have several, as per usual. I have one that’s related to social media, but I don’t think it’s developed enough yet to be a broadcastable one, so I’ll save that one for a future… The one that I do have, that I’ve been thinking about for a few weeks now - and maybe I’ve said it before…? I don’t think I’ve said it before though. But I think that our nice little catchphrases that we have, we should just put them all in the dumpster. The most notable one is “Don’t reinvent the wheel.” I think that is an egregious piece of advice for people.

On a surface level, we constantly reinvent wheels. We’ve been doing that since we invented the early wheels. Wheels on chariots in ancient Rome look nothing like wheels on F1 cars. So we’ve reinvented the wheel over and over and over again, and we constantly improved it. But I don’t like that phrase specifically because it’s so anti-innovation. It’s like “Oh, well the things we’ve already created - that’s not where we’re going to innovate. It’s on the things that don’t exist yet.” But there’s essentially nothing that doesn’t exist yet for most purposes. It’s like, you’re not going to invent the next internet, or the next telephone, or the next television, like “Truly, this is a new thing.” You’re gonna improve something that already exists.

So I think when people just claim “Don’t reinvent the wheel”, they’re kind of missing out on opportunities to create newer and better things. And also, I think the counter argument to that might be “Okay, well, that’s really about focusing on what your main thing is”, especially like for a startup or a company. But to that, I would say - most companies that exist in the long term pivot away from whatever their initial main idea was. A great example of this is that American Express started off as a package company. They were like Federal Express, FedEx, or they were like UPS. They were shipping things, and then they sprung off into finance, and now they are one of the world’s largest finance companies, and one of the major credit card companies in the world.

So to think that you should only focus on whatever your main thing is right now, and not pay attention to the things around you that you might be able to innovate and change on is also bad advice for people. So I think we should kind of dispense with quips like that, and stop saying “Don’t reinvent the wheel”, and actually find more nuanced words for what we’re trying to explain to people, whether that’s “Hey, that might be a good idea, but we don’t have the bandwidth to focus on that right now.” Or “Hey, that sounds like a good idea, but you need to develop it out more before we can actually go and pursue it”, or whatever the thing is that you’re trying to explain to the person as to why they shouldn’t be doing it. I think just the little quips of “Don’t reinvent the wheel”, or the even more popular ones like “Don’t repeat yourself”, they mean so much at this point that they mean basically nothing, and they are, I think, actively hostile and holding us back.

I think those positive ones were right. The ones you presented I think are – so I’m not saying I disagree with the ones you shared, because I think you’re right, but one I really like is there’s a famous quote from Jeff Bezos. And people asked him “If you’re trying to build a –” It’s not often I quote Jeff Bezos, but if you’re trying to build a company the scale of Amazon, what’s your advice? And his response was something along the lines of “Focus on what makes your beer taste better.” And I really liked that quip. I think it makes sense. It’s like, if you’re building a startup, don’t try and build your own old system to start with perhaps; focus on your key value, focus on your core concept and what you’re trying to innovate on, and what you’re trying to do.

[01:18:14.23] And I think that small quip can actually focus your attention in time on what you might be working on, and I think it can be a positive thing in that instance… Whereas I completely agree with sort of the “Don’t reinvent the wheel.” It’s like “Well, actually, we probably should reevaluate our wheel here, because it’s not working for us.” I can get behind that one specifically.

Yeah. I think there are good quips, but it’s difficult to find the good ones out of the bad ones. I think we should – I’m not saying we should get rid of all of the quips, I’m just saying we should reevaluate – [unintelligible 01:18:41.09] I think we should reevaluate all of the quips that we have, because there are some that just hit really hard and really good… Like, CVS’s mission statement is “Helping people on their path to better health.” And I think that’s such a phenomenal company mission statement, because it’s like – everybody in the company, it’s like “I don’t know, is what you’re doing helping people on their path to better health?” And it’s like “Oh, yes or no?” It’s an easy thing to grok and pull yourself towards. And so I think those ones that inspire you to think more deeply and have more deep conversations are the quips that we should keep, as opposed to the ones that are more used to shut down conversations, as they often and aggressively are.

I can definitely see how the “Don’t reinvent the wheel” one especially is – by definition, it is saying “Don’t go do something.” And I’ve found this with myself; I remember having a conversation with somebody and they’re asking for like help about something… You sometimes have to figure out “What’s the context of this? And what are your goals?” Because it’s far too often somebody will be on the internet and they’ll ask a question of “How do I do this?” and people will say “Oh, just use a service for that”, or “Don’t do that.” Basically, they immediately dismiss them of “Don’t reinvent the wheel.” A common answer for that is people ask how to build an authentication system. I think I see this on the Golang Subreddit like once a month, or maybe once every two months… And it’s almost always unanimously people saying “Don’t do it.” But in reality, some people need to understand how that entire process works. Otherwise, we get to a point where nobody knows how it works. And it’s like “Well, that’s not good.” So if their goal is to learn and educate themselves and get a better understanding of everything, that’s actually a good thing to do. There’s nothing wrong with that.

So I agree with you that it’s not good to just dismiss people instantly of “Oh, that exists. Don’t do it.” But I think it’s better to be more aware of what people’s goals and intentions are, and sort of going from that perspective. Because like you said, there are ways to improve things, but if we never are doing it, then that would be a little bit odd.

Yeah. I think a good counter – maybe not a good counter, but a thing that I’ve been certainly asking myself as I try to build a whole new thing is, instead of framing it as “Don’t reinvent the wheel”, or “Focus on the main thing”, I just ask myself “If not now, then when?”

[01:21:00.20] Like, there’s this thing I’ve gotta do – authentication is a perfect example. It’s like, okay, well, we have to do some sort of authentication. Should I spend time learning, or should I just pull the thing off the shelf? And so I ask myself, “If I don’t learn this now, when am I going to learn it? Or when am I going to figure out how to build it?” And if there’s an answer for that, if it’s like “Oh, in six months, that’s when I know I’ll be able to have the time to actually sit down and learn this well”, then okay. There is a when. I don’t have to do it now. But if there’s not an answer, then that probably tells me that if I don’t do it now, then I’m never going to find the time to do it, and that’s not great.

So yeah, any quip that inspires you to think more deeply, or leads to more questions is a good one. But I think far too many of the ones that we have in tech are just used to shut people down on things, or used to deter people from doing things that they’re interested in doing… Which I think leads to kind of the thing that we were talking about before, of people don’t know the depth of things because everybody’s telling them “You don’t need to know that. Somebody else knows that.” Cryptography is a good example. Don’t roll your own crypto. Don’t put your own crypto in production, unless you’re a cryptographer. But sure, go implement the algorithms, figure out how that works. It’s useful knowledge to have.

Don’t roll your own crypto…

Yes, do not put your own crypto in production.

“Don’t roll your own crypto” is always such a weird one, because it’s like, somebody had to know crypto for it to exist in Go. Somebody had to have understood it. So somebody has to understand crypto. So that advice is definitely not universally true. But people just love to – I get it, because it’s one of those things where you want to tell somebody “If you don’t understand it very well”, then like you said, “don’t push that into production and expect things to go well.”

Yeah. But the discouragement of learning is the thing that I have a lot of annoyance with, because I think people think that there’s like – oh, there’s a point in time when you go and you learn. Or they think like “Oh, you learned in school, and now you’re in the industry, so now you don’t learn.” Or maybe you learned when you were a junior, and now you don’t need to learn. No, this job is learning. Like, if you want to build software, your entire – all of what you’re doing is learning, and you will be continually learning, and everything you build will teach you things, and you’ll have to learn new things to build new stuff. That’s just what it is. So I think the discouragement of learning is what irks me a lot of the time, and I feel like “Don’t reinvent the wheel”, or even some of the principles we have, I’m just like –

I think I might replace all my quip usage with “If not now, then when?” I love that. I think it fits pretty much every situation I can think of. I might just start using that all the time.

I was so happy when I – because what was I doing…? I was doing something with, I think email, or designing my website, or something, and I was just like “Man, I should just go use Substack”, or something. And then I asked myself that question, and I was just like “Oh, this is a good way to refocus myself.” Because there have been a whole bunch of things where I just said “Oh, there is a when. That when is like four months from now”, or a year from now, or whatever from now. Yeah, I want more quips like that. I want more of that nice little nuanced tidbit.

Alright, I think that sums it up. Matt, thank you for joining us. Kris, thank you for helping me host. Hopefully, everybody listening learned something about Kafka, even though we kind of went all over the place on event-driven systems, and everything… But I felt like that was kind of inevitable, because Kafka itself is hard to dive into in too much detail without exploring some of those other topics, and seeing how they relate.

Alright, thank you guys for joining me.

Thanks for having me. Bye.

