In this episode Matt joins Kris & Jon to discuss Kafka. During their discussion they cover topics like what problems Kafka helps solve, when a company should start considering Kafka, how throwing tech like Kafka at a problem won’t fix everything if there are underlying issues, complexities of using Kafka, managing payload schemas, and more.
Matthew Boyle: So I think [unintelligible 00:55:00.06] good or bad, but I guess just kind of a brief story of how we’ve been using it at Cloudflare, which has been interesting… So firstly, I did mention that we’ve been using Protobuf for our schemas. The support for Protobuf and gRPC in Go is excellent. It’s first-class. So that was a good fit and a good choice, and I would make that choice again. So for schema management, Protobuf’s definitely worth looking at, especially if you are a predominantly Go shop.
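As a rough illustration of what a Protobuf-managed schema looks like, here is a minimal sketch; the package, message, and field names are invented for the example, not Cloudflare’s actual definitions:

```proto
// Hypothetical event schema; names are illustrative only.
syntax = "proto3";

package events.v1;

option go_package = "example.com/gen/events/v1;eventsv1";

import "google/protobuf/timestamp.proto";

message AuditLogEvent {
  string id = 1;
  string actor = 2;
  string action = 3;
  google.protobuf.Timestamp occurred_at = 4;
}
```

Running `protoc` with the `protoc-gen-go` plugin over a file like this generates the Go types that producers and consumers share, which is what makes Protobuf attractive for schema management: the schema lives in one place and every service compiles against the same generated code.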
Something else we did is we created a Go library that we call Message Bus. Very creative. And effectively, what we do in this is we use a Go library called Sarama, which was created by Shopify and basically allows you to do pretty much everything with Kafka. And I actually credit that library with an awful lot of the adoption of Kafka, both at Cloudflare and elsewhere where Go was involved, because it enabled you to basically do everything that the Java libraries were doing.
One really hard thing about Kafka, as we talked about, is configuration. So one good choice that we made as a company, I think, is we made it an opinionated library, that kind of set up a very good set of default settings and constraints for how we think you should interact with Kafka at Cloudflare, and made it as easy as possible for you to do it. It’s got a bunch of power user settings, if you will, where you can override what we deem to be the best settings, but that was a pretty good choice, I think. And we added a bunch of Prometheus metrics within that library as well, so it means that everybody who pulls in our library gets a dashboard for free showing how their Kafka service is performing… Which was very, very helpful, and again, is another thing I’d recommend doing. It’s not Go-specific, you can do it in any language, but we were able to do it with Sarama.
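The “opinionated defaults with power-user overrides” shape described here is commonly built with Go’s functional-options pattern. The sketch below is an assumption about how such a wrapper might look, using invented names and defaults rather than Cloudflare’s actual Message Bus library:

```go
package main

import (
	"fmt"
	"time"
)

// Config captures the Kafka settings an opinionated wrapper might pin down.
// All field names and default values here are illustrative.
type Config struct {
	Brokers      []string
	RequiredAcks string // e.g. "all" for strongest durability
	MaxRetries   int
	RetryBackoff time.Duration
	ReturnErrors bool
}

// Option is a "power user" override applied on top of the defaults.
type Option func(*Config)

// WithMaxRetries overrides the default retry count.
func WithMaxRetries(n int) Option {
	return func(c *Config) { c.MaxRetries = n }
}

// NewConfig returns the opinionated defaults, then applies any overrides.
// A real library would also register Prometheus metrics here, so every
// consumer of the library gets the same dashboard for free.
func NewConfig(brokers []string, opts ...Option) *Config {
	c := &Config{
		Brokers:      brokers,
		RequiredAcks: "all",
		MaxRetries:   5,
		RetryBackoff: 250 * time.Millisecond,
		ReturnErrors: true,
	}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	cfg := NewConfig([]string{"localhost:9092"}, WithMaxRetries(10))
	fmt.Println(cfg.RequiredAcks, cfg.MaxRetries)
}
```

The design point is that callers get safe settings by doing nothing, and every deviation from the defaults is explicit and greppable.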
Slight tangent, but Sarama actually got picked up by IBM. So IBM is now responsible for the maintenance of Sarama, because it turns out that Shopify aren’t using it too much anymore. So IBM have taken over stewardship of it. So that was a really cool thing to do… I haven’t checked in on the project in a while, to see how it’s progressing, but it was excellent that they put their hand up to carry on stewardship.
And then the final thing that we have been using Go in Kafka for is we built this thing called – we call them connectors, and they’re built on… There’s a framework called Kafka Connect, which effectively allows you to plug some code into your database, [unintelligible 00:57:09.24] into Kafka, and then it just moves the data between the two. So when people are trying to take things out of a database and push it to Kafka, connectors are a pretty common way to do that.
We built our own framework that we also call Connectors; it’s all written in Go, and effectively with a very small configuration file you write in YAML, we allow you to specify a reader, some transformations to apply, and then a writer. And so what this means is teams can deploy very simple code that reads from a database, applies some transformation to a Protobuf format, and writes it to a Kafka topic. They can do it without actually writing any code; they just create some environment variables and deploy it. And same thing - we’ve got Prometheus metrics, you get a dashboard for free, and you get some alerts around it for free, and stuff.
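A reader/transformations/writer pipeline declared in YAML might look something like the following. This is purely a sketch of the idea; the actual field names of Cloudflare’s connector framework are not public, so everything here is invented:

```yaml
# Illustrative connector config; all keys and values are hypothetical.
reader:
  type: postgres
  dsn_env: DATABASE_URL        # connection string taken from an env var
  table: audit_logs
transformations:
  - type: to_protobuf
    message: events.v1.AuditLogEvent
writer:
  type: kafka
  topic: audit-log-events
  brokers_env: KAFKA_BROKERS
```

The appeal of this shape is that a team ships data from a database into Kafka by deploying configuration, not code, while the framework owns the hard parts: serialization, retries, metrics, and alerting.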
[00:57:49.17] So all of these things have really helped with Kafka adoption. And I think if you’ve got the resources to deploy Kafka at your company, I would really consider having a team like mine, a platform team that provides tools and services that make it easy for other teams to do the right thing, and to teach them, too. I think a huge part of our team’s job is just teaching as well, and just making sure people are following the right patterns when using some of these things. And it can help overcome some of these barriers to entry, but obviously, it’s a large cost investment.
One of the reasons Cloudflare picked Go in the first place, and we continue to use it, is it just scales so well. We’ve had a couple of issues with Kafka consumers not being able to keep up with the number of messages that are being passed through, but after some small tweaks that you would have to make in any language, we’ve been very easily able to scale a bunch of our services to tens of thousands of messages being read a second, without too much heartache. We haven’t had to do anything clever, we haven’t had to write any sort of crazy code to do so. It’s just kind of standard optimizations that – you know, a linter will probably help you with most of them, if I’m honest. So Go has been fantastic for that… And even people who join Cloudflare who haven’t got experience with Kafka and haven’t got experience with Go, we’re usually able to get them productive and writing decent Go in a week or two just because of 1) how easy Go is to learn, 2) how easy it is to read, which I think is really important and often overlooked with Go. Being able to read it… I can pick up pretty much any Go service and I can follow it through, and I can roughly figure out what it’s trying to do. I can’t promise you the same thing for Java or PHP, where there’s lots of auto-wiring and magic, and you have to understand the framework a little bit.
So that’s been really, really powerful and useful for us in terms of adoption, too. And generally, the performance of the Go app. We are a cloud, so we pay a lot of attention to resource utilization, and containerized Go services… I think they are so tiny in comparison to some of the other things that we have to run. And even quite complicated applications that we have running, that are processing a lot of data. Their footprint is tiny. So if we were to do this all again, maybe [unintelligible 00:59:44.25] some other languages, and to be completely clear, especially in the context of the conversation we’ve been having, there has been more and more adoption of Rust at Cloudflare. More teams are definitely starting to dip their toe in and figure out if that’s a good fit for them. And TypeScript, too. There has been a lot of TypeScript usage, especially in Cloudflare Workers, because it’s natively supported and it’s fantastic. But Go isn’t going anywhere. We see new Go services deployed every day, because it just does what we need to do incredibly well.
I think one of the huge drawbacks of picking Go and taking this approach that we had - and it’s something that we’re sort of still reevaluating - is, as you can probably infer from what I’ve said, we’ve invested a lot of time in tooling for teams that write Go. If you don’t like Go, we actually haven’t got a whole bunch to help you right now. We haven’t rolled the same libraries in Python, we haven’t rolled them in Rust. So we’re actually kind of making it harder for teams to blaze a trail and potentially do what’s right for them, because using Go is easy for them… Which is kind of by design. We want them to stick with Go until maybe it doesn’t make sense for them anymore, because we’ve got all this great tooling that’s battle-tested in production and it works.
But one thing we’d love to support in the future is the same sort of patterns and ideas for other languages. And so we’ve been exploring some interesting things like could we put gRPC in front of Kafka, and therefore generate bindings for further languages, and therefore we could benefit from the same tooling, which would sit just behind our gRPC server, but the teams who interact with Kafka need to know even less about Kafka, because we’ll handle the hard configuration for them? And then we can support these other languages, too. And the thing that keeps causing me to pause is exactly what Kris was talking about: if we do this, we’re going to remove the need for teams to understand Kafka at all, if we do it well. And that sounds like a great thing, and it maybe is in the short term, but I just feel like it will bite us a lot in the long term if we don’t – this fundamental piece of infrastructure, if it becomes one team, like mine, who knows how everything’s configured and connected to it, and another team’s kind of passing through it without ever really truly understanding it, I don’t know if that’s actually optimal in the long term. So we’re still trying to battle that and figure that out, but… As of right now, it’s the only path I can see to scale all this great tooling [unintelligible 01:01:50.17] to a way that we can support all these other languages that are starting to appear in the Cloudflare ecosystem.
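The “gRPC in front of Kafka” idea could be sketched as a service definition like the one below. This is hypothetical, not an existing Cloudflare API: the point is that once a facade like this exists, bindings can be generated for any gRPC-supported language, and all broker configuration stays behind the server:

```proto
// Hypothetical facade service; names are illustrative only.
syntax = "proto3";

package messagebus.v1;

service MessageBus {
  // Publish hands a serialized payload to the facade, which owns all
  // broker configuration, retries, and schema validation.
  rpc Publish(PublishRequest) returns (PublishResponse);
}

message PublishRequest {
  string topic = 1;
  bytes payload = 2; // a serialized Protobuf message
}

message PublishResponse {
  int32 partition = 1;
  int64 offset = 2;
}
```

This also makes the trade-off in the paragraph above concrete: callers of `Publish` never see acks, partitioning, or retries, which is exactly the loss of Kafka understanding being weighed.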