Ashley Jeffs shares his journey with Benthos, an open source stream processor that was acquired by Redpanda. We talk about the evolution of data streaming technologies, the challenges he faced while growing the project, the decision to bootstrap versus seek venture capital, and what ultimately led to the acquisition. We discuss reactions to licensing changes, what it’s like to have your thing acquired, the challenging yet fulfilling nature of open source work, what’s next for Benthos, and what it takes to enjoy the journey.
Ashley Jeffs: And yeah, they just kept adopting it. So we had success story, success story… And then I was allowed to – I was still almost exclusively working in my spare time. So I would build stuff in my evenings and weekends that I knew teams would want. I hadn’t been asked to build it, I just kind of saw they were going to – “Hey, this team over there… They’re going to need this thing.” And I just kept adding stuff to it. And what we ended up with is a service that is basically just plumbing. So it’s just taking data from one place to another. But because it’s so flexible, and the abstractions within it are so composable – like, you can compose all these different concepts, and just build out as complicated a pipeline as you actually need, but the bare config is just so simple anybody can get on with it. So you can very slowly learn the concepts over time… That we just ended up deleting services that were stateful, that were doing all this complicated stuff. We were just getting rid of it and replacing it with a Benthos config, and it would be fine.
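To illustrate the “bare config” point, a minimal Benthos pipeline really is just an input and an output. A sketch of that kind of plumbing config (broker address, topic, and bucket names are hypothetical, not from the episode):

```yaml
# Minimal Benthos "plumbing": consume a Kafka topic, archive to S3.
input:
  kafka:
    addresses: [ "localhost:9092" ]   # hypothetical broker
    topics: [ "events" ]
    consumer_group: "benthos_archiver"

output:
  aws_s3:
    bucket: "my-archive-bucket"       # hypothetical bucket name
    path: 'events/${! timestamp_unix_nano() }.json'
```

Processors, caches, and other concepts layer on top of this base shape as you need them, which is the gradual learning curve being described.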
So there would sometimes be a concern of like – to give you a tangible example, something that might actually exist in the real world, rather than abstract concepts. Like imagine doing stream joins, where you’ve got one Kafka topic, another Kafka topic, and you need to join them by some common ID. And ordering isn’t – like, you don’t know which one’s going to arrive first. You could run a smart stream processing service that’s got like a window, maybe you’ve got some sort of window [unintelligible 00:22:00.01] going to be enough to cover the potential difference in timing. And what it will do is it will just join the data by ID, and then you end up with a new feed. That’s going to have state, because it’s inevitably going to need like a disk persisted way of maintaining everything, all the calculations it’s doing, all the aggregates it’s doing, and it’s probably going to store that in S3, or something.
What I would do is just say “Hey, that join - you could just use that Redis cluster that you already have deployed, and use Benthos to cache data from one of the feeds, and then the other feed you could just treat as like the canonical feed, and use that cache to obtain the stuff that it’s supposed to be joined to. And when it doesn’t do that because it’s too early, and the data hasn’t arrived yet, you just pop it in a dead letter queue with a time delay on it.” And that’s just config. Like, that’s just a very simple config. And when you’ve deployed it, what you have is – so what you’re telling the operations people, the people who actually have to like get paged and wake up to play with these services when they fail - all they have is a Kafka consumer that can just be restarted, because it’s got delivery guarantees. So if it’s having a [unintelligible 00:23:15.12] and it’s not operating as it should, you just restart it. Or if it crashed overnight, it’ll just come back online and automatically pick up where it should, replay data that it hasn’t finished with yet. And then the Redis cache - it’s just a Redis cache. So you already know what that looks like. You already know what it looks like to have redundancy there, and backup, and stuff.
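A sketch of what the join side of that setup might look like as Benthos config. This is an illustration under assumptions, not a config from the episode: topic names, the Redis URL, the shared `id` field, and the dead-letter topic are all hypothetical, and the cache-populating side (reading the other feed and writing it into Redis) would be a second, similarly small config:

```yaml
# Join side: consume the "canonical" feed, enrich each message from a
# Redis cache that a separate Benthos pipeline fills from the other feed.
input:
  kafka:
    addresses: [ "localhost:9092" ]    # hypothetical broker
    topics: [ "feed_canonical" ]
    consumer_group: "benthos_join"

cache_resources:
  - label: join_cache
    redis:
      url: "redis://localhost:6379"    # the Redis cluster you already run

pipeline:
  processors:
    # Look up the matching record from the other feed by shared ID and
    # merge it into the message. A cache miss flags the message as errored.
    - branch:
        processors:
          - cache:
              resource: join_cache
              operator: get
              key: '${! json("id") }'
        result_map: 'root.joined = this'

output:
  switch:
    cases:
      # Too early - the other side hasn't arrived yet, so park the message
      # on a dead letter topic to be retried after a delay.
      - check: errored()
        output:
          kafka:
            addresses: [ "localhost:9092" ]
            topic: "join_dlq"
      # Successful join: emit the enriched message downstream.
      - output:
          kafka:
            addresses: [ "localhost:9092" ]
            topic: "feed_joined"
```

Because the Kafka input provides delivery guarantees, this whole service is stateless and restartable, which is exactly the operational property being described: if it misbehaves, you just restart it and it replays what it hadn’t finished.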
And it’s kind of flipping the problem on its head, because what you’re doing is you’re saying that the thing that’s streaming, the live feed, this thing that you want to keep low latency, and is very important, it’s dealing with hot potatoes - that’s actually stateless. And the state exists somewhere else that’s kind of like your more stable components. They’re not really hot. The Redis instances, they’re not necessarily going to be hot if you scale things out properly, depending on the use case, that kind of thing. But the idea is that now you’ve kind of eliminated complexity, where before there was some… And there’s performance implications. So it might be that by doing that, you’ve now spent 10 times as much money on the storage, and all this other stuff.