Changelog Interviews – Episode #606

Reinventing Kafka on object storage

with Ryan Worl, Co-founder & CTO at WarpStream

Ryan Worl, Co-founder and CTO at WarpStream, joins us to talk about the world of Kafka and data streaming and how WarpStream redesigned the idea of Kafka to run in modern cloud environments directly on top of object storage. Last year they posted a blog titled, “Kafka is dead, long live Kafka” that hit the top of Hacker News to put WarpStream on the map. We get the backstory on Kafka and why it’s so widely used, who created it and for what purpose, and the behind the scenes on all things WarpStream.

Sponsors

Speakeasy – Production-ready, enterprise-resilient, best-in-class SDKs crafted in minutes. Speakeasy takes care of the entire SDK workflow to save you significant time, delivering SDKs to your customers in minutes with just a few clicks! Create your first SDK for free!

Supabase – Supabase just finished their 12th launch week! Check it out. Or get a month of Supabase Pro (FREE) by going to supabase.com/changelogpod

Paragon – Ship native integrations to production in days with more than 130 pre-built connectors, or configure your own custom integrations. Built for product and engineering. Learn more at useparagon.com/changelog

Unblocked – Other developer tools can’t tell you how your codebase works and why. Unblocked can. We augment your code with context from Slack, Confluence, Jira, and more, so you get accurate answers without having to search for them. Sign up for free at getunblocked.com

Chapters

1 00:00 This week on The Changelog 01:37
2 01:37 Sponsor: Speakeasy 03:32
3 05:09 Start the show! 01:18
4 06:28 What is Kafka? 01:44
5 08:11 Kafka use cases 02:28
6 10:39 Long live Kafka! 02:06
7 12:45 Kafka on prem? 02:16
8 15:01 What makes it hard to run? 02:44
9 17:45 A margin of haters 05:33
10 23:18 We met at Percona Live 04:57
11 28:14 Sponsor: Supabase 03:24
12 31:39 Object Storage secrets 04:42
13 36:20 Kafka brokers 09:53
14 46:13 Downsides besides latency 03:25
15 49:38 Could you R2? 00:43
16 50:21 Very good demo 02:47
17 53:08 $ warpstream playground 07:02
18 1:00:10 The path for greenfield 02:27
19 1:02:37 Why not open source? 14:17
20 1:16:54 Sponsor: Paragon 02:56
21 1:19:50 Sponsor: Unblocked 02:01
22 1:21:52 Why not bootstrap? 10:48
23 1:32:40 Let's talk pricing 03:17
24 1:35:57 Deep into pricing 04:24
25 1:40:21 A good next step? 01:43
26 1:42:04 Closing thoughts and stuff 02:05

Transcript


Alright, today we are joined by Ryan Worl from WarpStream. Ryan, welcome to the Changelog.

Thanks. It’s great to be here.

Great to have you. Shout out to listener Vladimir for requesting this episode. Also, shout-out to your co-founder, Richard, who unfortunately couldn’t be here today, but… Hey, Richard.

What’s up, Richard?

Yeah. But you’re here, so let’s talk to you and not to Vladimir and to Richard. That being said, Vladimir requested this episode; you too, listener, can request episodes. Head to Changelog.fm slash request. Let us know what you would like to hear about on the pod, and we might just fulfill your every desire.

Vlad wanted to hear about WarpStream, and so that’s why Ryan is here. It just so happens that Adam and I both would also like to hear about WarpStream, so… Here we are.

Let’s start with Kafka though, because it sounds like WarpStream’s story starts with Kafka’s story. What is Kafka, besides an author from the early 1900s? …but the open source thing - what is that thing all about?

Yeah. Kafka is both a very interesting and a very boring system. The easiest way to think about it is it lets you create topics, and you can have producers that write messages into these topics, and consumers that consume messages out of the topics. It’s kind of like a publish and subscribe type deal. But the thing that makes it interesting is the fact that once you consume those messages, they’re not deleted. So they’re still stored inside the system, and another consumer can go and read them, again, for a different purpose. If you have two different applications that are consuming the same dataset, they can both equally consume those messages. Say that you have one application that does machine learning training, and another that does alerting based on the two different – the same messages, you want to process them, but you want to process them in different applications, and Kafka is a useful tool for that.

It also provides ordering for those messages, so that if you need to implement an application where you send messages in a certain order, and you want that order to be retained on the other side, Kafka also does that for you. Each message is assigned a unique offset within a partition of that topic, which is kind of like a shard. And within that shard, if you process the messages in the same order again, or if you process the messages in that partition again, you’ll get them back in the same order every time. So you can implement something like state machine replication, or that type of thing where the ordering matters.
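The model described above can be sketched in a few lines. This is a toy in-memory stand-in, not a Kafka client: the `TinyLog` class, its method names, and the group names are all hypothetical. It only illustrates two points from the conversation: consuming does not delete messages, and each partition hands messages back in offset order.

```python
from collections import defaultdict


class TinyLog:
    """Toy in-memory stand-in for one Kafka-like topic (not a real client)."""

    def __init__(self, num_partitions):
        # Each partition is an append-only list; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]
        # Each consumer group tracks its own next offset per partition.
        self.positions = defaultdict(lambda: [0] * num_partitions)

    def produce(self, partition, message):
        self.partitions[partition].append(message)
        return len(self.partitions[partition]) - 1  # offset of the new message

    def consume(self, group, partition):
        """Return the messages this group has not read yet; nothing is deleted."""
        start = self.positions[group][partition]
        batch = self.partitions[partition][start:]
        self.positions[group][partition] = start + len(batch)
        return batch


log = TinyLog(num_partitions=2)
for event in ["login", "purchase", "logout"]:
    log.produce(0, event)

# Two independent applications read the same messages, in the same order.
training = log.consume("ml-training", 0)
alerting = log.consume("alerting", 0)
```

Both `training` and `alerting` end up holding the same three events in the order they were written, because each group only advances its own position in the log.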

Okay, so what are some use cases for this? It sounds like lots of well-funded companies use it, larger companies… And I think that some of that is because of the operational complexities, and the love/hate relationship with it. But why are people grabbing this particular tool often?

Yeah. The reason why it’s useful is there just isn’t a lot out there that fulfills the two main things. It’s like a publish and subscribe mechanism that’s scalable. And then also that lets you have different consumers process the same set of messages without one of the consumers deleting it. Like, there’s a lot of queuing systems that the messages when you consume them once, they’re just gone forever at that point. The purpose is to consume the message and have it go away, not to reprocess it again in the future.

There are a lot of use cases for it. I’d say that the most broadly popular one is for moving data from point A to point B; kind of like a dumb pipe. It’s used a lot in observability and security-related workloads, where you have a lot of application servers that are generating logs, and you want to temporarily put those logs somewhere before you put them in something else. Say you want to put them in Elasticsearch, or something like that. Elasticsearch can be a little finicky, so you want to have Kafka, which is a much simpler system in place, as a temporary buffer to hold those log messages that you want to write to Elasticsearch in case that Elasticsearch cluster is down, or you’re doing an upgrade, or something like that. There’s a lot of different reasons for it, but Kafka is pretty much the de facto standard for those kinds of workloads.

[00:10:00.22] And then when you get outside of observability and security, there’s a lot of people that are building custom applications on top of Kafka, like an inventory management system for a warehouse, where you want to keep track of the real-time status of everything going on in the warehouse, and you might want to send messages to say “Oh, this new batch of inventory has been added onto the shelves of the warehouse. I’m taking things out”, and then you’re computing some type of a live application based on that inventory data to say - you know that you need to replenish the stock when it goes below a certain amount, but you want to do that in real time, so that you can react faster than just doing this once a day. Something like that.

So Vladimir pointed us to a post which I think Adam and I, we had both already read this post, because it was last year, last summer, I believe. “Kafka is dead. Long live Kafka.” This was your big coming out party, it seems. A great way to introduce WarpStream. And in that post, you said that Kafka is one of the most polarizing technologies in the data space… Whether it was you or Richard who wrote that. Then you just moved on and kept going, assuming that we all just knew why, or how, or agreed that that was just true. I assume it’s true; it’s probably polarizing. But why is it polarizing? My guess is because it’s useful, but difficult to use, and so people love it and hate it. But maybe there’s more to it than that.

So I think that there are probably two main criticisms that people have of Kafka. The first is that it’s hard to run. As the operator, you have to have a lot of knowledge about how to use the open source project appropriately. And the second major issue is the cost. I’m sure we’ll get into this, but the cost of running open source Kafka in the cloud - it’s pretty high compared to what people expect it to be. If you think of it as a dumb pipe, you would expect to pay dumb pipe type rates for it. But given the fact that it requires triply replicating the data onto local disks, and you have to pay – you know, most of the cloud providers are charging you money for inter zone replication… You end up paying a lot more than you expect, even if you’re just storing the data temporarily. If you’re using open source Kafka in AWS, for example, the minimum cost for a highly available 3-AZ setup for the cluster is 5.3 cents per gigabyte, compressed gigabyte, written into the cluster. That’s just to do the replication part. The storage part is a whole other story. It depends on how long you want to store the data for. But if you’re starting out and that’s your baseline cost, it can get pretty expensive, pretty quickly.
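One plausible way to arrive at a figure close to 5.3 cents is sketched below. The breakdown is not spelled out in the conversation, so treat every input as an assumption: an inter-AZ transfer price of $0.01 per gigabyte in each direction, producers spread evenly across three zones, and a replication factor of three with one replica per zone.

```python
# Assumed AWS inter-AZ transfer price: $0.01/GB out plus $0.01/GB in.
INTER_AZ_PER_GB = 0.02

# Assumption: producers are spread evenly over 3 zones, so 2 of every 3
# writes go to a partition leader that sits in a different zone.
produce_cost = (2 / 3) * INTER_AZ_PER_GB

# The leader then copies each gigabyte to followers in the 2 other zones.
replication_cost = 2 * INTER_AZ_PER_GB

total_per_gb = produce_cost + replication_cost
print(f"${total_per_gb:.4f} per GB written")
```

Under those assumptions the total comes out to roughly 5.3 cents per gigabyte written, before any storage or consumer-side traffic is counted.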

Is anyone building or using Kafka, open source Kafka, as you said, in a scenario where they’re not on public cloud, where they’re building out their own infrastructure, where it’s probably maybe even harder because you’re literally managing the disks, you’re not ordering the disks, or SRA-ing the disks? …you’re literally managing the disks. Is that a scenario that happens, or is it less likely?

So that’s definitely a thing that happens. I know of companies that do that. But just as the migration to public cloud over the last 10 years has only increased in velocity, essentially, that is becoming less and less popular, because it is indeed hard. And it’s even harder when it’s in your own data center, as opposed to the cloud, where you can just ask for more disks, and you get them right away.

The cost situation is a little different there too, because typically the way that you’re provisioning network in your own data center would not end up with a per-gigabyte cost. I mean, obviously, you amortize everything over how much data you’re transferring inside your data center, but you’re buying it in terms of hardware, and your per-gigabyte rate if your traffic goes up doesn’t correlate the same way linearly as it does with Amazon. But that’s definitely still a thing people do, but it’s less and less popular every day.

[00:14:03.00] Continue with the polarizing. What else is polarizing about Kafka?

Some people have some strong opinions about the actual developer programming model of Kafka, and that it’s a little hard to use sometimes. I think that’s less of a big deal these days as more tools have integrated with Kafka; it makes it even easier to use Kafka than – there are some other systems that might have a theoretically easier to use programming model… But everything speaks Kafka now, so those concerns are mostly trumped by the fact that it’s the de facto standard.

I think really what most people are concerned about when – like, if you don’t use Kafka today and you’re thinking about bringing it into your company, the two things that you’re going to be concerned about are how hard is it to run, and how much is it going to cost. Those are typically people’s two big blockers. It doesn’t have anything to do with the fact that conceptually they have an issue with Kafka, it’s just those more practical things.

What makes it so difficult to run? Is it the SSDs? I think that post also called it finicky… Is it poorly architected? Why is it finicky?

It’s a number of different things. I think the first one is - yes, being responsible for anything that stores data on local disks. If you want to achieve high availability and high durability of your data, it’s challenging. It requires experienced SREs to like handle those types of failures when they do occur. But that I think can be dealt with, because people do that with other systems all the time. But I think that most people’s problems with Kafka come when they want to scale up and scale down the cluster in response to load. The open source product doesn’t really give you much tooling when it comes to helping you manage that process. For example, in the open source product there is no automated tool to rebalance the data among the machines when you add or remove machines. Like, that’s kind of a table stakes feature in a lot of – if you’re thinking about a distributed relational database, that would seem kind of silly if you had to like run a script to move data between the nodes in the database. But that is true of open source Kafka. And there are now other tools that you can use alongside of it, that can take some of this work off of you, but they’re not always the easiest to use either. It’s not like a self-balancing, self-managing thing, like a lot of the distributed relational databases are. It’s something that takes a little bit more hands-on work.

And another thing that goes along with that is if you’re storing data for a long period of time, the open source project didn’t add a tiered storage feature until very recently. And the time that it takes just to copy the data around from machine to machine when you’re scaling up or scaling down the cluster can be hours or days, depending on how dense you’re running the machines.

Some of that is alleviated with the new tiered storage stuff, where the older data is moved to object storage, but that part doesn’t alleviate the inter-AZ networking costs. And there’s another post on our blog about tiered storage in Kafka if people are interested in learning more about that topic.

It is open source though, right?

Apache Kafka?

Yeah. The project is managed by the Apache Foundation, and has a variety of contributors across a ton of companies. And yeah, I would say it’s a fairly healthy example of an open source project in terms of like having a big community around it.

[00:17:47.25] There’s a margin of haters, let’s just say, towards Kafka. And it is open source, and I’m just curious… You may be in that bucket of margin of haters, because you’ve created WarpStream, right? So you’re kind of - not for, you’re kind of against, at least from an economic standpoint. And maybe a DX standpoint, and many other standpoints. The point I’m getting to is why not just improve Kafka?

So there are a lot of practical challenges with improving a large open source project, with a lot of users and a lot of dependent parties, I should say; not even necessarily just users, but stakeholders of all kinds. Making large, sweeping changes is essentially impossible. The amount of code churn required to take open source Kafka and get it to something resembling the architecture of WarpStream is just not going to – that’s not going to happen in any reasonable amount of time. That’s the first part. If you just wanted to abstractly, no financial interests involved, how would you do this - it would be very hard, practically.

The second reason is that WarpStream makes a pretty different set of trade-offs than the open source project does in terms of the environment that we expect users to run in. Now, I think those trade-offs are correct for the world that exists today. But in the abstract, it is different than the open source project. So WarpStream stores data only in object storage; that’s step one. You need an environment that has object storage. And then step two is that we run a control plane for the cluster, which in the open source, the comparison would be kind of like somebody who’s running Zookeeper, or KRaft, which is their replacement for Zookeeper inside of the open source project… It’s kind of as if we’re running that for you remotely, and then you’re running the agents, as we call them, which is the replacement for the Kafka broker inside your cloud account. So there’s a very specific topology that we’re prescribing to our customers as well. That’s different. It probably wouldn’t fly in an open source environment, or at least it would make it even more challenging to run, potentially. I think those are probably the two biggest reasons of why we couldn’t just improve Kafka, is it just would be too hard practically to make improvements, and then also we’re making trade-offs around how we see the world existing today and how we think it’s going to continue to exist in the future, that a lot of the stakeholders to the open source project may not agree with our assessment there, basically.

Good answer. I was expecting a version of that. I was not suggesting that you should just not start WarpStream, and by all means just go contribute to Kafka and bail… But it’s always good to get that perspective, because Kafka has got history. It’s 13-ish years old. It was developed inside of LinkedIn for different purposes… That’s why I started off with the question which was their own infrastructure, because LinkedIn designed this for a different purpose than everybody else [unintelligible 00:20:42.11] uses it. It was not designed to be used in a cloud environment where there’s a lot of egress fees, and a lot of fees between moving data around… And so it was not really designed for its actual use case, or the usage space that it’s in. And LinkedIn did not charge its users those transaction fees. I assume potentially because - and I don’t know LinkedIn’s infrastructure history, but I assume because they had far more control over their cloud or their own environment to not have to deal with those costs than maybe everyone else who’s become a Kafka user has had to take on.

Yeah. The way that I like to explain that, the networking costs side, is that when you’re renting space in a colo, or you have your own data center, you’re implicitly paying for what is kind of a fixed capacity resource. It has a very high fixed capacity, but you are essentially paying for a resource that has a fixed capacity without doing a bunch of capital improvements to your data center. Whereas if you’re in the public cloud, you can show up and put a credit card down and start moving gigabytes a second across the network without asking anybody for permission, nothing.

[00:21:56.27] So you’re paying kind of a tax for that flexibility of being able to show up without asking anybody, all of a sudden start moving a ton of data; and especially in terms of how spiky you can do it. You can write a hundred gigabytes a second for one minute, and then never pay Amazon any money again, and they have to do some capacity planning on their end, just like they do for every other service, and why they charge higher on-demand rates for EC2 instances than if you go and buy a random server off the internet and put it in your house. The cost looks very different.

Now, whether that cost is right, whether that reflects real economic realities, I don’t think anybody can say without being inside of Amazon, but I think there’s a pretty logical rationale for why it exists that way, because there are people that will consume bandwidth in a very different – you have to think about the worst case scenario users basically of your service, the people that you – you might even call it abusers of your service, in terms of your cost profile. So I think that’s why, as you’re saying, you’re correct that LinkedIn can just decide to use Kafka in a different way internally, to match their ability to provision infrastructure. And Amazon can’t really force you to do that in any way other than just charging you more money for it. So that’s why they do it.

So you and Richard - did you guys meet at Datadog? Is that where you guys connected, or was he at Datadog? Tell us a little bit of the history of you two.

Yeah, so Richard and I met a little over five years ago now at a conference. We met at Percona Live. I think it was 2019 in Austin. And he was working at Uber at the time. And yeah, so we did eventually both end up joining Datadog, but that was a little later.

Gotcha. And while you were there, you had put some sort of Datadog infrastructure on S3, or on object storage. Husky, I think… I’m going from memory now.

Yeah. So my co-founder, Richie and I, after he left Uber, we started working on a prototype of a system that was – the idea was basically Snowflake for observability data. That was like the elevator pitch. And we were going around, pitching that to investors at the time, and that’s how we got to know some of our investors in WarpStream today, as we met them back in those days.

And that eventually caught Datadog’s attention, and we ended up joining Datadog together to build that system, Husky, with – some of our current colleagues at WarpStream were also there at Datadog, building that system with us.

Basically, the idea there was to replace the legacy system inside of Datadog for a lot of the kind of – basically, anything that you can think of that’s not like pre-aggregated time-series metrics. The idea was to be – we think of it as like timestamp plus JSON. That was the data model, basically. And we wanted to move all that data to object storage for – there’s a ton of different reasons for it, similar to the reasons why WarpStream is useful… But yeah, over the three and a half years that my co-founder and I were there, we migrated all of the products that were using the legacy system over to Husky.

Yeah. I mean, that’s why I asked about it, because it seems like it’s a precursor to this very similar move with Kafka, right? Like, what if we took Kafka, ripped out the local storage aspect of it - sounds easy enough - and built something… I mean, by ripped out, conceptually ripping out, right? You didn’t fork Kafka and write this, right? You started over.

Yeah, we started from scratch, and writing it in Go.

[00:25:52.16] Right. So conceptually rip it out, but actually rewrite something that’s Kafka-compatible in terms of features and API, I assume, and all that kind of stuff. But no local storage; object storage. And your success with what happened at Datadog probably led the way for you to say “Well, if we did that, it would be a lot cheaper, basically.” And way easier to operate, because - hello, Amazon Web Services, right? It’s their problem now.

Yeah. There’s definitely a lot of high-level conceptual overlap. The systems are extremely different, because one looks more like an OLAP database, and the other is – I mean, Kafka is more like a log. So there’s some very high-level conceptual similarity. And I think the thing that we really got the most experience with there was learning about object storage. So that’s about where the similarities stop, is just like the deep experience of understanding how object storage works at scale in all of the major public clouds was a hugely valuable learning experience for us to know. When we left and we were doing the back of the envelope math on “Could we make this thing work?”, the experience with object storage that we learned there was pretty helpful.

Now, I think a lot of object storage – people talk a lot about object storage nowadays, so I think that’s not like an unknown thing to understand the characteristics of working with it nowadays… But I’d say in 2019 that was a fairly different story. I think the only people that would know a lot about building high-performance systems on top of object storage, they were probably all either inside the public cloud providers themselves, or they were working at Snowflake or a similar company. The knowledge was not super-well distributed at that time.

Most people, when they think of object storage, they think of something that’s super-slow. Like, they’re thinking about it in terms of like seconds of latency to do anything… And that you have to rework your – the numbers around it are very different than what people might think of off the top of their head, and that opens up a lot of design possibilities that you don’t think of immediately.

Break: [00:28:07.15]

What are some lesser known things about object storage that you know that we don’t know? Or maybe nobody knows besides you.

[laughs] Somebody’s gotta know…

Yeah, it’s not really one secret trick. I think –

Dang it!

…it’s just a conceptual framing, that you have to think of it as if you had access to a very large, oversubscribed array of spinning disks. If you think about it like that, then the conceptual framing of how you design a system around it will make a lot more sense. So there’s a couple different pieces of that… Really large, way bigger than your individual application. So you have the world’s biggest RAID zero of all the disks ever. Conceptually unlimited. So think about it that way. But also oversubscribed. So the latency characteristics of it are highly variable. One request might take 10 milliseconds and the other takes 50, and there’s no discernible reason to you why that is the case. It’s just that is how it works, so you have to design around that a little bit, in terms of retrying requests speculatively, and that type of thing. But if you have that framing of like it’s a very large, cheap storage, with variable latency characteristics, if you rework your application to think about how it would make it work on top of that, then you’ve got the right framing.
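Retrying requests speculatively, as mentioned above, could look something like the sketch below. It is an illustration only: `fake_get` is a hypothetical stand-in for an object-store read with unpredictable latency, and the 30 millisecond hedge threshold is an arbitrary assumption, not a number from the conversation.

```python
import concurrent.futures as cf
import random
import time


def fake_get(key):
    """Hypothetical object-store read with unpredictable latency."""
    time.sleep(random.choice([0.01, 0.05, 0.2]))
    return f"bytes-of-{key}"


def hedged_get(pool, fetch, key, hedge_after=0.03):
    """Issue a read; if it is slow, race a second identical read against it."""
    first = pool.submit(fetch, key)
    done, _ = cf.wait([first], timeout=hedge_after)
    if done:
        return first.result()
    # The first attempt is taking longer than expected, so send a duplicate
    # request and take whichever of the two finishes first.
    second = pool.submit(fetch, key)
    done, _ = cf.wait([first, second], return_when=cf.FIRST_COMPLETED)
    return next(iter(done)).result()


with cf.ThreadPoolExecutor(max_workers=8) as pool:
    data = hedged_get(pool, fake_get, "segment-0001")
```

The duplicate read costs an extra request, which is the price paid for cutting off the slow tail when there is no discernible reason why one request took 10 milliseconds and another took 50.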

The reason why it’s so challenging for people today is that they think about – they spend all their time thinking about the fastest storage that’s available today. They spend a lot of time thinking about persistent memory, or NVMe SSDs, stuff like that. They think about that first when they’re designing their application, like “How do I get the lowest possible latency?” Making your application work on that first, and then trying to add object storage on top is a very popular thing that people try to do. They always call it tiered storage. Basically, every system that has that calls it tiered storage… And it’s very hard to match the characteristics of those two things together going top-down, whereas going bottom-up, the other direction, starting with object storage and then layering stuff on top… It seems like it should be the same, but it’s not. You don’t end up making the same design decisions along the way. And that has a big influence on the overall characteristics of the system.

I can explain specifically what that means for Kafka, in terms of tiered storage. So they were thinking about disks first, local NVMe SSDs. That’s usually what people are running it on these days in the cloud. The way that that influences the design is that – the way that they implement tiered storage is they just take those log files on disk, that have all the records in them, and they copy them over to object storage. That solves a cost problem. If you never want to read that data again, you’re good. Like, that’s cool. It’s much cheaper now. But when you want to come back and read it - let’s say that you wanted to read all of it, all of the data you’ve ever tiered off into storage… The way that that works in the open source project is that you’ll end up reading all of that data you’re going to have to pull back through one of the brokers. There’s no way for you to parallelize that processing, because they just view it as this bunch of log files that I’ve put into object storage.

And with WarpStream we’ve kind of decoupled the idea of the local storage being owned by one machine, to now there’s a metadata layer that says “These are all the files that exist”, and then we have all these stateless agent things that can actually pull the data out of object storage for you. You can scale up and down as quickly as you need to, to read all that data out of object storage. So if you wanted to pull it all out, you can scale up temporarily for the hour that you wanted to run some big batch job, and then scale back down at the end.

[00:35:55.27] With the open source tiered storage in Kafka, that’s a lot harder, because they started with the local disk part… Which makes sense, because that’s what existed before. It just means that adding stuff on afterwards, you’re usually – the tiered storage, lower layers of storage is like a secondary concern. It doesn’t get as much love and attention as the primary storage gets, and you end up with a very different system at the end.

For us laymen, can you describe how the brokers work, and contrast that again with these stateless agents? I understand that you can scale the agents horizontally, because they are stateless, versus a broker, which seems to have kind of a lock on some data… But what do Kafka brokers do exactly?

Yeah, so Kafka has – let’s start with topics. Topics are basically just a name for mapping consumers and producers together. They agree on the name of a topic for where they’re going to send the data to, and where they’re going to consume the data from… And within a topic, there are partitions, and a partition is basically just a shard to make that topic scalable. There are a lot of different ways to decide which shard you’re going to write the data to, but let’s just say for now you do it by hashing the key of the message, and then routing it to the shard based on the hash of that key. So if you have the record with the same key, you’ll wind up going to that same broker every time, that owns that partition.
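The key-hashing approach described above can be sketched as follows. The `partition_for` helper is hypothetical, and `zlib.crc32` is only a stand-in so the example runs with the standard library; Kafka's default partitioner in the Java client uses a murmur2 hash instead.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    # crc32 is a stand-in hash; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key) % num_partitions


records = [(b"user-1", "add"), (b"user-2", "add"), (b"user-1", "remove")]

# Route each record to a partition; records that share a key land in the
# same partition and keep their relative order there.
by_partition = {}
for key, value in records:
    by_partition.setdefault(partition_for(key, 6), []).append((key, value))
```

Because the hash is deterministic, every record with the key `user-1` goes to the same partition, which is what lets a single partition preserve the order of events for that key.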

So that’s how it works in the open source product. The brokers own some set of partitions from a leadership perspective, and then there’s also replicas of that, that are just copying the data… And it’s just other brokers that are the replicas for those partitions.

So the broker will write that data that it receives from a client, a producer client down to the local disk, and replicate it out to the followers, and then a consumer can come along, and read either from a replica or the leader the data that producer wrote. But they’re all coordinating on essentially - one of those brokers owns the partition specifically that I’m interested in reading. So that’s how it works in the open source product. And in WarpStream we’ve decoupled the idea of ownership of a partition from the broker itself.

We have a metadata store that runs inside our control plane, that has a mapping of “Here are all the files in object storage, and within those files, the data for this partition for this offset is here.” It’s in some section of a file in object storage.

So any of our agents, which are like the stateless broker that speaks the Kafka protocol to your clients - any one of those agents can consult the metadata store and ask “I want to read this topic partition at offset X. Where do I have to go in object storage?”, and potentially multiple places in object storage. “Where do I have to go in object storage to read that data?” But because the metadata store inside the control plane is handling the ordering aspect of it, essentially, you get the same guarantees as Kafka, in terms of “I have this message with this key, that’s routed to this topic partition, and I want them to stay in the same order, because I’m writing them in a specific order.” That ordering part is enforced by the metadata store inside the control plane, but the data plane part of actually moving all of those messages around is only inside the agents and object storage. So it lets you do that thing that I was saying before, where if you want to scale up and down, it’s very easy to do that, because you don’t have to rebalance those partitions, which take up space on the local disk, amongst the brokers in order to facilitate that.
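The split between a metadata store and stateless agents can be illustrated with a small sketch. This is not WarpStream's actual schema or API: the `Extent` record, the `MetadataStore` class, and the file names are all hypothetical. It only shows the idea that one component assigns offsets and remembers which section of which file holds them, so any agent can ask where to read.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Extent:
    """Where a run of offsets for one topic-partition lives (hypothetical)."""
    base_offset: int   # first offset stored in this extent
    count: int         # number of consecutive offsets it holds
    object_key: str    # file in object storage
    byte_start: int    # where the section starts inside that file
    byte_len: int      # length of the section in bytes


class MetadataStore:
    def __init__(self):
        self.index = {}  # (topic, partition) -> extents sorted by base_offset

    def commit(self, topic, partition, object_key, byte_start, byte_len, count):
        """Assign the next offsets; the commit order defines message order."""
        extents = self.index.setdefault((topic, partition), [])
        base = extents[-1].base_offset + extents[-1].count if extents else 0
        extents.append(Extent(base, count, object_key, byte_start, byte_len))
        return base

    def locate(self, topic, partition, offset):
        """Tell any agent which file section holds the requested offset."""
        extents = self.index.get((topic, partition), [])
        i = bisect.bisect_right([e.base_offset for e in extents], offset) - 1
        if i < 0 or offset >= extents[i].base_offset + extents[i].count:
            return None
        return extents[i]


store = MetadataStore()
store.commit("logs", 0, "files/a.bin", 0, 4096, count=100)
store.commit("logs", 0, "files/b.bin", 512, 2048, count=50)
hit = store.locate("logs", 0, 120)
```

Here `hit` points at a section of `files/b.bin`, and nothing in the lookup depends on which agent asked. That is why adding or removing agents in this sketch needs no data rebalancing: the bytes never lived on an agent's local disk in the first place.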

So you’re reading metadata, versus reading the real data, basically, and that’s what makes it faster.

In terms of being faster, it’s faster at the fact that there is no rebalancing that happens, because the data is always just in object storage somewhere. You don’t have to do any rebalancing for it. That part of it is faster. There’s obviously a trade-off when you do this, in that the latency of writing to object storage is higher than writing to the local disk. So if you want your data to be durable, you have to wait for the data to be written to object storage first.

[00:40:17.21] So that’s the primary trade-off somebody that’s using WarpStream would be making, is that they’re comfortable with around 500 milliseconds at the P99 of latency to write data to the system, and then the end-to-end latency of a producer sends data and then it’s consumed by a consumer is somewhere between one to one and a half seconds, again, at the P99.

What percentage of the Kafka population does that cut out? Because it seems like many of them are highly real-time-oriented.

So it’s interesting that you use that word real-time, because we’ve talked to a ton of different Kafka users, and when you ask them “What is your end-to-end latency of your system today?”, a lot of them don’t know the answer. They think that they know the answer –

They say “Well, it’s real-time.”

Yeah. They’re either not measuring it, or they’re measuring it in a weird and incorrect way. There’s a lot of different ways that that can happen, but typically, the way that we’ve experienced it is that if you ask an executive at the company that uses Kafka heavily, ask them “Is your application latency-sensitive?”, they’ll say “Of course. We’re an extremely high-performance organization. We love high-performance systems. Obviously, the end-to-end latency couldn’t be anything more than 50 milliseconds. That would be crazy if it were anything more than that.” And then you make it a little bit further down the chain in the organization, you ask the application developer or the SRE who’s actually on call for the thing, or wrote the code, you ask them and they’re like “I don’t know… I hope that it’s fast, but I’m not really sure.” Or you ask them and you get an explicit answer that’s very different than the answer that the executive gave you. And realistically, there are a few applications that we come across, that do need that little latency… And the primary example of that – I mean, there’s a lot of this kind of application out there, in different domains, but the good example that demonstrates it is credit card fraud detection.

There are people out in the real world using credit cards, and you want to make a determination about whether a [unintelligible 00:42:28.16] is fraudulent at the point of time that they’re swiping the card. So that is like necessarily a real-time thing.

There’s a user who’s waiting out in the real world, and if Kafka is in the critical path, especially multiple hops through Kafka in the critical path, then a system that has higher latency, like WarpStream, would be harder to adopt. And there are other applications that meet this criteria. But basically, if the user is in the critical path of the request, then WarpStream is harder to adopt, like, in the abstract. Obviously, some specific applications might be okay with higher latency than others, but that’s the one that we see from time to time. When you strip all those out though, the things that you have left are the more analytical type applications. Like the example I was talking about before, moving application logs around.

Developers are pretty used to some delay between the log print statement running inside their application and being searchable inside wherever they’re consuming their logs from. So the additional one-second of latency there is typically a non-issue. And the reason why that’s useful for us as a company at WarpStream is that those workloads are typically really high-volume, and they cost the user a lot of money.

So our solution being more cost-effective really resonates with them, because – usually, there’s also a curve of the more data you’re generating, the less valuable that data is per byte, so there’s budget pressure to get the efficiency to process that data. You want to increase the efficiency of processing that data, and Kafka sticks out like a sore thumb in terms of that processing cost.

[00:44:19.22] So we can come in and say “Hey, because of the way the cloud providers don’t charge you for bandwidth between VMs and object storage, and we store all the data in object storage, that means you’re going to save this many hundreds of thousands of dollars a year on sending the dumb application logs that you’re generating into the eventual downstream storage”, that makes a lot of sense to them.

So while we understand that we can’t hit every possible application in the market with the shape that WarpStream is today, we’re pretty happy with the set of use cases and workloads that we can target, because there are just so many of them out there, and they happen to align with the budget-sensitive ones.

Those reads and writes - can you restate those? Did you say writes are at most in P99 500 milliseconds, and reads are one to two seconds in P99? Is that correct?

So the writes are around 500 milliseconds at the P99. That’s tunable. By default, we have the agent buffer the records that your clients are sending in memory for 250 milliseconds before writing them to object storage, so that you just write fewer files to object storage, which is the primary determinant of the cost of the object storage component of the system, if you’re not retaining the data for very long. But you can shrink that down all the way to 50 milliseconds, in which case the produce latency at that point would be probably ballpark 300 milliseconds at the P99.
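To make the buffering trade-off concrete, here is a small Python sketch of the arithmetic. It assumes, purely for illustration, that each agent writes exactly one file per buffer interval and receives traffic continuously; the function name and that one-file-per-flush model are not from the conversation, and no pricing is assumed.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds


def files_per_day(buffer_ms: int, agents: int = 1) -> int:
    """Files written per day if every agent flushes once per buffer interval.

    Assumption: one file per flush per agent, with traffic arriving all day.
    """
    return (MS_PER_DAY // buffer_ms) * agents


default_files = files_per_day(250)  # the 250 ms default mentioned above
low_latency_files = files_per_day(50)  # the 50 ms lower bound mentioned above

print(default_files, low_latency_files, low_latency_files // default_files)
```

Under these assumptions, shrinking the buffer from 250 ms to 50 ms means five times as many files written per agent, which is why lower produce latency costs more on the object storage side.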

I said end-to-end instead of read, because that’s typically what people talk about in Kafka terms, because they wanna know like, “A producer sends a message. How long does it take until a consumer can consume that message successfully?” So that’s what I mean by end-to-end. And that is one to one and a half seconds at the P99 for most of our users.

So latency aside, what are the other downsides of this approach?

So there really aren’t that many downsides other than the latency. The latency is what actually enables all of the benefits of WarpStream, basically. The object storage is what enables a lot of the benefits. We have a couple of interesting features that are based on the fact that all the data is in object storage. One of them we call agent groups. And agent groups let you take one logical cluster and split it up physically amongst a bunch of different domains. They could be like different VPCs within the same cloud account, it could be different cloud accounts… They could be different cloud accounts, or same cloud account but across regions, all by just sharing the IAM role for the object storage bucket between those different accounts.

The alternative to this with open source Kafka is like setting up something crazy, like VPC peering, which is extremely hard to do, and your security team will probably not be super-happy if you try to ask them to peer a bunch of VPCs together, because it introduces more security risks…

So we have customers in production using this feature today, where the example that we usually give is there’s a games company that splits their production games account, where all the game servers run, from the analytics account, where they do – basically, they run a bunch of Flink jobs to process the data generated from the production account, and they run agents that just do produce, so just writes, they run that in the production account, and they run agents that just do fetch inside their analytics account. So they’ve kind of flexed the cluster across those two different environments, and all they had to do to set that up was share the IAM role on the object storage bucket, instead of peering the VPCs together.

[00:48:10.00] So the fact that everything is in object storage opens up a ton of new possibilities, actually. Basically, the only downside of WarpStream is the fact that the latency is higher. Now, obviously, we’re a new company; the product does not have the 13-year maturity of the open source Kafka project, but just to speak of the operational stuff and the cost stuff, WarpStream is a huge win on both of those.

Does it have any of the hosting flexibility? I suppose you’re putting everything in object storage anyways, so there’s probably people running their own object storage clusters… But that might be crazy. I don’t know.

Yeah, so there are a number of projects and products out there that you can buy to give you an object storage interface in essentially any environment. There is the open source project MinIO, and then basically every storage vendor on the market will sell you something with an S3-compatible interface if you’re running in a data center environment. And because we work with S3, GCS and Azure Blob Storage, we can essentially connect to anything. If you had an NFS server, we can even make it work on that, too. We don’t have anybody in production doing that, and I wouldn’t recommend it. I would recommend using the object storage interfaces. But we’re pretty flexible in terms of the deployment topology.

What about R2? Would you have even more savings, or would that not matter because nothing’s going outbound from the virtual network there?

So I think it would depend on where you’re running the compute. If you were storing the data in R2, but you were running compute in AWS, you would get charged a lot of internet transfer as a part of that. If you’re running your compute in one of the providers that has free peering with R2, then yeah, you would get a nice savings there, and you’d be able to move data reliably across let’s say multiple regions of whatever providers have peered for free with R2 using WarpStream.

I was thinking about getting started really, or just trying it out. I do like your curl demo, I did play with it. I had no idea what I was doing, but it was cool. The command is on your home page. It’s curl and a URL to an install script. I did not review that script prior to running it. I just trusted you.

You’re admitting that to everybody?

Well, you know, it was a VM on Proxmox, so I didn’t care, that I could just throw away. It wasn’t my own machine. I was safe.

That’s a good layer.

It did spin up, and then it gives you a URL you can go to to log in, and next thing you know you’re looking at a cluster. So I liked that aspect about it. Whose idea was it to come up with that demo? I mean it’s very hacker, it’s very developer. No pain whatsoever. If you’ve got a VM or you want to spin up a VM, or you have Proxmox, then you could do it safely, like I’ve done… Or you can spin up a droplet on DigitalOcean, or pick your own – if you’ve got a VPC, whatever. You could do it in a more safe manner and have some fun. What do you expect people to do with that? What are people saying about that, and whose idea was it to produce that demo? Because this is very hacker. I like it.

I think the demo was Richie’s idea. It basically just starts up a producer and a consumer, so that you can just see something happening in the console. It provides you a link. Like “If you would have run that locally on your laptop, we would have opened the link automatically in your browser for you.”

Mm-hm. It said it had a problem and I had to click it, so… Yeah.

Yeah, so we even designed the little niceties like that. But the idea behind the demo is basically just to show people that it does something. Kafka is not an exciting technology to demo, so we’re kind of limited there. It’s even more boring than doing a demo for a relational database, or something.

[00:52:10.05] But there is another mode that you can run that’s called Playground. And Playground will let you start a cluster that doesn’t have like a fake producer and consumer running on it as a demo. It just starts a cluster for you temporarily, and makes an account that expires in 24 hours, and you can take that Playground link and you can start multiple nodes, say one on my laptop and one on yours, and point it at R2, and we can have a cluster that spans our two laptops together.

My co-founder and I did that before and posted a video of it on Twitter, or something like that. But because the data is all on object storage, and the compute part is stateless, it’s actually – it’s not that complicated to do. It’s basically the thing we were talking about a second ago with R2, just connecting two laptops instead of two different regions, or something like that.

So to get to the Playground version of it, is it like –playground? How do I get there?

Yeah, so there’s three different commands primarily that people would run. There’s WarpStream demo, there’s WarpStream Playground, and then there’s WarpStream Agent. The agent is the one you would run for production, to start an agent. And the Playground one is how you start a Playground. I think the Playground even gives you – it spits out in the output the command that you would copy and send to somebody else to start it in another terminal. But it’s been a long time since I’ve played with it, so I may be remembering it wrong.

The reason why people like the demo - or I should say the Playground - is that it makes it easy if you’re a developer to just start a cluster and use it for local development, instead of having to run… Like, if you use WarpStream in production, and you want to use the same thing in your development environment just to ensure consistency, you can use Playground mode to create a cluster. And yeah, it’ll just go away when you stop using it, and there’s no cost.

Yeah, I dig it. I kind of wish there was more documentation; if there is, then I would go find it, or maybe a video, or something like that… Because that’s kind of cool. I like this demo, because it’s for those who just want to tinker, without having to spin it up in EC2, or just - whatever; you know, go the extra mile. I love that you can just sort of do this on your own. But I had no idea the Playground version was there, or the agent version was there to go a little further. And there’s some room that you can make some content around that to give people more guidance, and you should do that.

Yeah, totally. The Playground and the demo people have found a lot of joy in, because they’re just cool. We also have a serverless version of the product, that basically just gives you a URL that you can connect to over the internet, to fulfill a similar purpose, basically, for people who want to try it out without actually doing anything locally on their machine. I think we give new accounts like 400 dollars of credit when they sign up, so you can do a lot with that if you just want to play around without actually starting the infrastructure.

[00:55:24.22] Yeah. And I guess while I’m on your homepage, perusing just under this demo that is so cool, there is a mention of plug and play. Part of your angst, I suppose, to get to where you’re at was “Let’s rethink what this meant like in a modern time”, which is what you’ve done. But then also to be just swap-out. So one thing it says is there’s no need to rewrite your application to use a proprietary SDK. You just literally change a URL. Was that – how did you get there in terms of the… It’s fine to not want to contribute to Kafka and make your own way. And I’m totally cool with that. And WarpStream reinventing or rethinking this model. But how do you get to this point where you’re like “Let’s make this as frictionless as possible”? To focus on the DX of what it might actually be like to say, okay, well, if this is (like Jerod said earlier) that subset of folks that maybe they’re not doing credit card transactions and fraud detection where that needs to be literally real time, where the latency cannot be absorbed; in a scenario where it can be absorbed, and it’s a large population of Kafka users, to say “Listen, we’re here, and this is how easy it is to swap.” How did you get to that design, that idea?

We got there by just talking to people, basically. The number of developers out there who are using Kafka, it’s really high, and we talked to a lot of them. And when we asked them basically “What do you not like about Kafka?”, they would give us a bunch of different answers. But when we would ask them “If we could fix those problems for you, would you want to do that?” And it would involve, essentially, rewriting large parts of your application… That’s a non-starter for people.

And there are a bunch of other things out there in the world that integrate with Kafka, like Spark, and Flink, and there’s a bazillion open source tools out there that integrate with Kafka. We have no influence on any of those things either, really… So it was kind of a choice that was forced upon us. There’s really no way – Kafka has so much momentum behind it that it’s pretty much impossible to get broad adoption of something that would be a replacement for it without having the exact same wire protocol, so you can use the exact same clients and stuff like that.

It’s a lot of work to maintain that compatibility. Thankfully, a lot of that work is frontloaded. It’s just, you do it once, and Kafka is not a particularly fast-moving open source project, so they’re not changing the protocol every day. Backwards compatibility is very good with Kafka, so thankfully it was mostly a one-time cost… But being compatible has opened up a lot of opportunities, even just for basic stuff for the company, like being able to do co-marketing with other vendors of products that are compatible with Kafka. If we weren’t compatible with Kafka, we wouldn’t be able to do that. And a lot of the open source tools that we would want to integrate with, let’s say the OpenTelemetry Collector, or Vector, these kind of observability agent tools - they all can write data to Kafka, and we inherit that benefit right out of the box. So it’s been super-important for us basically to have that compatibility.
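To show what “just change a URL” can look like in practice, here is a minimal Python sketch that builds a Kafka client configuration. The `bootstrap.servers`, `client.id`, and `acks` keys are standard Kafka client settings; the helper name, the hostnames, and the client ID are hypothetical, and the sketch assumes the rest of the application is unchanged because the wire protocol is the same.

```python
def client_config(bootstrap_servers: str) -> dict:
    """Build a Kafka client config; only the bootstrap address varies.

    The returned dict uses standard Kafka client property names and could be
    handed to any Kafka client library that accepts them.
    """
    return {
        "bootstrap.servers": bootstrap_servers,
        "client.id": "log-shipper",  # hypothetical application name
        "acks": "all",
    }


# Hypothetical hostnames, for illustration only.
before = client_config("kafka-1.internal.example:9092")
after = client_config("warpstream-agent.internal.example:9092")

# The only setting that differs between the two deployments.
changed = {key for key in before if before[key] != after[key]}
print(changed)
```

The design point is that nothing above the connection address has to know which system is on the other end, which is also what makes it practical to switch back.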

And do you think that – I know you’re sort of youngish, but do you think that… I suppose, how are you winning? Are you winning the market? That’s what I’m trying to get to, is are you truly absorbing a lot of the Kafka user base? Is there a major demand for WarpStream? What’s the state of product-market fit, and are you winning?

Yeah, so we have a number of large use cases in production today. I can’t talk about very many of them, unfortunately, but there are WarpStream clusters out in the world, processing multiple gigabytes a second of traffic. And not just like one of them. There’s a decent number of them at this point. And where we’re having success in the market is basically the large open source users who are – they feel like the open source product is a bit too challenging for them to run, and there’s budget pressure all over the industry today, especially in the corners that we’re interested in, like in the observability and security and analytics side, there’s a lot of budget pressure. So we’re a pretty natural fit for those folks who are both tired of running the open source project, and they’re getting budget pressure to decrease their cost. We’re having a lot of success there.

[01:00:09.18] What about greenfield? Is there anybody that’s like “Okay, we need to adopt Kafka or something like it, but what is out there, before we go and write a line of code, or flesh out our infrastructure model, or make any plans?” What about those that are not migrating? What’s the path, I suppose? What’s the inbound of those folks, and what’s the path to the DX? Because one of the things you mentioned is that you solve a few problems. You solve cloud economics, you solve operational overhead, and one thing that you mentioned, at least the article that was from last year, was a major problem with Kafka, which was developer user experience. And that’s what I’m trying to get to there, is like, those who are coming on green, brand new, what does that user experience like, and what’s the path like for them?

Yeah, so I think that for greenfield projects there’s two different branches of those. There’s greenfield projects that are only greenfield in the sense that they’re trying to adopt Kafka for some goal. They’re not greenfield – like, the application didn’t exist before. There’s that aspect of it, where they’re just new users of Kafka. And then there are truly greenfield projects, where the project itself is new, and also the choice to choose Kafka is new. And usually, those projects don’t have a super-high volume of data. It’s the existing initiatives or applications within a company that process a lot of data, but are not using Kafka for cost reasons, where we are having more success.

There’s a product that I would love to talk about, that won’t quite be public by the time this episode is posted, but they’re in that first category, where it’s a large existing workload, but they were not using Kafka for a bunch of different reasons, cost being one of them. And they’re now a big WarpStream customer because they saw that there are benefits to using Kafka for their application, but they just couldn’t use the open-source project for cost reasons, and now essentially they can. There’s a lot of cool stuff that they can do now, that they couldn’t do before, that Kafka enabled them to do. And WarpStream is their Kafka-compatible product of choice for those cost reasons, and they’re starting to get some benefits from it right now.

So I guess the obvious question to me at this point is Kafka is not dead. It is alive. It is open source. To my knowledge, I don’t think it is. WarpStream is not open source. Was there a conversation about licensing? Was there a conversation about being a commercial open source company? …just to follow in the footsteps of the predecessors that you at least from a conceptual standpoint copied and improved upon. You were led by here. You stood on the shoulder of giants. Where are you at with that? What have you thought about in terms of licensing and open source, and what’s y’all stance on open source as your core or not?

Yeah. So we had a lot of back and forth initially, when we were thinking about this specific issue. And the conclusion that we came to is that in order to be successful commercially, we cannot release our product as open source. And we did not want to pull the kind of bait and switch intellectual dishonesty move of the way a lot of commercial open source products have evolved in the last five years, in terms of either re-licensing, or changing the focus of the project drastically to benefit the primary commercial backer.

[01:04:02.29] We just didn’t think that it was – we’re providing a lot of value by providing a solution that is dramatically lower cost, and also compatible with the existing ecosystem. And the way that that works in practice means that you can switch away from WarpStream, because you’re not locked into it from an application perspective or a protocol perspective. So we’re not locking you into something proprietary from an interface perspective, so it’s actually relatively easy to switch away from WarpStream, if you’ve decided to in the future, because you didn’t like something that we did. But we’re hopeful that the fact that we provide something that’s dramatically lower cost and easier to use means that you won’t switch away, and you’ll continue to have the best of both worlds, so to speak, where there is an open source thing out there that will – obviously, it’s going to continue to exist, because it has a ton of users. But if you want to use our product to save money, and have something easier to use, you can as well. And we will be able to continue to invest in making that product better and better over time, because we are not stuck in these kind of middle of the road outcome issues that a lot of commercial open source companies have, where they’re forced a few years down the line to cash in all of their brand goodwill on a re-license in order to gain that commercial success that they wanted. We’re hoping to be able to – by sticking to this model, we’re hopeful that we’ll be able to be a good citizen of the Kafka ecosystem, in terms of making a product that’s not incompatible and proprietary and steering everybody away. We do put a lot of effort into testing clients. We find bugs in Kafka clients that are typically open source and make improvements there. But the core part of the product is not going to be open source.

What’s interesting about those re-licenses is that they all were commercially successful companies, even at the time of the re-license. They had arrived. And at a certain size and scale, it seems that the growth curve has to continue to go vertical to satisfy investors, to satisfy public demand in the case of Hashi… I don’t actually know the state of Redis Labs or the commercial success or not of Redis, but many of them were large, successful commercial companies, bigger than most companies ever get before they actually went ahead and did that “not cool rug pull”. But I wonder if the pressure’s on them because it’s other people’s money; similar in your situation - you have VC behind you… And I’m just curious about that decision from you guys’ perspective. Because you’re a small team, probably well-funded in terms of you guys are highly successful software people, so you’re probably making good money…

[unintelligible 01:07:08.25]

Yeah. So why would you not bootstrap? Why not bootstrap and then not have any of that VC pressure that you currently have?

That’s a really good question, and to take a step back from that question for a second, talking about the commercial open source stuff… This is obviously a little bit inside baseball, but as a part of going through that decision process, we talked to the founders of a lot of commercial open source companies, and we asked them “Let’s say you were starting our company today. What would you do?” And without hesitation, the answer we got was “I would not start it as a commercial open source company today.”

[01:07:53.19] And there are a lot of different reasons that they gave for that, and I can’t really give some of those reasons without potentially identifying who those people are, and I don’t want to do that… But the challenges of a commercial open source company today, with the – it’s not even just the hyperscaler cloud providers anymore taking your stuff and running it. That’s obviously a concern, but you can get around that with – like, the AGPL does a decent job of preventing some flavors of that.

The other issue is just like the competition within the category that they’re building their product in is extremely high, and having your source code out there in the wild, and letting everybody know your secrets, essentially, about how you made your product better - you lose a lot of the juice behind why you have these huge staffs of developers working on interesting things. It’s not to say you can’t protect that other ways either, with software patents and stuff like that, but people don’t – the appetite for software patents… It would do a lot of brand reputation damage, I think, if these commercial open source companies created a bunch of software patents and started enforcing them against each other, for example. It’s a very challenging situation today.

A lot of the companies that you might view as successful commercial open source projects, they might be successful in the iteration that they exist in today, or yesterday in the case of all these re-licenses, where they have good adoption in the developer community, and they might have good success in the VC-funded startup segment of the world… But there is an inevitable push to go upmarket, and to go after larger and larger customers, because it’s effectively the only way to support growth. The growth of what you can achieve within the small – if your customers are all small startups, even medium-sized startups, and developers playing around in their personal capacity or stuff like that, the revenue opportunity is just really small, unfortunately, for a lot of these businesses. It’s much easier to sell a million dollar a year contract to an enterprise than it is to get a million dollars of revenue out of a bunch of small and medium-sized businesses.

So the temptation when the growth starts to slow down is “I need to go do that now.” Like, that’s the first thing your investors are going to tell you, is you need to go upmarket and get enterprise customers. If the product that you’re selling them is support or a couple of features on top of an open source project, your ability to exert pricing pressure on that enterprise buyer, to get them to pay a higher price, or to get them to pay at all in the case of a lot of these open source projects, where they spent so much time making it good that the enterprise can just hire one person to maintain it internally, and just move on with their life and run the open source forever, and maybe pay you a peanuts support contract, essentially; not actually enough to support the business. It’s just really hard.

I completely understand where you’re coming from, and it might’ve felt as if these companies were successful from the outside, and some of them definitely were… But just, there is that inevitable pressure to keep the growth rate up, and the only way to do that is to go upmarket. And when you’re going upmarket, you need to provide something that looks valuable. And if your project is open source and the alternative is hiring a developer to maintain it internally, you kind of have a cap on how much you can charge.

It’s the same thing if you’re offering a cloud version of an open source project, for example. The premium someone will pay for your cloud version - it may be lower than you expect if they can self-host, because they’re always looking at that… They’re looking at both sides of the coin. “How much will it cost me to self-host this, versus how much does it cost to use your cloud-hosted version?” And that calculus does not always come out in your favor as a vendor. You may have to charge significantly more to make the numbers work on your side than what they think they can run it for internally. It’s really challenging stuff.

[01:12:18.08] We wanted to provide the best product possible, with the best product experience possible, and we didn’t feel like the shape of an open source commercial company was the right way to do it without having a lot of these distractions about the things that I’m talking about right now come up along the way. And we didn’t feel like it would be right to do that, the bait and switch thing that people are doing these days. We wanted to be honest, basically, from day one.

That makes sense, to some degree. I don’t fully agree with all of your sentiment, although that’s a very deep and lengthy conversation, teetering on just not fitting this conversation necessarily… But what I can appreciate, given that I don’t fully agree with all of your reasons, the one reason I think that you’ve done well, or I suppose the most positive thing is you’ve made it easy to get in and get out. So if for some reason WarpStream is of great benefit, and let’s just say a year down the road somebody does WarpNotStream, and it’s commercially open source, and they eat your lunch, because they decide to be open source first, and they can get into that just as easily as they can get out of you, then that’s a whole different story. I’m not suggesting that’s going to happen, but it’s possible.

It’s totally possible. Yeah. And you’re exactly right. If one of our competitors came up with a better implementation tomorrow, and it was –

The exact implementation. They can literally copy everything you do, and the world would be okay with that because they made it open source. That’s a version, or at least a subset of a conversation we had at length on this podcast a few weeks back with JJ, Joseph Jacks. He was like “Yeah, I’m totally cool with somebody, a founder going out there and literally copying X, and saying this is now X as open source.” He was totally cool with that. I’m not saying that makes sense completely to me too, but the world now believes that’s an okay thing. And it’s an okay thing because at the core it is meant to be an open source commons good.

Yeah. I would not have really ill will towards someone who decided to do that.

I would… I would be like “Come on, man… Don’t do that.”

Well, someone’s going to do it. I mean, as you guys have success – now, whether or not they can actually pull it off is the question, right? But there will be, at some point, as WarpStream continues to grow, a Hacker News number one story, “X is like WarpStream, only it’s open source and self-hosted.” And it’ll get 500 to 1,000 – and maybe it gets adoption, maybe it doesn’t. Maybe by then you guys are so far ahead it doesn’t matter… There’s tons of what ifs. But it will happen, from somewhere in the world, if you’re successful.

And the reason why it doesn’t bother me so much, basically, is the portion of the Kafka market - because we have commercial competitors, obviously; the portion of the Kafka market that has been commercialized – let’s say somebody is paying a licensing fee or some other fee to use the product, not just hiring somebody to run it for them… The portion of that market that’s been commercialized is very small. So there’s so much green field market out there for us to commercialize, along with this constant, ever-increasing trend of things becoming more real time. And these other tailwinds of more observability and security data being generated in the world… This market is just going to be so big in the future that I think it’s unlikely to have a winner-takes-all dynamic, similar to the way that there are multiple large public cloud hyperscalers that exist, and are very profitable… There’s just so much of this market out there that we’re not super-concerned about any particular competitor. Even if one were open source, there’s a lot of other dimensions that we would hopefully be better at competing on, that you don’t get out of just the fact that the product is open source, combined with the fact that the market is so huge that we’re pretty happy with our position as it is today.

Break: [01:16:49.00]

So let’s go back to bootstrapping then. It seems like the kind of thing you could bootstrap… I mean, it’s just you and Richie, coding it up on nights and weekends, you know; get it rocking and rolling. Keep all that equity. No one to answer to… You’re going to get customers pretty quick, then you can start hiring based off of your customers… Why that decision to raise?

So the reason why people raise money is – let me put it a different way. The right reason to raise money is that you want to go faster. That’s basically why someone should raise venture capital; it’s they have something that’s working, and they want it to go faster.

My co-founder and I had so much conviction in what we were doing in terms of it being commercially successful that we knew on day one we would be able to go much faster if we raise money. So that’s why we did it. There was never a period of time where we were guessing, like “Oh, do people need this?” It was very obvious to us from day one that we wanted to go as fast as possible. And raising money is the way to do that, because we were able to hire a lot – you know, relative to the two of us, many more people, and pay them very well, and make them happy, and… You know, hiring people that are good at distributed systems stuff is very expensive. And those type of people also really appreciate job security. So being able to have a bunch of cash in the bank, even if we’re not spending it, is very important to those folks. So our internal stakeholders, as employees and founders and stuff, it makes it very comfortable to have that cushion, and it allows us to hire people that will make things go faster.

And then on the complete other side of the coin, if you want to sell products to enterprise buyers, as two people, without having raised any money, it’s going to raise a lot of eyebrows if they want to put that in production as the backbone of their multi-billion dollar business.

That makes a lot of sense.

It’s really hard. Whereas if we can walk into a meeting and say “Hey, we’ve raised roughly 20 million dollars from Greylock and Amplify Partners”, who are our Series A and seed investors respectively, that sidesteps a lot of really awkward conversations about “What’s going to happen to you founders if you get hit by a bus tomorrow, or something?” Obviously, that’ll be very bad for the company, but there is at least somebody else who cares, and would like to continue to hopefully see their investment succeed.

So the dilution stuff is really – obviously, it’s a good point… But you just have to think, are the odds of success higher, and will the eventual outcome be bigger if I raise VC? And if that is true, then I think it’s worth doing. But if you’re in a position where you don’t know if your product is going to be commercially successful, it closes a lot of doors to raise VC. Like, every further round that you raise, it makes it harder and harder to explore different kinds of exit opportunities that you might personally view as a success, but your venture investors may not view as a success.

So it’s definitely a balancing act, but you just have to go into it with your eyes open, and understand what you’re – you have to understand the game you’re playing, basically, and walk into it with your eyes open.

Had you played this game before?

[01:25:43.04] Yes. Very briefly, a long time ago, unsuccessfully, I did, yeah. And in between that and starting WarpStream, my co-founder and I were considering raising money for the thing that we were doing before we joined DataDog, and that’s how we got to know our seed investors at Amplify Partners. And we didn’t have that conviction at the time, to say “Let’s go raise money. This is going to be huge.” In hindsight, we probably would have done very well with that, had we chose to raise VC and remain as an independent thing and all that, instead of joining DataDog. But because we didn’t have that conviction, we took the “exit opportunities” that were available to us at that moment because we hadn’t yet raised money; we were very flexible, so we were able to join DataDog, and it worked out super-well, and we got to meet a bunch of interesting people, and the project we worked on was successful and super-fun, and all that stuff. But because we did have that conviction this time around, and we wanted to go as fast as possible, that’s why we chose to raise money this time around.

I think your reasons are sound. I don’t disagree, and I will not argue.

Good answer, I’ll give it to you.

I will not argue. You know, we check wisdom. While we love open source, I don’t think that you would have had –

I can see how going the route of venture capital, and not going, as you had said, some of the burden of open source in terms of distraction, was your actual word… I can understand that. And that’s your prerogative, right? Bobby Brown is dated in terms of an artist, but –

Nobody knows about [unintelligible 01:27:24.19] anymore.

But it’s my prerogative, it’s still a true phrase. I’m sure that –

Ryan, do you know Bobby Brown?

It’s been a long time since I’ve heard any Bobby Brown, but I do indeed, a little bit.

I grew up on Bobby Brown, so I can’t [unintelligible 01:27:40.12]

“It’s my prerogative”, right?

Yeah, it’s my prerogative.

Yeah, great song.

Sample: [01:27:46.15]

You know, so it’s your prerogative, and it’s Richie/Richard’s – great name, by the way, Silicon Valley. I mean, I had to bring it. He was called Richie, and his name was Richard Hendricks. But he was called Richie by his attorney. I don’t disagree with the reasoning for your direction. I hope it works out for you. I think it seems like it’s going to, but I do agree with what Jerod said, which was there is probably going to be, if you hit critical mass and enough scale, somebody who copies what you’ve done, and simply just says “Okay, literally copy, and now it’s open source”, and they’ll be okay with that. I don’t think that you should operate in a state of fear of that, and make choices because of it, because that’s free market, man; that’s going to happen, you know? But good on you for being able to answer these hard questions. I think you did well on that front. I don’t have any argument, really, that’s all I’ll say.

And that’s only because we spent a lot of time thinking about it, and a lot of time talking to folks who are like day to day building commercial open source businesses, that really brought our perspective to where it is today. And it’s not to say that there are no possible opportunities to start a commercial open source company that would be successful today. There obviously are. It’s just that for our particular market and the strategy that we were pursuing, it just wasn’t going to be – I think I can put it a little bit more crisply. The segment of the market that we’re going after is already price and cost-sensitive. If we offered them the opportunity to run our products for free, the odds that we will be able to charge them almost any money would be pretty low. They’re complete – like, there are other markets out there that have completely different dynamics in this, especially if you’re not trying to provide the low-cost solution.

[01:29:50.20] So I didn’t mean to denigrate commercial open source companies, I was just saying that when we explained our strategy, basically, to these other commercial open source founders, they said “That’s going to be hard. It’s going to be very hard for you. So you should think about it before you choose to go down that path.” And we chose this path because we think it’s most likely to be successful for us, while also - I would be personally very upset if I had to do one of those license change rug pulls. It would make me very sad, because I know it causes a lot of consternation and heartburn for people when those things happen. So we just wanted to be straight up with people from day one.

I also think that you are a particularly easy target for the hyperscalers to re-clone, and host, and offer, because of the nature of what you’re doing.

Yeah. I mean, it’s a general purpose infrastructure building block, and Amazon has [unintelligible 01:30:50.22]

Right. That runs on AWS, and –

Yeah. Amazon has MSK, it has a competing product with WarpStream. So they very directly could just offer a new SKU of MSK, that is the WarpStream one, if it were open source… And that would be very challenging for us.

Ride your coattails. Are there other competitors out there? Are there other people that are putting Kafka on object storage?

Yeah, I mean, there are a number of companies out there that have talked about how they’re doing this. I think the most notable of them out there would probably be Confluent’s announcement of their Freight product. That’s probably the splashiest announcement of any of them, where they’re taking a similar direct to S3 approach as WarpStream does.

And the product isn’t available today for anybody to just go sign up for and do a comparison, but they’ve made an announcement, and I’m sure that’s going to progress more in the future.

I’m sure essentially every one of our competitors, if they haven’t started working on it already, a similar storage engine, they will. So I have no doubts that the cat is out of the bag, so to speak, on the idea.

Yeah, a better way. Well, that does make sense then why you went venture capital, so that you can go fast. And I think that from a visual standpoint you’ve done well; from a brand standpoint, I think your marketing side is pretty, pretty awesome. I mean, there’s obviously always room for improvement, but it’s pretty solid.

I do want to bring up the idea of pricing, because I don’t disagree there either, that there’s large corporations, enterprises, so to speak, Fortune 500s that if you’re not charging them $10,000, $20,000 a year, they’re like “What’s wrong with you? We can’t use you. We literally need to give you a lot of money to trust you.” And that’s just the nature of the beast there. But when you land on your page for pricing out the gate, the TCO, total cost of ownership is – at least the default numbers that are put there, is $2,295 per month.

So you’re not even scaring people away. I mean, you’re literally putting your fist in their face and saying like “It costs a lot, y’all.”

Yeah, but that’s the cheap version. These people are probably used to paying more than that, right Ryan?

Yeah. There’s a little slider that lets you turn on the breakdown mode of the comparison to open source Kafka running in 3 AZs, or 1 AZ, or comparing to AWS MSK… And we put – we didn’t even put a particularly big workload as the default on the pricing calculator. I think it’s a pretty, pretty standard workload. And people are used to looking at big numbers when it comes to running –

100 [unintelligible 01:33:53.01]

[01:33:56.21] Yeah. When they’re used to running Kafka for these kinds of observability and telemetry workloads. They just cost a lot. If you look a little bit further down the pipeline there, if they’re sending the data to Elasticsearch, or Snowflake, or ClickHouse, they’re probably paying significantly more for those things. So Kafka looks cheap in comparison, and then WarpStream looks cheap compared to Kafka. So we’re very open about the fact that our product is designed to be more cost-effective, but we do offer additional - we call them account tiers, basically, where the things that enterprises want from you, the reason why they want to pay you $10,000 a month is they want to be able to file a support ticket and have somebody reply to their support ticket extremely quickly. That’s the thing that they’re basically paying you for. That’s the stuff that doesn’t scale, basically, as you get bigger, or your product gets better. Obviously, you might have fewer support tickets, but you still need humans to be able to respond quickly when somebody does file those support tickets.

So our account tiers for pro and enterprise give customers a support response time SLA that they can count on, that today is backed by the engineering team. If you’re an enterprise customer and you file a priority zero support ticket, which is just like “My production cluster is down. I need help right away”, that pages the engineering on-call rotation, and gets you help as quickly as somebody can respond to PagerDuty. So that’s the type of stuff that people would be paying for basically on top, and that’s how we make enterprises trust us.

Another reason to raise venture capital - you can hire people, so you can have a 24/7 follow-the-sun on-call rotation in order to back those support response time SLAs.

So if you needed five-gigabit write throughput - which I imagine is quite high, but let’s say that you do - 14-day retention, so that’s two weeks retention… Not that much. We’re talking 97 grand per month going to WarpStream, and $1.76 million a month using Kafka? These are numbers that blow my little mind.

Sorry. I didn’t hear the first – your throughput number that you [unintelligible 01:36:22.18]

It was the highest. It was five gigabits.

Five gigabits? Yeah. I mean, obviously, as you get up into these larger and larger – well, first of all, 14 days, pretty long retention for most people for Kafka. Usually because it’s a transitory –

Okay. What’s typical?

I’d say three to seven days. That’s a pretty typical one. And if you’re at these kinds of scales, you’re probably not paying your cloud provider retail price for cross-AZ networking anymore. If Kafka was a big part of your bill, that would be probably one of the items that you would want to negotiate with your cloud provider. So the comparison doesn’t get nearly as rosy if you’ve negotiated some discounts… But the way that you can kind of estimate what those would be is if you switch it from Kafka 3 AZ to Kafka 1 AZ. That will reduce the inter-zone networking dramatically, and turn on the single-zone consumers flag. So the comparison doesn’t look quite as good anymore.

Still 10X.

Still looks pretty good. [laughs]

Yeah. One-day retention. Turn it to one-day retention, and then it goes to 86% savings, versus 60% savings. So it’s still big, but we understand that there are a lot of big Kafka workloads out there, and we’re confident that if we can deliver 75%, 80% savings, they don’t always come out at 90%, like that example does. But if we can deliver 75% to 80% savings, it’s a compelling enough reason for someone to – there’s a little bit of activation energy it takes to get people to do anything. And we’re confident that that 75% to 80% cheaper thing is enough of that activation energy to get people to at least give us a shot.
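As a rough illustration of the arithmetic behind this exchange, here is a minimal sketch that turns two monthly bills into a savings percentage and a cost multiple. The only inputs taken from the conversation are the two figures read aloud from the pricing calculator ($97,000 per month versus $1.76 million per month); the function names are hypothetical, and nothing here reflects how the WarpStream calculator itself is implemented.

```python
def savings_fraction(baseline_monthly: float, alternative_monthly: float) -> float:
    """Fraction of the baseline bill saved by switching to the alternative."""
    if baseline_monthly <= 0:
        raise ValueError("baseline_monthly must be positive")
    return 1.0 - alternative_monthly / baseline_monthly


def ratio_from_savings(savings: float) -> float:
    """Convert a savings fraction into an 'N times cheaper' multiple."""
    if not 0 <= savings < 1:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)


if __name__ == "__main__":
    # Figures as quoted in the conversation, not independently verified.
    kafka_monthly = 1_760_000
    warpstream_monthly = 97_000

    saved = savings_fraction(kafka_monthly, warpstream_monthly)
    print(f"savings: {saved:.1%}")
    print(f"multiple: {ratio_from_savings(saved):.1f}x")

    # The 75% to 80% range mentioned above, expressed as multiples.
    for s in (0.75, 0.80):
        print(f"{s:.0%} savings is about {ratio_from_savings(s):.0f}x cheaper")
```

Read this way, the 75% to 80% savings target corresponds to a bill roughly four to five times smaller, while a 90% saving corresponds to the "10X" figure.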

[01:38:14.00] I want to point out that these are just dollars, too. This is not developer friction, or operational burden, or enhanced developer experience, which are the hallmark of any conversation today with DevTools, right? Like, you could be a 13-year-old tool like Kafka, and get away with – and I have no idea. So no skin in the game. I’ve never used Kafka personally. So if there’s some haters out there, [unintelligible 01:38:36.11] haters I mentioned earlier, don’t hate on me. But there may be some warts and blemishes and burdens within the Kafka ecosystem that just makes it just challenging to operate, to stand up… Obviously there’s costs… We’ve already talked about that, literally at length. But I think there’s something to be said about a modern take, given today’s cloud infrastructure, with some of the dev user experience attributes I’ve seen you already put in place.

So cost is one thing, but then happy developers is retained developers, morale boosts, maybe freedom on weekends, less PagerDuty, less whatever from anybody who might be competing with PagerDuty… That’s a good thing.

Yeah, at WarpStream we know that that’s, like, a very important part of what we do. But it’s always easier to walk into a sales conversation with the hard facts numbers, and not the – a lot of vendors use those exact attributes to describe, to attribute a lot of savings to their product… Which is probably true. But they feel a little bit more wishy-washy compared to the hard facts numbers. So that’s why we lead with those in our pricing calculator.

And obviously, those are still things that we highlight when we’re talking to potential customers, to help them understand the value of the product. But we like to think of that as more like the icing on the cake stuff, and the cost savings is what we’re promising them, basically. Everything else is just icing on the cake.

Icing on the cake. What’s a good next step? I mean, I feel like we’ve really just gone through all of it, Jerod. Do you have anything else?

I think we have. We’ve covered it all, man.

I think we’ve covered every ounce of WarpStream. Ryan, thank you for being patient with our questions, and going through everything, and filling in all the blanks, too. I think you did a great job with this conversation. I’m happy.

I’m impressed… I think there’s a lot of things I can see as quality in you as a person, and then also the thing that you’re trying to do. I think you guys have led with some wisdom.

I like a lot that you went out and talked to folks, rather than just shooting from the hip, so to speak, with your choices, and letting it be opinion-based. You seem to have leaned into the wisdom of those who have come before you with your particular target market, which I think is key to your choices. And so I’m stoked that you were able to answer the questions we asked. So thank you.

Yeah, this has been very fun. I was not expecting to talk about raising money at all during this conversation, but that was something that we spent a lot of time – when you’re building a company, you have to spend a lot of time thinking about strategic stuff that’s not just writing code, and that one was a lot of back and forth with my co-founder and I about how we’re going to do things… And we’re very happy with our direction now, but yeah, it took the input of a lot of people to arrive at this conclusion. And yeah, we’re very thankful for those people that made themselves available for learning more about commercial open source stuff, because we had never really even considered it before… And super-important to learn along the way.

Very cool. Well, WarpStream.com is where you can go. We’ll obviously put links in the show notes… Ryan, thank you. It’s been awesome.

Yeah. Thanks, man.

Thanks.

Changelog

Our transcripts are open source on GitHub. Improvements are welcome. 💚
