Founders Talk – Episode #75

The journey to massive scale and ultra-resilience

featuring Spencer Kimball, CEO & Co-founder of Cockroach Labs


This week Adam talks with Spencer Kimball, CEO and Co-founder of Cockroach Labs, makers of CockroachDB, an open source, cloud-native distributed SQL database. Cockroach Labs recently raised $160 million at a $2 billion valuation. In this episode, Spencer shares his journey in open source, startups and entrepreneurship, and what they’re doing to build CockroachCloud to meet the needs of applications that require massive scale and ultra-resilience.


Sponsors

Linode – Get $100 in free credit to get started on Linode – Linode is our cloud of choice and the home of Changelog.com. Head to linode.com/changelog OR text CHANGELOG to 474747 to get instant access to that $100 in free credit.

Grafana Cloud – Grafana Cloud is our dashboard of choice – Grafana is the open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.

Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com.

LaunchDarkly – Ship fast. Rest easy. Deploy code at any time, even if a feature isn’t ready to be released to your users. Wrap code in feature flags to get the safety to test new features and infrastructure in prod without impacting the wrong end users.

Notes & Links


Transcript




Spencer, let’s begin with building a company, I suppose, out of open source. What has been your experience with that? Obviously, you’ve gotten to a series D, so it’s been pretty much successful… But what are the challenges? What are the ups and downs of that kind of road?

Yeah, still a long ways to go. I think for us building a database and trying to turn that into a company, an open source database, there wasn’t really any other option. There’s been some other examples of closed source databases built in the last ten years, and it’s a pretty difficult uphill slog. There’s some really good open source databases that have existed since the early aughts; MySQL, Postgres are some good examples… And if you are going to play with those as alternatives and you’re not open source, it’s very difficult to get developers’ attention. I will say that the landscape for building an open source company has changed dramatically between 2015, when we started Cockroach, and 2020. I mean, five years doesn’t seem like it should be a lot…

It’s a lot of years. It’s a lot.

These days five years is more like 20 back when I started my career. Things are moving a lot faster. And the shift to the cloud, while it creates huge opportunities, is also changing what open source means to open source users, open source developers, and in particular the companies that might pay for an open source project as sold and supported by a commercial entity like Cockroach Labs. I think if we wanted to get into that, it’s really about consumption model. I’m happy to talk more about that if it’s interesting.

It totally is. What do you mean by consumption model?

Yeah, just think of this sort of generationally… I’m sure this has been true at least partially, for most of the listeners. The older listeners will have a more visceral reaction to the way things were, let’s say pre or late ’90s. If you wanted to use software back in the ’80s, the ’90s, and also the aughts and even today - if it were closed source, it was a pretty difficult procurement road. You had to identify the piece of software that you were interested in, and then contact sales of whatever vendor was selling it; that would go through your procurement department, you had to get all kinds of different [unintelligible 00:03:53.27] just to use this software.

[03:57] Then you got printed manuals shipped to you… There wasn’t much of a community to ask questions of. You could contact support and so forth, but all of these things were just much slower, more tedious; considerably slower - let’s call it an order of magnitude, potentially more - in order to actually use software, put it into production, kick the tires, whatever it is that you wanted to get done as a developer. Open source just dramatically improved that. And I’d say that more so, for example, than having the ideas free, or even the price tag being free. Those are two aspects of free that people talk about with open source. But it’s the speed with which open source technologies could be downloaded, compiled locally, and then run and explored and even put into production; that was such an improvement, and it ultimately led to open source eating the world, as has been said by Andreessen Horowitz. That is the paradigm that existed, I would say, in 2015, when Cockroach was really conceived of as a product and then as a company.

What’s really interesting is that model is itself rapidly being overtaken by a new consumption model, that’s actually even easier than open source was compared to closed source… And that’s to use software as a service. And I did mention that in my description of open source - you could download the source, and compile it, and so forth. Obviously, there were different evolutions within even that model, where you would start to download binaries that were precompiled for a particular system, and then even packages and things that sort of bundled things together. I think that’s the more common thing, let’s say for a Docker container. All of those are innovations, but as a service, this is kind of a next generation, fundamentally, where you don’t have any operations; you don’t have to learn how to become a system administrator, or whatever DevOps requirements are necessary in order to understand and then run a system day one plus. How to monitor, how to understand and how to debug it. Those things are still required, to some degree, but you can obviate a lot of that labor; and especially when you’re a larger entity trying to use software, this means that the investment necessary to use software has decreased accordingly.

Ultimately, it feels like the writing is on the wall for software to be consumed as a service increasingly. The question is how does open source fit into that? Now, I believe in open source. I’ve been doing it now my entire career, since I went to UC Berkeley, and Peter, my co-founder who’s our CTO - he and I started the GIMP back in ’93. That world was magical to me when I first entered it… And I care deeply about open source, especially from the perspective of the free exchange of ideas… But you can sort of squint right now and look at open source in the aughts and the tens… What do people call that decade?

The tens, probably…

Yeah, the tens… You can start to see it - not vanishing, but changing almost unrecognizably. If everything’s consumed as a service, the interest in open source will necessarily wane. I don’t think open source, just because it was a free exchange of ideas, would have succeeded like it has if that’s all it was.

So when the consumption model of open source loses traction in favor of something that’s even better from an average user’s perspective, what will the future hold for open source? That’s an interesting question. I would like to see it preserved, so one of my big interests with Cockroach as we build Cockroach Cloud, which is CockroachDB as a service, is how to preserve the best aspects of open source.

[07:45] There’s a saying - ideas are crap, execution is everything… And I suppose this consumption model as it relates to open source seems like that, like open source is the idea; the freely-exchangeable idea, forkable etc. And the execution might be the service. Because that’s what you say - execution is everything. So if Cockroach Labs didn’t create Cockroach Cloud, then you’ve got Amazon or XYZ cloud provider essentially using your open source, providing it as a service, and potentially getting paid and you not. That’s the troubling model, I suppose, of the business you’re in, and that’s a part of the complexity you’ve been navigating these last - at least several years. Maybe not so much in 2015… You may have been starting to begin then, but it’s really been prevalent since, say, 2017 to now, where that consumption model has drastically changed to where you create the open source, and the community obviously, and then the cloud provider doing the execution, the service, and potentially getting paid for it while you don’t.

Yeah, that’s a really interesting way to look at it… And I think there’s quite a bit of truth to it. I would extend the definition of execution all the way back into the continued investment in building the actual project…

If that doesn’t progress, then the service will start to look a little long in the tooth after a couple of years, and eventually not really be viable. So the execution, unless you want something to sort of die on the vine, has to extend to the investment in the open source project, and that’s really what’s wrong with Amazon’s – people call it strip-mining, but their exploitation of open core companies that are doing all the investment back into the core, the open source core of the project, and then Amazon swoops in and is able to use their platform advantage to really exploit a lot of that value that they’re not reinvesting in…

So I do think that Amazon’s exploitive and predatory tactics around open core companies is just short-term profit for Amazon, and ultimately Amazon’s customers. I don’t really wanna make a big value judgment about what Amazon’s doing. Yeah, it’s true, if I use the word “predatory”, there’s an implied value judgment, but I don’t fault Amazon for doing what they’re doing; I think it makes perfect business sense, and it’s in line with their mission and their value, which is to obsess about their customers.

But nevertheless, it doesn’t leave a lot of space for a company like Cockroach Labs if they were to use Cockroach database and simply repackage it and win the market because so many people use AWS. That’s ultimately gonna cause CockroachDB to cease being improved, because if Cockroach Labs went out of business or we had much less capital to work with because of Amazon reselling our product successfully and sort of forcing us out of the market, obviously the improvements to CockroachDB would slow down to a trickle and maybe stop… And then what happens? I don’t think anyone really benefits from that scenario.

Yeah. You’d mentioned a career in open source… Take us back a little bit. If you wanna start at 2015, that seems pretty shallow, but at least that’s the beginning of Cockroach Labs and what you’re doing with CockroachDB. Maybe take us back to, I suppose, your experience level with open source. You’d mentioned the GIMP. Did you mean the GIMP in terms of the editor the GIMP when you said that?

Yeah, that’s right.

Is that right? So you’re one of the co-founders of that, or one of the co-creators of that?

Co-creators, that’s probably the right term.

Yeah. Peter Mattis and I, I think in 1993… We had really become converts to UNIX and free software and open source, and I had actually bought a used Sun Microsystems – I can’t remember what the name of the actual model was, but it cost me a couple grand… It probably wasn’t as good as even the high-end PCs that were on the market at that point in time; it was like ’92-’93… And yet, it seemed revolutionary to me, coming from a Windows operating system.

It was counter-culture. That era even, that timeframe even… That’s when Bill Gates was still CEO.

That’s right.

He’s a different guy now, in many ways, in terms of publicly, because he seems likable, and soft, whereas then he was ruthless. Everything was – it was a different world.

[12:15] Yeah, I think he had an evolved outlook; or he has an evolved outlook. I’m sure it continues. It’s quite impressive to see that change. Yeah, back then we were very impressed by so many aspects of the free software, open source world, and Unix in particular… And yet the desktop applications seemed to be a decade behind – I mean, they just were not on the same playing field as what you could get on Mac and Windows at the time. And Photoshop was a really good example of that. Both Peter and I were really kind of graphics aficionados back then; maybe we still are, to a certain extent… But we felt like – okay, we love so much about this new operating environment, but we can’t get simple photo manipulation tasks done.

I remember one day – we were using XPaint and XView. Those were the two options really that were available to us. We sat down at one point and just kind of wrote a manifesto… “Hey, if we wrote something that could replace some of the things you use XView for and some of the things you use XPaint for, and make it look something like Photoshop in terms of its capabilities, that would really be the start of something.” I wish we still had that manifesto, because it was pretty peculiar in my recollection of it. I don’t think the GIMP turned out anything like that manifesto… And we weren’t really thinking that this would be a GNU project or anything like that when we started it, but I guess we ended up working on it for most of our undergraduate careers (for four years), and sometimes to the exclusion of our class work, and so forth… But what a learning experience, to really dive head-first into something that became so ambitious.

Yeah. And successful, too. I know many people still today even that use it. Are you involved in the project at all anymore?

No. That’s, again, part of the magic of open source, and part of what makes me so proud of it. In ’97 Peter and I both stopped working on it. We sort of pushed it out of the nest, and it was either gonna learn how to fly on its own, or it was gonna crash and burn and not have a future… Ultimately, the open source community adopted it. There were a bunch of authors that had already been contributing to the GIMP; many of them continued, even after Peter and I left Berkeley and started in industry… But the GIMP continues strong to this day. I download it every time I get a new computer, and I’m extremely grateful it still exists… Because I don’t really do enough photo manipulation work that I wanna download or actually pay for Photoshop, so I’m really excited to use GIMP every time and see how it’s improved.

I might be going a little layer deeper, but you mentioned that you weren’t planning for it to be a GNU thing originally… Is that right? Did I hear you correctly?

That’s right.

But yet its name is based upon GNU… So did the name come first, or the software? Where did it get the name?

It’s a good question. The name came right around when Peter and I saw Pulp Fiction. So you can guess the character it’s named after… I think my sense of humor is honestly pretty childish still, which is part of why Cockroach is called Cockroach… But we were thinking of names for it at that point in time. We’d already made some good progress with it; we still hadn’t named it, but we thought “Okay, we could call this an IMP, like xIMP”, we were thinking. An IMP would be a little familiar, or something like that. And that stood for Image Manipulation Program.

And then because we’d just seen Pulp Fiction, Peter suggested “Oh wow, this is awesome. We’ll just call it the GIMP, and we’ll make it a GNU project.” So that’s really what sealed the fate that it would become part of the free software movement. And we were thinking “Okay, it could be called General”, but then we ultimately called it GNU.

[16:07] That’s a good movie. Gosh, such a good movie.

Yeah, it really was.

Okay, what’s next then? In terms of open source, what was your pathway from there? So UC Berkeley, you spent roughly four years on this, based upon what you had just said there… Eventually you left the project, because - hey, that’s how open source works, and you moved on with your career… What was next for you?

Well, interestingly, I wasn’t super-interested in just being a software developer when I left Berkeley. I really wanted to potentially work on Wall Street, or be a consultant and travel and see all kinds of different businesses in situ. I ended up taking a job at Accenture, which was called Andersen Consulting back in 1997… But I stayed there only four months, because it wore pretty thin pretty quickly. It wasn’t the glamorous lifestyle I had imagined it would be. There was a lot of sitting around and working on silly projects that weren’t challenging in the way that, for example, writing the GIMP had been.

So I ended up going and working at a boutique investment bank for a year after that, and that also wasn’t quite to my liking. It felt more like gambling than it did deterministic software development. But that was right in the middle of the dotcom boom, so in 1999 I came back to Silicon Valley and started a company as a co-CTO. It was called WeGo Systems. No open source in there, but it was a content management system basically for hierarchical web presences. It was pretty neat… But it ran into the end of the dotcom boom, which became the dotcom bust, which was a pretty interesting experience to live through… And that’s where that project/company ended. It was actually sold.

And then Peter had actually done a similar move in terms of doing his own dotcom startup. He also ran into the dotcom bust and started at Google. And it was at Google in 2002 that Peter said “Hey, you’ve gotta come work here. This place is amazing, and things are going great.” Which was a strange thing to hear in 2002, because 101, for example, if anyone on this podcast is from California - I imagine quite a few are - it’s this highway that runs North-South in California, and in the dotcom heyday it was more like it is these days, pre-Covid… Just absolutely jam-packed with traffic at most of the reasonable hours of the day… And after the dotcom bust happened, it was like tumbleweed blowing across 101. It was a really sad and sort of desolate stretch of highway for some of the busiest hours. That’s what it felt like.

Google on the other hand was just blowing up. It was a wonderful place to work, with this exuberant culture, and everything seemed to be going right. So within three months, Peter started there, I started there, and Ben, the third co-founder for Cockroach Labs started there. And we all started working together on just an incredible diversity of projects.

I’m not sure I’ve ever talked to anybody who has actually built a company into the bust of the dotcom era… So what kind of scars did you take away? What kind of learnings did you take away from that era of your life that maybe still help you make decisions today?

I think one piece of advice I’d give any potential entrepreneur is start a company only with people/co-founders that you have been in the trenches with. Preferably for considerably longer than a year, but I’d say at least a year. The trenches means there’s been shells whistling over your head, and not enough to eat for some of the time. It needs to be some good times, but also a lot of bad times… And if you still maintain a lot of respect for folks that have been with you in those situations, I think they can make really good co-founders.

[20:08] I’ve started three companies now, and Cockroach Labs is the only company where it was just strictly co-founders that I had already been working with for, in this case, a decade plus… And that has worked out very well. So you just really want co-founders to be people that you truly understand and respect.

Yeah, I suppose when you first start your career as an entrepreneur you might have to get in the trenches with just anybody, to some degree, which is where that advice comes from… Because you might eventually get your own battle scars and learn that lesson the hard way, like you may have done… But you might be so eager to begin that you’re like “I will partner with anybody. I will go to a meetup to find my business partner.” Which does happen successfully in some cases…

Right, it absolutely does.

But I do agree with that - in the trenches I think is where life happens… And life is not always fair, life is not always fun; sometimes it is. But being able to respect and appreciate the persons and/or person next to you that is leading your company is vital.

Yeah, very vital. And that kind of leads to the other piece of advice I’d give to entrepreneurs - exactly as you say, sometimes people just can’t wait. And that’s fine. I wouldn’t say delay your startup idea if you’ve got one that’s inspiring and you really believe in. On the other hand, if you only feel a mediocre pull of gravity, let’s say, for your startup idea, the recommendation I’d give to people is work at a company that looks like it’s really going places. I think the sweet spot would be a startup that’s pre-IPO, that is between 100 and 500 people, it really looks like it’s starting to win its category… That is a prime and fertile experience, where you are going to meet people in the trenches that you will wanna start that next company with… And there’s a lot of ways to learn in sort of a negative sense what doesn’t work. You could spend an entire career doing that… And that experience is valuable. That’s sort of battle scars, and “Hey, I’ve seen this done before and it didn’t work out so well. Maybe we should think of an alternative.”

But I think the sort of positive learning experience where you go somewhere and you see a company that has a great culture, that seems to really be succeeding - those situations attract the very best and brightest. So you end up with a reputational – let’s call it an experience that really gives you a reputation that can help you in terms of getting investors and attracting people to work at your new venture… But you also, as I said, end up being thrown into the trenches with really good soldiers (to keep that metaphor going) and that does end up being the lifelong friends and collaborators.

You’re also networking, too. When you’re at a company like that, you’re obviously gonna be around people who have ambition, have desire for success, they’re able to get hired by a company like that, stay employed, maybe even ship good stuff and deliver on what their promises might be… And people undervalue early on how to get to a network, how to build a network. I think you just start… You just make friends with one person, do your best to keep connected, and rinse and repeat.

[23:42] Yeah, absolutely. I can’t tell you how impressive some of the outcomes have been for the folks that I worked with at Google back in 2002… Just the diaspora of that cohort of Google employees is something to behold. So yeah, it’s exactly your point - there’s exceptional people, and that’s really how you do the real networking. I’m not saying you can’t do it on LinkedIn; it’s a great tool, but really working on solving interesting and difficult problems with the best and brightest - that’s how you do the networking, and the only way to start is just to put a foot on the path and start walking.

So when did you encounter the problem that you’re solving today? I know you’ve got some experience at Google, obviously… I understand you were at Square for a bit, you had a startup called ViewFinder, which you have since sold… You’ve got a lot of in-the-trenches, bloody knuckles, and even time in the trenches with Peter and Ben, your two co-founders, to kind of get to a problem set, which is usually the crux of why you’re doing today what you’re doing today… So how did you get there and what is that?

Yeah, so databases - it turns out that they have been extraordinarily essential in my career, back as early as the dotcom startup I did, WeGo Systems. We built on sharded Oracle and sharded Postgres; those were the two sort of flavors we supported. And I’ve gotta tell you, when I was at Berkeley I wasn’t very interested in databases. I mentioned graphics - that was really probably my key interest. Databases - I didn’t take until my first and only year of grad school, and I just kind of took it to get some credits.

I ended up being pretty interested in the course, but I didn’t really think they’d be central to my career, but as soon as I hit the “real world”, databases became a central problem, a big source of frustration at WeGo, and then when I got to Google, that was one of the first projects I got thrown onto, which was the AdWords system, which was nascent then in 2002… But it was running into problems with sharded MySQL. And you hear this word “sharded”, but for listeners that aren’t aware of what that implies, it’s about taking a monolithic database like Postgres or MySQL or Oracle that really is meant for a single machine, even if that machine can be quite large… And you say “Well, maybe this isn’t gonna be large enough”, and this is the case of AdWords when I got put on that project.

So you say, “Okay, we’re gonna use two databases. We’ll put half of our customers on the first database, half on the second”, and maybe at some point you start reaching [unintelligible 00:26:22.09] on those two, and so then you say “We’re gonna use four” or “We’re gonna use five” etc. It got up to about 32, I think, when I was at that project at Google… And all these different problems started to occur as we sharded. The application complexity became quite high. It just went ridiculous…
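The manual sharding described above can be sketched in a few lines. This is only an illustration, not how AdWords routed customers: the hash-and-modulo scheme and the names `shard_for` and `moved_fraction` are assumptions made for the example. It shows one reason application complexity climbs as shards are added: with naive modulo routing, changing the shard count relocates a large share of customers.

```python
import hashlib


def shard_for(customer_id: str, num_shards: int) -> int:
    """Route a customer to a shard index using a stable hash.

    Hypothetical scheme for illustration; the application, not the
    database, has to own this routing logic.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(customer_ids, old_shards: int, new_shards: int) -> float:
    """Fraction of customers whose shard changes when the shard count changes."""
    moved = sum(
        1
        for cid in customer_ids
        if shard_for(cid, old_shards) != shard_for(cid, new_shards)
    )
    return moved / len(customer_ids)


customers = [f"customer-{i}" for i in range(10_000)]

# Doubling from 2 to 4 shards moves roughly half the customers;
# going from 4 to 5 moves roughly four in five.
print(f"2 -> 4 shards: {moved_fraction(customers, 2, 4):.2f} of customers move")
print(f"4 -> 5 shards: {moved_fraction(customers, 4, 5):.2f} of customers move")
```

Every query in the application has to call something like `shard_for` first, and any query that spans customers has to fan out to all shards and merge the results itself.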

Practical example - the MySQL databases had too many connections coming into them, and that started to cause them to [unintelligible 00:26:49.27] And so we solved these problems – every morning we had this Ads War Room to solve the latest set of problems related to this scalability challenge with the database.

I will just say that in Google AdWords, by the time they replaced that sharded MySQL architecture, they’d gotten to a thousand shards. So it became a thousand MySQL instances. And I’ve heard that Facebook has hundreds of thousands of MySQL instances. So there’s kind of no end to both how scalable that architecture is, but also how much time you have to put in to truly keep scaling it.

So that’s a scalability challenge… There’s also resilience challenges, and that’s part of what we saw when we were at Square, and it was certainly something we saw when we were at Google… And that is you really don’t wanna have a database that has a primary and a secondary… And that’s been the standard way to operate databases for most of my lifetime. The problem with that solution is that the secondary is getting an asynchronous replication stream for data. And even if you put it in another data center so you have a really nice failure scenario, so you can lose a data center and fail over, that failover might imply data loss… Because that asynchronous replication stream might not have fully made it over to the secondary when the primary dies. So you switch over to the secondary and you realize “Wait a second… I thought I just sent that email out”, as an example. But it’s not in my outbox. What happened? Well, the replication stream just didn’t get that email into the outbox on the secondary. So it’s almost like you’ve moved backwards in time. You’ve regressed to an earlier version of the state that you had in an application, and that causes huge headaches.
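The failover data loss described above can be shown with a toy model. This is a sketch under simplifying assumptions, not any real database's replication protocol: `AsyncReplicatedStore` and its methods are hypothetical names, and the "replication stream" is just an in-memory list.

```python
class AsyncReplicatedStore:
    """Toy primary/secondary pair with asynchronous replication (illustrative only)."""

    def __init__(self):
        self.primary = []    # records acknowledged to clients
        self.secondary = []  # records the secondary has applied so far
        self.pending = []    # replication stream not yet applied to the secondary

    def write(self, record) -> bool:
        # The write is acknowledged as soon as the primary has it;
        # the secondary catches up later.
        self.primary.append(record)
        self.pending.append(record)
        return True

    def replicate(self) -> None:
        # Drain the replication stream into the secondary.
        self.secondary.extend(self.pending)
        self.pending = []

    def fail_over(self):
        """Primary is lost: return what the secondary has, and what was lost."""
        lost = list(self.pending)
        self.pending = []
        return list(self.secondary), lost


store = AsyncReplicatedStore()
store.write("email-1")
store.replicate()          # email-1 reaches the secondary
store.write("email-2")     # acknowledged, but still in flight
surviving, lost = store.fail_over()
print("after failover:", surviving)
print("acknowledged but lost:", lost)
```

The client was told that `email-2` was written, yet after failover the outbox only contains `email-1`, which is the "moved backwards in time" effect.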

[28:37] If a data center was lost at Google back in 2004, let’s say, it would be many teams scrambling to figure out what might have gone wrong. “Did we charge a customer twice? Are there consistency problems in the data because some of this stuff got replicated and some other stuff didn’t?” And you’d have to write cleaners and scripts that would go through things… And you’re just trying to reason through what might have gone wrong with your use case. That’s not the right way to do database replication, and certainly not in 2020.

Google started to play around with better ways to do that as early as 2004-2006. They built Bigtable, and then they built something called Megastore, and then they built something called Spanner… And Spanner is really what inspired Cockroach. So there’s scalability, there’s resilience… Those are two of the biggest problems that I’ve faced with databases in my career.

The gold standard these days with databases is to do what’s called consistent, synchronous-based replication. The popular ways to do this are something called Paxos, or something called Raft… And what they do is consensus. So instead of just writing to a primary and asynchronously replicating to a secondary, you actually write to three data centers, or three replication sites, and you are going to be committed if the majority of the replication sites respond positively or affirmatively to any particular write. If for example you only write to one out of three data centers, that write can’t be committed. You need two out of the three. As long as you have two out of three, if you lose any one data center out of those three, you always are guaranteed that one of the remaining two has the exact data that you need. So as long as you only lose the minority, you have total operational continuity. It’s hard to overestimate just how important that advance is for running these systems operationally.
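The two-out-of-three rule above comes down to simple quorum arithmetic. The sketch below covers only that arithmetic and is not an implementation of Paxos or Raft (there is no leader election, log, or failure detection); the function names are made up for the example.

```python
from itertools import combinations


def majority(n_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


def is_committed(acks: int, n_replicas: int) -> bool:
    """A write commits only once a majority of replicas acknowledge it."""
    return acks >= majority(n_replicas)


def tolerated_failures(n_replicas: int) -> int:
    """How many replicas can be lost while a majority is still reachable."""
    return n_replicas - majority(n_replicas)


def quorums_always_overlap(n_replicas: int) -> bool:
    """Check by enumeration that any two majorities share at least one replica.

    This overlap is why a surviving majority always includes a replica
    that holds every committed write.
    """
    size = majority(n_replicas)
    quorums = [set(q) for q in combinations(range(n_replicas), size)]
    return all(a & b for a in quorums for b in quorums)


print(is_committed(acks=1, n_replicas=3))  # one of three is not enough
print(is_committed(acks=2, n_replicas=3))  # two of three commits
print(tolerated_failures(3), tolerated_failures(5))
print(quorums_always_overlap(3))
```

With three replication sites the majority is two, so one site can be lost; with five it is three, so two can be lost.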

And you were in an era when this didn’t exist; you had to invent it. You’d mentioned Spanner was inspirational, to some degree… And even as you talked through the problem, it reminds me a little bit like RAID for hard drives, for example. You might choose RAID 0, because you want super-fast disks, you may choose RAID 10, or RAID 5, or a couple different other flavors… Essentially, it’s how many disks you can have lost at once, and it’s similar; it’s like, how many databases, which is literally what a disk is - it’s a database of your files, right? It seems a lot like even that at a small level… Why did it take so long, do you think, to hit the problems of sharding with MySQL, Oracle, Postgres or other, to get to that point? Was it technically not possible until around that time, or was it just like no one thought about doing it?

That’s a good question… I’m just gonna kind of think out loud on answering it. Certainly, your analogy to RAID disks is very accurate; that’s exactly what it’s like. I mean, not exactly, but it’s pretty similar.

Principle, yeah.

Yeah. The reason that – well, let me just say this… There’s nothing new under the sun in computer science. Or maybe the number of new things is vanishingly small. Everything’s been thought of before, so making sharding more automatic - this has existed far earlier than Google created Bigtable and sort of launched the idea of NoSQL. NoSQL - the word NoSQL, the term, predates Google or at least Bigtable by five or six years… At least the earliest mention of it that I’ve been able to find.

[32:13] So ultimately, the popularization, as opposed to the innovation of these kinds of things, whether it’s consensus-based replication, or elastic scalability in a cloud-native fashion - I think the popularization of these things and the widespread adoption has to have a lot of different confluent factors all aligning… The cloud is a big example of why these things are possible. Google had their own version of what looks like the public cloud [unintelligible 00:32:43.02] in 2020, they had that in the aughts. They had data centers all over the world, and Borg to control access to resources in a very frictionless fashion.

Once you start to have capabilities like that, you start to think that “Hey, we could write databases differently. We could use all these commodity resources and build a bigger database than anyone’s ever had.”

Another factor that really was instrumental to driving some of this innovation was the fact that after the dotcom boom, the idea of enterprise scale gave way to a whole new level of scale, which you could call web scale. That’s what people have called it. And I think there’s additional levels of scale that are on the horizon, or are probably already here.

When you think of why you need something like Cockroach, which is an operational – what they call Online Transaction Processing Database (OLTP), the idea of needing an OLTP database that could be petabytes or even exabytes is pretty foreign when you’re thinking about Oracle in the ‘90s, where it was used by an enterprise, and you have maybe ten million customers, the biggest-size enterprise… Google started to say, “Okay, we might have a billion customers, and we need to store all that data.” That’s just a hugely different problem, and it demanded additional architectural innovation for the database.

Yeah, true.

But now what we’re looking at is something that goes beyond the number of human beings that have smartphones. We start talking about IoT, and we start talking about virtual agents… Basically, anything that could hit a company’s API, which interacts with a service that they’re hosting that has a database that’s backing it. It used to be how many human beings had desktop computers. Then it was how many human beings can operate smartphones. Now it’s how many potentially non-humans can take some agency and access an API, causing a database to be involved. That number is already in the hundreds of billions, and it’s going to go to the trillions. So the demands of scale are probably pretty limitless when you actually look to the horizon.

But all of these trends, the alignment of them is what pushes what might have been a research paper in the ‘90s, which is the case of Paxos, into production systems. It’s just the demand has to be there, so the stars align. It’s really interesting to watch it.

Yeah. Basically, you don’t create software you don’t need. You create software you need today. So software that’s in place, successful, adopted, useful etc. is because it has a use. So as the need changed, the idea of multiple data centers etc. the need for how a database needed to work changed beyond what previously had been in place… And you needed a new look, a new database based on new infrastructure, new problem sets.

Yeah, I like that. You don’t build what you can’t use. That’s exactly accurate. And if you do, you’re probably wasting your time.

And you don’t use what you shouldn’t use. Sometimes you’re not Google and you use Google tools… “But I’m not Google, so I shouldn’t use Google tools. I should use the database that makes sense for me and my problem.”

That’s right.

[35:55] …which is a whole different subject. So you’re at a point now, obviously, where you’re in the trenches with the right people, you’re building the right technology, potentially being inspired – did Cockroach the software product, the initial of it, did it begin when you were at Google? Did it begin when you were outside of Google? How did the beginning of it happen? When did you first try it, ship it, see it be used by something else? Take me to that timeframe.

Yeah, it was when we left Google. So that was 2012. We had been there just under ten years. Great time, but ultimately, it felt like it was time to do something new. I even thought about going back to school; maybe I’d get an MBA, and kind of take a – an MBA is really a two-year vacation, where you network. That sounded pretty good to me. I thought maybe I’d go back and become a doctor.

I just felt like I didn’t necessarily wanna spend my whole life being a Google engineer. It didn’t matter how much fun or how challenging the work was; for me, that was just part of my internal calculus. In the end, we decided “Hey, we could do another startup.” And what Viewfinder was - it was private photo sharing. The same time that Snapchat was getting started, we were getting started, and I think we did build the right thing… Snapchat clearly did… It was really an amazing experience overall, of course; but when we left Google, we were a bit disappointed by what open source databases and open source infrastructure looked like in 2012, compared to what Google had been aggressively building. And that’s where the idea of Cockroach was initially born. It was really “Okay, well Spanner is great. We wanna have Spanner-like capabilities. But it has to be open source, and it has to be something that you can run on a laptop, and it has to be something that any startup could use.” And the idea of calling it Cockroach is really because cockroaches are so damn resilient. They say after World War III they’d be the only things left alive… It’s probably true actually, based on my experience living in New York…

All the way to WALL-E!

That’s right.

The movie WALL-E, that cockroach would not die. It would last through everything. It could be squashed and would bounce right back.

Yeah. So I think it was during the early days of Viewfinder that – again, it was another manifesto. I like writing those. It’s kind of like “Okay, well, what exists right now doesn’t work well enough. What would we –” It’s fun to write without thinking about the practicality of any particular prescriptive solution, but what would be the ideal solution to this problem, if there were just no barriers or limits?

Obviously, it’s still grounded in what’s conceptually possible, based on what I knew… And the beauty of having come from Google so recently is that the blueprint at least of the capabilities was very well understood. I mean, they’d just published that paper, too. And that manifesto was super-fun to write, but it was just this idea that, okay, there’d be these nodes, this commodity hardware, and I was thinking of AWS EC2 at the time, and every node of Cockroach would essentially colonize the disk space you gave it, and it would try to reach equilibrium, but it’d also be greedy about making sure its data was replicated to any neighboring nodes that it would coordinate with; there wouldn’t be any actual leader or central points of failure. Everything would be cooperative, with well-understood protocols. But capable of independent operation where necessary. And that was a super-fun thing to ideate.

[39:48] Ultimately, we were trying to build private photo sharing, not a database. So that project really was a passion project that had to be put on the backburner. We were then acquired by Square a couple years later, in 2014. And when we got to Square, they didn’t really have a fixed project for us to work on, so we went around, talked to a lot of different groups, and a theme emerged… And it was the theme, as I’ve already mentioned speaking with you, that has been prevalent in my career, which is “Databases are a significant problem.”

At Square, I think a lot of the problem was “How do you make sure that applications that are database-backed can survive a data center outage?” And not just survive it in a kind of half-working fashion, but to really have business continuity; no post-mortems for application teams.

Payment processing was this seminal example at Square. If you started authorizing a credit card and then finally charged it, or canceled the transaction, that’s sort of a two-step process. And if it gets interrupted mid-stream, so you authorize the credit charge, and then you aren’t able to cancel it or confirm it, you might end up authorizing it twice when that thing restarts and you failed over to a different data center. And that was problematic from a customer perspective. You don’t wanna get that alert on your phone that you’ve been charged twice, and then that causes problems for Square, and so forth.
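One common way to guard against the double authorization described here is an idempotency key: the client attaches the same key to every retry, and the store returns the original result instead of creating a second one. The sketch below is only an illustration under that assumption; `PaymentLedger`, its methods, and the in-memory dictionary are hypothetical and are not Square's or CockroachDB's implementation. It also only helps if the store itself survives the failover, which is the database problem being discussed.

```python
import uuid


class PaymentLedger:
    """Toy in-memory stand-in for a durable, replicated store (hypothetical)."""

    def __init__(self):
        # idempotency_key -> authorization record
        self._authorizations = {}

    def authorize(self, idempotency_key, card, amount_cents):
        """Authorize a charge once per key; retries get the original record."""
        existing = self._authorizations.get(idempotency_key)
        if existing is not None:
            return existing
        record = {
            "auth_id": str(uuid.uuid4()),
            "card": card,
            "amount_cents": amount_cents,
            "state": "authorized",
        }
        self._authorizations[idempotency_key] = record
        return record

    def capture(self, idempotency_key):
        """Second step: confirm a previously authorized charge."""
        record = self._authorizations[idempotency_key]
        record["state"] = "captured"
        return record

    def authorization_count(self):
        return len(self._authorizations)


ledger = PaymentLedger()
first = ledger.authorize("order-1001", "card-on-file", 2500)
# Simulated retry after an interruption: same key, so no second authorization.
retry = ledger.authorize("order-1001", "card-on-file", 2500)
captured = ledger.capture("order-1001")
```

Here the retry returns the same `auth_id` as the first call, so the customer sees one authorization rather than two.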

All those kinds of problems, if you don’t have a good solution to the real guts of the problem, the core… I laid out a fairly simple scenario, but the problem is these use cases - they get more and more complex, and the burden of maintaining it when there’s gaps at scale becomes very onerous. So that was a big learning at Google - anything that can go wrong, any gap you have where you’re like “Yeah, well, it’s pretty unlikely this is gonna happen”, trust me, at scale it will happen, and it will happen, and it will become a huge problem that will blow up on you.

So theoretically, when you build these kinds of systems, you do not want to have any gaps. Like, zero. Everything needs to theoretically work perfectly, even with disastrous scenarios that you don’t think are gonna happen. Like, weird network partitions that are going to be so obscure that you just can’t imagine they’ll happen. Boy, they’ll happen, and they’ll happen in like a month or two, at scale.

When we were at Square - just to pick up that thread again - we came to the conclusion that Cockroach as we had originally conceived it, its time might have come. I lobbied pretty hard for Square to support the Cockroach project… And there were definitely some people that were on board with it, and others that weren’t, and ultimately Square said that I could work on it, but they weren’t going to really adopt the project. So we started as a GitHub side-project, and I worked on it my nights and weekends, and eventually I was able to work on it full-time, while I was at Square, which was really an amazing time in my life.

For about six months, every day I’d come into the office and I’d say “Great, what’s the next problem? How do I build the very best database I can conceive of?” And there weren’t any customers, or managers, or any process…

No one stealing your time, yeah.

Yeah, it was wonderful.

Focus, right?

It reminds me of John Wick. He’s a “Sheer will, a man of focus.” It’s like, what could you do with complete focus, right?

Well, I think as any really dedicated programmer knows, those stretches of that sheer focus are some of the most pleasurable moments in the trade, or in the craft… And I think that’s true of any artist. When you get into that flow state, it’s meditative in terms of its quality, and it’s like a deep state of happiness.

You were at a point where obviously you were really enjoying it. You mentioned this six months of working straight on it… I’m assuming at some point you’re gonna depart from Square and rethink your life, and get influenced to take investment and create a company… Is that roughly what happens next?

Yeah. Well, the interesting thing about Cockroach is - to our earlier conversation, it was a technology whose time had come. People - I think their appetite was whetted by Google’s paper about Spanner.

“Well, we need this.”

Yeah, interesting. Like, “Who’s gonna build the open source Spanner?” Kind of like Hadoop was the open source MapReduce, and there’s other examples. And that was true more generally; not just the VC community, but developers everywhere that were interested in databases. We had a lot of stars on GitHub, and that ultimately led to a number of VCs coming around and wondering whether we were interested in taking money and really making this a commercial entity.

I remember the idea was a little foreign at first, just coming out of a startup, and actually enjoying my time at Square… But I realized I really want to build another open source system. I think that was one of the most rewarding things that I’d done so far, writing the GIMP. And I felt like Cockroach could really be extremely useful, and something that existed long after I stopped working on it; maybe even after I was no longer alive. It felt like it could be a system that really meant a lot and added a lot of value.

So I convinced Peter and Ben, which wasn’t – Ben was totally on board with it. Peter was thinking that he might wanna go back to Google to work on Go, or something like that… I said, “Come on, Peter, I know our last startup wasn’t the huge success we hoped it would be, but let’s jump on the bandwagon again” and build something that we’re probably a bit better at - distributed infrastructure, as opposed to a private photo sharing company where you have to understand the fickle desires of the average consumer, which is maybe something I’m not so good at.

Yeah. Well, you’re at a series D, which means you’ve gotten several rounds of funding so far, which means people believe in you to build what you’re trying to build. I know tons of people that use Cockroach and love it, so congrats on that. I know that Cockroach Cloud is now a thing and you’re doing well with that… In terms of, I guess, success of a business right now, how do you feel you’re performing as a business?

Really well. There’s always just existential concerns starting any company, and there’s been so many stages of growth… The early days when we were pre general availability, we had alpha, and then beta - those we could move so quickly, and it was extremely enjoyable. It was just R&D. Building a relational database from scratch, from the top to the bottom, is a huge undertaking. And those were, I think, some of the most enjoyable, just because of the extent of the challenge.

[48:03] But then teams started to grow, so you’ve got cultural issues, and you have to manage so that everyone is pulling in the same direction, instead of everyone doing something useful but pulling in opposite directions… And then you’d get customers, and you’ve gotta respond to all of their issues and make them successful…

And then it’s kind of like you’ve seen the crossing the chasm idea, where there’s this bell curve of adopters, and you have those innovators, and that’s kind of where Cockroach and most of these kinds of technologies start… And then you get to the early adopters, and the early majority, and where are we in that journey… It’s just, every new tranche of customers or people that are interested is a whole new challenge.

When I look at everything we’ve done, it feels like we’ve come a long way, but when I look at everything that we need to do, at least what I can envision, it feels like we have a heck of a long way to go… So I think it’s anything but certain that we’ve truly succeeded as a commercial entity… But we’ve come a long way. We have some of the biggest companies in the world now using Cockroach. And that includes the real blue chips, but it also includes the really fast-moving, high-tech growth companies. So both of those - extremely exciting.

I think when I started with Cockroach I was maybe a little intimidated or unsure of whether building something that would be enterprise software is really what I might be good at. But I’ve found that helping these bigger companies adopt cloud-native architectures and infrastructure is extremely rewarding, and that’s something I’m happy about. But as we were talking about at the beginning of the conversation, the real challenge is how do you build and deliver Cockroach as a service? And that’s where I think the future of our success is going to be made or lost. It’s a transition.

Right now, the world’s biggest companies - they wanna run a relational database themselves, they wanna self-host, they wanna buy software licenses. They might wanna put it in private data centers, or hybrid across private and public clouds. On the other hand, in five years, even those companies, much less every other startup and high-growth tech company - they’re all going to be using database as a service. In ten years, the entire world will be.

So we have to not just win where we originally set out to build CockroachDB the way that you might run Oracle or Postgres or MySQL if you were running it yourself, but we have to also now succeed with Amazon as a direct competitor, and Google, and Microsoft. These big clouds that are offering databases as a service and doing quite well with those businesses.

So how do we deliver Cockroach as a database as a service, and effectively compete? There’s a lot of really interesting answers to that question. It’s by no means a foregone conclusion that a company like AWS, which is the cloud vendor incumbent, really has as many advantages as you might think they have.

How do you do that thing? Because on the landing page for Cockroach Cloud you say “Cockroach Cloud is the simplest way to deploy CockroachDB and is available instantly. Here’s the key. On AWS and Google Cloud.” So what’s your current answer? I’m sure over time your answer will evolve, but what’s your current solution to competing with these big players?

Well, there’s a number of different aspects to the successful strategy, and as you say, ours will continue to evolve… One is you out-innovate. I think Google is probably the only of the cloud vendors that has a truly comparable technology. Amazon’s better at repackaging existing open source… And part of that out-innovating is – you may have read, we’ve made some license changes to the core of Cockroach. We adopted something called the BSL. That’s part of how you continue to out-innovate. It gives you a little bit of protection.

[52:10] Then there’s the idea of being multi-cloud, or cloud-agnostic, and that includes private clouds. So the deployment flexibility is extremely important to the world’s big companies that have been around for a couple of decades and have lots of existing investments in data centers and high-value use cases that aren’t gonna be easily moved to the public cloud. I think that is incredibly important.

Part of something that’s worth touching on further is just how much innovation can be done in the database as a service model. And that’s something that we’re pushing really hard on right now. Ultimately, we’d like to deliver databases with a lot less friction than they currently are delivered as a service.

Right now when you get a database as a service, there’s quite a bit of cost to it. Like, a sort of production-ready, encrypted instance of RDS, that’s sort of the minimal footprint - it still costs you about $100 a month, which is a lot. And you’re choosing the size of nodes, where those nodes are located… There’s a number of decisions that increase the friction of the process. We’d like to drive to a world where databases are truly serverless, in the sense that when you get a relational database, it’s something that you can pay for exactly what you use, not worried about what kind of machines, how many, and even where they’re located. You just get a database, and that database is truly capable of global operation. Hey, if you only use it on the East Coast of the United States, great. You wanna add the EU? That’s extremely simple. It’s as simple, essentially, as setting a different value for a column in a table specifying what region the data should be stored in, or whether it should be global, as an example.
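The idea of placing data by setting a column value can be pictured with a small toy model. This is a sketch only: `RegionalTable`, the `region` column name, and the `GLOBAL` marker are hypothetical names chosen for illustration, and the in-memory dictionaries stand in for real regional storage. It is not CockroachDB's API.

```python
GLOBAL = "global"  # hypothetical marker meaning "keep a copy in every region"


class RegionalTable:
    """Toy table whose rows are placed according to their 'region' column."""

    def __init__(self, regions):
        # One in-memory store per region, standing in for regional storage.
        self.stores = {region: {} for region in regions}

    def upsert(self, key, row):
        target = row["region"]
        if target != GLOBAL and target not in self.stores:
            raise ValueError(f"unknown region: {target}")
        # Drop older copies first so a changed region value moves the row.
        for store in self.stores.values():
            store.pop(key, None)
        placements = list(self.stores) if target == GLOBAL else [target]
        for region in placements:
            self.stores[region][key] = dict(row)

    def locations(self, key):
        """Return the sorted list of regions currently holding the row."""
        return sorted(r for r, store in self.stores.items() if key in store)


table = RegionalTable(["us-east", "eu-west"])
table.upsert("user:1", {"name": "Ada", "region": "us-east"})
# Changing only the column value relocates the row.
table.upsert("user:1", {"name": "Ada", "region": "eu-west"})
table.upsert("plans", {"name": "pricing", "region": GLOBAL})
```

After these calls, `user:1` lives only in `eu-west`, while the `plans` row is present in both regions.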

And further, we actually think that price is a major impediment to using something like a relational database as a service. We’d like to make these things perpetually free for developers, for a pretty generous tier. So think about what Gmail did in 2003, where they were effectively making a gigabyte of email free; at the time, you had Yahoo! –

It was unheard of.

Yeah. It was like 5 megabytes what you got before, which you filled up with one mp3 somebody sent you, or whatever; a couple photos. So this is a huge innovation, obviously; it just set a new standard for what web mail should feel like. And while Gmail is free, if you want a hundred gigabytes, you pay for that extra storage space. That’s exactly how Cockroach Cloud is going to feel to a developer. We wanna make perpetually free relational databases that are the seed of an extraordinarily powerful production database. Something that can scale to run retail banking for the world’s largest banks, that has geo-replication for a high level of resilience, and that is capable of truly global operations, so that even a startup could use the free tier of Cockroach Cloud and store data for customers in Brazil, in Brazil. Store data for customers in Japan, in Japan, and give them a local experience. That’s how big tech builds services and applications. We really wanna make that so that every company in the world, even every developer, even in a hackathon, can build that way. And it’s ideally easier to build that way than it is to stand something up yourself in a single availability zone.

That’s ambitious, for sure, because one of the hardest parts is adoption, and you’re guaranteeing that by enabling that perpetually free tier that’s generous, so that you can tinker in a hackathon or scale your enterprise, and it’s the same Cockroach Cloud; it’s the same cloud. It’s not a different version of it, it’s the same version, regardless.

[55:54] Yeah, we want that to be a very continuous product experience, and I think the journey that is the most evocative for me is you’re starting a company, which I’ve done; Viewfinder is the canonical example I always use in my head. How much easier could we have made the Viewfinder experience…?

Nice, yeah.

And that’s great, to have that experience to make product decisions; it’s pretty fundamental. But the idea would be hey, you wanna stand up your database pre-production, but you have developers that are pinging it, and so forth… You certainly don’t have to pay for that. You don’t have to have this big server that’s sitting there, that’s almost completely idle for months…

And then you launch the first version of your software, you get something into the app store, maybe it’s in TestFlight or something, and you have a hundred beta users that are poking at it, and so forth - you’re still under the free tier, for sure. It’s only when you really scale to get more product-market fit and you start having sustained high throughput, then you start to get into overages, and you can pay for exactly what you’re using over that free tier threshold.

And then, eventually, if your startup continues to succeed, you’re gonna want to move to sole tenancy, a dedicated cluster, as opposed to the Cockroach Cloud free tier and the overages where you’re sharing a multi-tenancy cluster with other users. So for infosec reasons, so that you don’t have noisy neighbors and you have very guaranteed throughput, exactly what you expect, and there’s no variance in terms of your latencies and response times and so forth. And also, in order to truly scale to big sizes, where the cost is more economical - that’s where you’d move to the dedicated cluster, and there you can scale to really as far as your ambitions.

It’s very similar to the VPS analogy, where you might be on a virtual private server, you maybe have some noisy neighbors, to use your example, but if you can go beyond that, maybe you get your own dedicated virtual private server where you’re not sharing, you’re not in shared resources, you have your own dedicated… That seems like a very similar analogy. So if you get that from that world, then you will get that in the database world that you’re creating.

That’s exactly accurate as an analogy. And what’s really wonderful about this capability that we’re building - think of it as virtualizing a big Cockroach cluster, and allowing many tenants to share those resources; that’s also something that’s extremely interesting to big enterprise customers. They would like to have their production use cases also run on a multitenancy dedicated cluster. So one of these big clusters that we might have public – you know, any developer can sign up with their GitHub OAuth login. But you might deliver that to a financial services institution as a dedicated cluster, but their internal teams get to share those resources in a pool. So they don’t have to say “Okay, for each one of these production use cases we have, we’re gonna have completely dedicated hardware, which we have to make sure is sized, so we can handle our peak throughput…” - that’s a lot of wasted resources over a hundred production use cases. If you can pool all those resources and allow the overages from one to use additional resources from others that might not be at peak throughput, then you get to have much more efficient resource sharing.
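The pooling argument is easy to check with a little arithmetic: dedicated hardware has to cover the sum of every workload's own peak, while a shared pool only has to cover the peak of the combined load. The throughput numbers below are made up for illustration, and the function names are hypothetical.

```python
def dedicated_capacity(workloads):
    """Each workload gets hardware sized for its own peak."""
    return sum(max(series) for series in workloads)


def pooled_capacity(workloads):
    """One shared pool sized for the peak of the combined load."""
    return max(sum(step) for step in zip(*workloads))


# Hypothetical throughput per time slot for three internal teams.
workloads = [
    [10, 80, 20, 10],
    [70, 10, 15, 20],
    [15, 20, 90, 10],
]

dedicated = dedicated_capacity(workloads)  # 80 + 70 + 90 = 240
pooled = pooled_capacity(workloads)        # busiest slot: 20 + 15 + 90 = 125
print(f"dedicated: {dedicated}, pooled: {pooled}")
```

Because the three teams peak at different times, the shared pool in this example needs roughly half the capacity of three separately sized clusters. The saving shrinks if the peaks coincide.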

So what we’re building for the public at large to really connect CockroachDB to the large audience of developers out there in the world is also something that is extremely valuable to the high-end dedicated companies and customers.

It’s interesting how the ideas translate from small to big, and big to small. That’s interesting. Let’s close with this… I didn’t preface this with you, so this is sort of a curveball to some degree, but… What’s lesser known or not known at all to, say, the general developer world of what you’re doing? So what’s on the horizon for you that not many people know about, that you can share here today?

[59:58] Well, a lot of what I’ve been talking about I’d say is understood by still a small audience. That’s something to always keep in mind, that crossing the chasm thing. I think that the large pool of developers out there - and there’s ten million of them in the world - the majority of those have probably never even heard of Cockroach. That’s also interesting. I imagine people that listen to your podcast are closer on to that innovator side of the bell curve.

The thing that I think might be extremely interesting, that isn’t necessarily obvious from what I’ve already talked about, is just what we think the 2020s holds in store for even a developer at a startup, or a developer at one of the Fortune 500 companies, and Fortune 10 companies even… And that’s really not just a database that’s serverless, but an entire stack above that database. If you really wanna build an application the way that Facebook or Uber or Netflix builds them, so that wherever you do get customers around the world, you can give them what feels like a local experience, it’s more than just a database. The database is clearly a foundational layer in the stack, but you need to have an execution layer as well above it. You’re certainly gonna need additional systems that are also global; you’re gonna need global DNS and global load balancing, and so forth.

So really what’s on the horizon for us is “How do we partner with the clouds, with other technology companies that are complementary to what Cockroach Labs is doing, in order to define the next generation of stack?” You remember the LAMP stack, which really drove a lot of the innovation in the aughts and beyond; the big question for us, and I think what’s extremely exciting, is the emergence of a stack that allows a startup or a Fortune 500 company to build the way that Google builds and operates services and applications.

I think that’s where a lot of our thinking, and I’m sure a lot of the thinking of all of our contemporary peer companies is going to be directed in the next five years. And part of that I think is 5G, interestingly enough. It’s pretty unusual that there is a significant improvement in latency in communication networks. It’s much more common that you have significant improvements in bandwidth. Latency improvements happen somewhat infrequently, and they usually herald quite a bit of innovation. So I think the widespread adoption of 5G in the next five years is going to mean that applications, especially on a smartphone, can feel substantially different than they do today.

I think everyone’s pretty used to hitting a button on a smartphone, and maybe a second and a half later something changes. That is a pretty bad user experience, but it’s just one we’re all used to. Ultimately, you want that to be the 100-millisecond rule, as popularized by Google Gmail, and now more recently Superhuman, which is another email application. And 100 milliseconds is the threshold for a human noticing something as taking time or being instantaneous. Less than 100 milliseconds is instantaneous.

So if you can actually adhere to that latency end-to-end, in other words you hit a button on your smartphone and you get a response all the way up to the backbone, into the – across the backbone, to wherever the data center is, through the application logic, into the backend database, and then all the way back out, that roundtrip time, less than a hundred milliseconds, you can give people real-time experiences. And obviously, for gaming, interactive media of all sorts, self-driving, AR, VR - these are obvious use cases, where this kind of latency guarantee is transformational, maybe even necessary…

[01:04:04.12] But I actually think that as this becomes both more desirable - and that will happen by degrees at first, and then all at once - but also more tractable… Like, it’s not just Google being able to build these things, or Facebook, but a startup; even a hackathon - that’s like the gold standard, a litmus test, in my opinion… Then you’re going to see lots of innovation, and even things like what happens when you post on Twitter, or what happens in your Facebook feed, news feed, as little things start to happen and you start to see more than just a couple dots going across the screen when somebody else is typing, but you actually start to see genuine interactions - that’s gonna make the virtual world that so many of us are spending so much time in feel substantially different. And applications that don’t start to feel that way will increasingly feel antique, and sort of out of touch and clunky.

To our point before - why did these technologies find such widespread adoption, and all these stars have to align when there’s a huge demand that catalyzes across the ecosystem, that will be what everyone’s building for in 2025, and that’s really where we’re interested in setting our sights.

That’s an interesting perspective. Just to humor me, is 100 milliseconds basically one tenth of a second? Is that what it is?

That’s right.

I had to grok that in my own mind, and I’m thinking “Listeners, just so you know, 100 milliseconds is a tenth of a second.” So what you’re talking about is quite a bit of an operation, to go through the client device all the way through the stack and back again, in one tenth of a second.

Yeah. And what’s interesting is you simply can’t do that for a user that’s in Sydney, Australia if your data center is in Virginia. It’s just not possible.

It’s too far, yeah.

In fact, it’s gonna be half a second. And you think, “Well, what difference does half a second make? That’s kind of ridiculous.” Well, Google’s found that their search results, if they take 200 milliseconds instead of 100 milliseconds, or 300 instead of 200, there’s this incredibly consistent relationship that they observe between how many searches people do and how much of a latency they experience. Even down to these fractions of a second. And the curve is reproducible, and especially over the amount of data they collect, it’s extremely consistent. And that is a little bit mind-blowing, but how do you solve that problem? Because the speed of light really, and the speed of networks aren’t gonna allow you to get that Australia user a local experience, you have to expand what your data architecture looks like, what your whole stack looks like, so that you’re really running a global architecture, so that there’s application logic in Australia, running on servers in Australia, and there’s databases that are running on servers in Australia. That’s the only way you’d really do it. And that’s also great, because a lot of countries are introducing data sovereignty regulations, and they don’t want users’ data, especially if it’s personally identifiable, to exit legal jurisdictions. And users don’t want that either… So how do you grapple with all this stuff? And the answer is “Okay, if you’re Google, you just build it all.” If you’re anyone else, you simply don’t. You try to get everything to sort of work, out of a single availability zone.
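A back-of-the-envelope calculation shows why the Sydney-to-Virginia case cannot meet a 100-millisecond budget even in the best case. The sketch below assumes approximate coordinates for Sydney and northern Virginia, a straight great-circle fiber path, and light travelling through fiber at about 200,000 km/s (roughly two thirds of its speed in a vacuum). All of these are simplifying assumptions, so the result is a lower bound rather than a measurement.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_SPEED_KM_PER_S = 200_000.0  # assumption: about 2/3 the speed of light


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_round_trip_ms(distance_km):
    """Best-case round trip: there and back at fiber speed, nothing else."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000


SYDNEY = (-33.87, 151.21)            # approximate coordinates
NORTHERN_VIRGINIA = (39.04, -77.49)  # approximate coordinates

distance_km = great_circle_km(*SYDNEY, *NORTHERN_VIRGINIA)
rtt_ms = min_round_trip_ms(distance_km)
print(f"distance: {distance_km:.0f} km, best-case round trip: {rtt_ms:.0f} ms")
```

Under these assumptions a single round trip already exceeds 100 milliseconds before any routing detours, connection setup, application logic, or database work is counted. Those extra costs are why the real figure is much higher than this lower bound.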

In order to solve this problem for a much more general audience, it’s about improving the infrastructure. So that’s what we’re doing, at least; we’re pushing a lot of those capabilities and smarts down into the database.

Very cool. Spencer, thank you so much for spending this time with us and sharing your story, and Cockroach’s story, and this look into the future of what networks might be like, and how you’re planning for them to be reliable. Not so much the network, but the data that might transpire there and the partnerships you might form as a result of this newfound lack of latency in our future communication networks… So thank you so much for sharing your time today, and I appreciate you coming on the show.

Yeah, it’s been my pleasure. Thank you, Adam.