Stephan Ewen, Founder and CEO of Restate.dev, joins the show to talk about the coming era of resilient apps, what idempotency means and what it takes to achieve it, this world of stateful durable execution functions, and when it makes sense to reach for this tech.
Stephan Ewen: Yeah, yeah, yeah. Yeah, something like that. I think durability is probably the same as persistence maybe, with a bit of a stronger emphasis on it really doesn’t get lost after it happens. So durability is the D in ACID when it comes to databases. Databases say “We’re giving you atomicity, consistency, isolation, and durability. Once you do an update, we’re not going to lose it.” No matter what crashes, the database has a mechanism to bring that change to the database back. If I told you I’ve recorded that row, I’ve recorded that change, it will be there, no matter what. And in the context of Restate, that doesn’t mean – for example, the core building block of Restate is a stateful, durable function. You can think of it like that. And the stateful, durable function, when you schedule an invocation for that, or as you go through the code of that stateful, durable function, it has like multiple steps, recording a step. Whenever you go beyond a step that you asked Restate to treat as durable, you know that no matter what happens, you will never re-execute that step. You’ll never come up with a different value. Like, if your machine goes down, the Restate server goes down, if you deploy it across availability zones, the data center goes down, the network gets partitioned, whatever, you’ll never ever go back and re-execute that step if it once told you that it’s done it. That’s sort of the meaning of durability. Once it says it’s there, it’s always going to be there. And I think this is, in a way, almost one of the magic ingredients.
The way Restate looks at making distributed application development simple… I’d say there’s two core pieces that you need to think about. One of them is the durability; make durability extremely fine-grained, and extremely cheap. Because if you can apply durability in fine-grained steps, you always have to worry about very little after a failure. Let’s say your durability is coarse-grained. Let’s say the other workflow is one durable step, and it crashes in the middle. It gets retried. It’s up to you to figure out “Well, did I actually process the payment already or not?” Maybe there’s a way to just like assume “Okay, it’s idempotent. I can send it again.” Or I might even not be able to ask the service “Did I do that or not? Did I actually decrement the available kind of product already or not?” Maybe I have a way to, again, make this durable, or not. I don’t know. These things tend to be harder than one thinks, because sometimes the API gives you – you know, it might’ve given you an error back the first time, and you thought “I didn’t do this” and followed some control path flow… And then the next time you actually get, not an error, but the real result, and then you follow a different path… So people mess up this all the time. It’s really hard to reconcile if you have these multiple steps as [unintelligible 00:35:08.20] atomic unit. “What did I do? How did I do it the last time? How do I recover from this?” But if you have extremely fine-grained durability, if you’re recording every individual step as durable in the system, and when it comes back, it can tell you exactly like “This was the last step that you recorded”, then you just have a very small amount of uncertainty. “Okay, here’s this one thing that I might have tried already. I have to just worry about that bit”, instead of the whole history, and possible control flow, and all the choices, how I might have ended up here that I need to reconstruct in order to proceed consistently from there. 
So just like very fine-grained durability is extremely powerful in simplifying things.
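To make the idea of fine-grained durable steps concrete, here is a minimal sketch in Python. It is not Restate's API or implementation: the `Journal` class, its `run` method, the JSON file, and the `checkout` workflow are all hypothetical names invented for illustration. Each step's result is written to disk as soon as the step finishes, so a retry after a crash replays recorded results instead of executing those steps again.

```python
import json
import os
import tempfile


class Journal:
    """Toy durable journal (hypothetical): a JSON file mapping step name -> result."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            with open(path) as f:
                self.entries = json.load(f)

    def run(self, name, fn):
        # A step already recorded as done is never executed again;
        # its recorded result is returned instead.
        if name in self.entries:
            return self.entries[name]
        result = fn()
        self.entries[name] = result
        # Write to a temp file, fsync, then rename, so a crash
        # mid-write cannot leave a half-written journal behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.entries, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        return result


# Count real executions of each side effect.
calls = {"charge": 0, "reserve": 0}


def charge_payment():
    calls["charge"] += 1
    return "payment-ok"


def reserve_stock():
    calls["reserve"] += 1
    return "stock-reserved"


def checkout(journal, crash_after_payment=False):
    payment = journal.run("charge_payment", charge_payment)
    if crash_after_payment:
        raise RuntimeError("simulated crash")
    stock = journal.run("reserve_stock", reserve_stock)
    return payment, stock


journal_path = os.path.join(tempfile.mkdtemp(), "journal.json")

# First attempt crashes after the payment step was recorded.
try:
    checkout(Journal(journal_path), crash_after_payment=True)
except RuntimeError:
    pass

# The retry reloads the journal and skips the payment step.
result = checkout(Journal(journal_path))
```

After the retry, the payment has been charged exactly once. The only remaining uncertainty in this toy is a crash between `fn()` returning and the journal write completing, which is the "one thing that I might have tried already" described above.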
I’d say the second magic ingredient is then how do you anchor this in the whole retrying and resolving potentially inconsistent situations with partitions, with timeouts, with zombie processes, and so on, so that there’s always a very consistent view of what the last durable step was. I think that’s the second sort of ingredient of Restate. It’s not just durability, it’s actually durability and consensus, and giving you a very, very crystal clear view on where you left off, where you need to continue from. I think if you take those two things in conceptually, you’ve simplified the problem massively, and the rest is almost API sugar that you built on top of that.
[36:27] That’s, I would say, the magic that happens in the Restate runtime. It’s a very low latency, durable consensus log that fuses queuing, state management, locking, fencing, creating futures, resolving futures… All these kind of operations that tend to be part of a distributed coordination process.
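The fencing idea mentioned here can be sketched in a few lines of Python. This is a single-process toy under loud assumptions, not a consensus protocol and not how Restate is built: `FencedLog`, `acquire`, `append`, and the epoch counter are hypothetical names. It only shows the principle that a log can reject writes from a zombie process that still believes it owns an invocation, so there is one agreed-upon last durable step.

```python
class FencedError(Exception):
    """Raised when a writer with a stale epoch tries to append."""


class FencedLog:
    """Toy in-memory log (hypothetical) that accepts appends only from the latest owner."""

    def __init__(self):
        self.epoch = 0
        self.entries = []

    def acquire(self):
        # Each new owner gets a higher epoch, which fences off all earlier owners.
        self.epoch += 1
        return self.epoch

    def append(self, epoch, entry):
        if epoch != self.epoch:
            raise FencedError(f"epoch {epoch} is stale, current epoch is {self.epoch}")
        self.entries.append(entry)
        return len(self.entries) - 1  # position of the entry in the log


log = FencedLog()

# The first executor records one step, then stalls (say, a long pause or a partition).
old_epoch = log.acquire()
log.append(old_epoch, "charge_payment")

# A retry takes over with a new epoch and continues from the last recorded step.
new_epoch = log.acquire()
log.append(new_epoch, "reserve_stock")

# The stalled executor wakes up as a zombie; its write is rejected.
zombie_rejected = False
try:
    log.append(old_epoch, "reserve_stock")
except FencedError:
    zombie_rejected = True
```

The log ends up with each step recorded once, and the zombie's duplicate write never lands. A real system would also need the log itself to be replicated and agreed upon across machines, which this sketch leaves out entirely.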