Stephan Ewen, Founder and CEO of Restate.dev, joins the show to talk about the coming era of resilient apps, the meaning of idempotency and what it takes to achieve it, this world of stateful durable execution functions, and when it makes sense to reach for this tech.
Stephan Ewen: The short answer is it's really, I would say, the state of the art to build backends that are supposed to - backends that do any non-trivial state management and coordination, it's completely unsustainable, I think, the way we're building this today. Just to give you an example... Or let's actually start with an example. Let's stay with LLM, because we just talked about it, right? So let's say you're building a chatbot; you're submitting something, like a message there... This thing in the end has to reach the LLM, but it has to look up the context in which that chat happened before, it has to make the call, it has to go back, store the context... You don't want it to just lose everything if you lose your connection in the middle of - let's go with the F5 thing again... You don't want it to actually trigger the same request twice, or lose the entire session, make you start over. So you're probably just putting this as an asynchronous request that runs in the background, that you're sending from your chat session, from your browser, but it's a separate asynchronous request that runs, that talks to the LLM. You want it to be actually retrying in case something fails, or is overloaded and it's throttling your [unintelligible 00:15:19.22] And then you want to be able to reconnect to that task or request in case something goes wrong in your browser, or you accidentally hit the Back button, or whatever. Just implementing this is a surprisingly complicated thing, where you start to stitch together probably a queue, a database, and a bunch of tasks to manage that.
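As a rough illustration of the chatbot flow described above (not from the conversation), here is a minimal in-memory sketch of two of the requirements: deduplicating a request by ID so a refresh does not trigger it twice, and retrying when the LLM call is throttled. Every name here (`ChatBackend`, `Throttled`, `flaky_llm`, `request_id`) is hypothetical, and the dictionaries stand in for the queue and database a real backend would need in order to survive a crash.

```python
import time


class Throttled(Exception):
    """Raised by the (hypothetical) LLM client when it is overloaded."""


class ChatBackend:
    def __init__(self, call_llm, max_attempts=5, base_delay=0.0):
        self.call_llm = call_llm      # assumed signature: call_llm(history, message)
        self.max_attempts = max_attempts
        self.base_delay = base_delay  # seconds; doubled after each failed attempt
        self.results = {}             # request_id -> reply (stand-in for a database)
        self.history = []             # stored chat context

    def submit(self, request_id, message):
        # Idempotency: a repeated request_id returns the stored reply
        # instead of calling the LLM again.
        if request_id in self.results:
            return self.results[request_id]
        reply = self._call_with_retries(message)
        self.history.append((message, reply))
        self.results[request_id] = reply
        return reply

    def _call_with_retries(self, message):
        for attempt in range(self.max_attempts):
            try:
                return self.call_llm(self.history, message)
            except Throttled:
                if attempt == self.max_attempts - 1:
                    raise
                time.sleep(self.base_delay * 2 ** attempt)


# Usage: the first LLM call is throttled, the retry succeeds,
# and a duplicate submission (e.g. after F5) reuses the stored reply.
calls = []


def flaky_llm(history, message):
    calls.append(message)
    if len(calls) == 1:
        raise Throttled()
    return f"echo: {message}"


backend = ChatBackend(flaky_llm)
first = backend.submit("req-1", "hello")
second = backend.submit("req-1", "hello")
```

Because all state lives in process memory, this sketch loses everything on a restart, which is exactly the gap that the queue and database mentioned above are meant to close.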
To give you another example - we just talked about Stripe, right? So let's say you're sending a request there for a payment, and sometimes they tell you "Look, this is good or bad." Like, they accepted it or didn't. Sometimes they tell you "I don't really know." "Our fraud detector is still running." Or "We have some weird thing in the background that we're still asking, and it hasn't told us, so I'm going to send you a webhook in a moment to tell you whether this went through or not."
[16:02] And now you have like a synchronous request there, and then somewhere else is an asynchronous request coming up. You just want to make those two reliably meet. Even if this one fails, you want it to sort of like recover somewhere, understand where to reconnect with the webhook that you're awaiting... And this little piece - it's really just one case handling in the backend, where Stripe says "Okay, I'm processing", instead of yes or no. This is actually many days of work, to make that look reliable, to make that work reliably. And it's like lots and lots and lots of things like this, that just get in the way, with so many moving pieces, so many... Yeah, so many APIs to talk to, so much work. So much more work than originally happening asynchronously, and like in separate requests, than just in the synchronous user interaction. Just gluing this all together has become such a complicated thing that we felt "This does need a better solution."
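To make the "reliably meet" problem concrete, here is a small sketch (not from the conversation, and not Stripe's API) of a rendezvous between the request that got a "processing" answer and the webhook that later resolves it. `PaymentTracker`, the payment IDs, and the outcome strings are all hypothetical. The sketch assumes either side may arrive first and that the same webhook may be delivered more than once.

```python
class PaymentTracker:
    """Matches a 'processing' payment with the webhook that later resolves it."""

    def __init__(self):
        self.pending = {}    # payment_id -> callback waiting for the outcome
        self.outcomes = {}   # payment_id -> outcome already delivered by webhook

    def await_outcome(self, payment_id, on_done):
        # The webhook may already have arrived before we registered interest.
        if payment_id in self.outcomes:
            on_done(self.outcomes[payment_id])
        else:
            self.pending[payment_id] = on_done

    def on_webhook(self, payment_id, outcome):
        # Ignore duplicate deliveries; only the first outcome is recorded.
        if payment_id in self.outcomes:
            return
        self.outcomes[payment_id] = outcome
        callback = self.pending.pop(payment_id, None)
        if callback is not None:
            callback(outcome)


# Usage: both arrival orders end up delivering exactly one outcome.
tracker = PaymentTracker()
seen = []

# Case 1: the request registers first, the webhook arrives later (twice).
tracker.await_outcome("pay_1", seen.append)
tracker.on_webhook("pay_1", "succeeded")
tracker.on_webhook("pay_1", "succeeded")

# Case 2: the webhook arrives before anyone is waiting for it.
tracker.on_webhook("pay_2", "failed")
tracker.await_outcome("pay_2", seen.append)
```

The hard part described above is what this sketch leaves out: `pending` and `outcomes` would have to be stored durably so that a restarted process can still find the webhook it was waiting for.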
This is like the motivation more from, let's say, the use case side. I can give you a motivation more from like why we actually ended up doing this. I think this is a motivation that probably lots of folks stumbled across ultimately, and so like "Okay, this needs a better solution." I think there's different projects approaching that problem. Why are we approaching it the way we're doing? This has to do with where we come from. Before we worked on Restate, we were building Apache Flink. This is a different system. It's a stream processing framework. It's basically events and analytics. So you have these events coming in, often through a message queue, and you want to aggregate them, join them. A few examples where this is used is like fraud detection in banks. Some payment events go in, you aggregate feature vectors of that through a fraud model, or... Things like the TikTok recommender use Flink to actually join information from users and interactions together in real time, and understand how to update the features that will go into the recommendation model. I think companies like Uber use it to determine pricing and traffic models and ETA.
So it's whenever you have events and you want to analyze them in a way that you aggregate them into some sort of - yeah, typically statistical value, or a materialized view. This is what we were building before. So it's an analytical framework. What did actually happen then is at some point in time we saw folks were using that thing to solve the distributed transaction processing. The types of things where you would say - let's assume an order processing service that takes the event, checkout that order, and it has to do a bunch of steps. Let's say update the inventory, trigger payment, call the service to prepare logistics, maybe call another service to put this in the user's history... Maybe more steps, and so on. And we started to see folks using Flink for that, because it had this interesting property that it had sort of this baked in way of reliable communication and state management. It was all built for analytical use cases. But they found this is such an interesting property that they started to apply this to the transactional use cases as well, like order processing, just because they've found that this is otherwise way too complicated to build... Way too easy to build it in such a way that it is brittle and not scalable, and it has corner cases for when it violates a lot of these properties that we've just said.
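One common way to make a multi-step flow like this order example survive failures is to journal each step's result and skip completed steps on a retry. The sketch below illustrates that general idea only; it is not Flink's mechanism or Restate's API, and `Journal`, `run_step`, `process_order`, and the service names are hypothetical. The journal is an in-memory dictionary here, where a real system would persist it.

```python
class Journal:
    """Records step results so a re-run can skip steps that already finished."""

    def __init__(self):
        self.entries = {}   # (workflow_id, step_name) -> recorded result

    def run_step(self, workflow_id, step_name, action):
        key = (workflow_id, step_name)
        if key in self.entries:
            return self.entries[key]   # replay: reuse the recorded result
        result = action()
        self.entries[key] = result
        return result


def process_order(journal, order_id, services):
    # The steps named in the conversation, run in order.
    journal.run_step(order_id, "inventory", lambda: services["inventory"](order_id))
    journal.run_step(order_id, "payment", lambda: services["payment"](order_id))
    journal.run_step(order_id, "logistics", lambda: services["logistics"](order_id))
    return journal.run_step(order_id, "history", lambda: services["history"](order_id))


# Usage: the logistics call fails once; the re-run does not repeat
# the inventory update or the payment.
calls = []
crash_once = {"armed": True}


def record(name):
    def service(order_id):
        calls.append(name)
        return f"{name}:{order_id}"
    return service


def flaky_logistics(order_id):
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated crash")
    calls.append("logistics")
    return f"logistics:{order_id}"


services = {
    "inventory": record("inventory"),
    "payment": record("payment"),
    "logistics": flaky_logistics,
    "history": record("history"),
}
journal = Journal()
try:
    process_order(journal, "order-42", services)
except RuntimeError:
    pass  # in a real system, something would schedule the retry
result = process_order(journal, "order-42", services)
```

A failure between running a step and recording its result would still repeat that step, so the individual steps need to be idempotent as well.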
[19:48] And when this started happening repeatedly, we thought "Okay, but apparently there isn't really a good tool out there yet." And apparently, this property of like correct, stateful coordination is something people really appreciate. They feel like it makes their life easier to build this type of frameworks. And then we set out to build a solution for that, and that became Restate.
It's in many ways, actually, from the way it approaches things, or from its architecture, it's inspired by our work on Apache Flink... But it's almost a complete mirror image implementation of it. It takes almost the opposite design choice in most aspects, because it's really optimized for low-latency transactional processing, rather than high throughput analytical processing, which was Flink. But what we retained from this idea is - yes, stateful orchestration, and event-driven foundation and so on, this is something we should build and we should be working on. And yeah, that became Restate.