Go Time – Episode #297

Event-driven systems & architecture

with Chris Richardson, Indu Alagarsamy & Viktor Stanchev


Event-driven systems may not be the go-to solution for everyone because of the challenges they can add. While a system that reacts to events published in other parts of the system seems elegant, some of the complexities it brings can be challenging. However, such systems do offer durability, autonomy & flexibility.

In this episode, we’ll define event-driven architecture, discuss the problems it solves, challenges it poses & potential solutions.






1 00:00 It's Go Time!
2 01:21 Viktor Stanchev
3 02:10 Indu Alagarsamy
4 03:01 Chris Richardson
5 03:45 What is event-driven architecture?
6 13:57 Orchestration vs choreography
7 19:00 All the tradeoffs
8 40:16 Bringing this to the wild
9 48:00 Risks of change
10 53:12 Final conclusions
11 56:27 Unpopular opinions!
12 56:50 Viktor's unpop
13 58:53 Chris' unpop
14 1:00:38 Indu's unpop
15 1:04:06 Outro



Hello, and welcome to today’s episode of Go Time. Today I’m going to be joined by three wonderful guests, and we’re going to be talking about event-driven systems. So we’ll define event-driven architecture, we’ll discuss some of the problems that it solves, and some of the challenges it poses in terms of implementation and trying to solve problems. And then we’re going to be chatting a little bit about potential solutions. So without further ado, I’m going to introduce you to our first of our three wonderful guests. We have Viktor Stanchev, who is a founding engineer at Anchorage Digital, and a co-inventor of their better than cold storage custody system for digital assets. He has been using Go for almost – exclusively since 2015. And he has been focused predominantly on backend systems, infrastructure and applied cryptography. Hello, Viktor. Thank you so much for joining us. How are you?

Hello. Doing great.

Doing good? Are you excited to be on today?

Very. This is my first podcast, so…

Well, very excited to have you, and for all you lovely listeners. This was actually an episode that was a brainchild of a chat that myself and Viktor had over coffee, so you will have him to thank or not to thank, depending on how good this episode is.

Next we have Indu Alagarsamy, who is a principal engineer at the New York Times. She’s passionate about event-driven-style architecture, so a great guest to have on today. She’s also the organizer of the SoCal Domain-Driven Design meetup. She has over 15 years of experience, she’s worked all over the place - in healthcare, biotech, emergency services… And in her own words, in her mind, bounded contexts plus messages equals microservices. I think we’re gonna need to talk a little bit more about that. And this is also your first ever podcast, so I’m extremely excited to have you on. How are you doing today, Indu?

Thank you, Angelica. Thank you for having me here. Yes, I’m a bit nervous… But I’m with friends, so hey, this is fun.

Yeah, you’re with friends; we’re just here to have a fun conversation, so it’ll be great. And then finally, but certainly not least, we have Chris Richardson, who is a software architect. He’s the author of the book “Microservices Patterns”, and he’s the creator of Microservices.io. And I see you have your books in your background, so for those of you watching on the video, you can see the covers of his books in the background. He helps organizations around the world improve their architecture, so I’m sure you’re gonna be bringing a wealth of knowledge to this conversation. And thank you so much for joining us.

Oh, it’s good to be here. A little early, still waking up, but… I’m doing good.

I appreciate it, powering through, and hopefully it’ll be a stimulating enough conversation to keep everyone awake and engaged. Awesome. So we’re gonna dive right in with the absolute basics. So what is event-driven architecture?

Yeah, I mean, I think it’s interesting… I think it’s a slightly fuzzy definition… But I think one sort of common definition is it’s a system or an application where different parts of that application communicate or collaborate using events. And events are – you could say they’re messages, or they represent things that occur within a given domain… You know, like an account was created, or an account was debited, the flight departed… Whatever things occur in your domain.

Awesome. And then Viktor and Indu, do you feel like that definition resonates with you? I know in some of our initial chats, when we were kind of framing this episode, we talked a little bit about how there’s some misconceptions. When people say “event-driven architecture” or whatever it may be, some people may think of it as one thing, some people may think of it as another… So there is really a need to dig deep and define, especially when hoping to implement a system, so that everyone is in line, they know what they’re talking about, they know what this means. I see you nodding and smiling, Indu.

Yeah, I agree with what Chris said. I think also in my mind it is if you have an architecture that is modeled after the real world. In the real world we work asynchronously. We were talking about Starbucks earlier; you go and order a drink, things happen asynchronously. You pay for your drink, and then you wait, and somebody calls you with your name and your pumpkin spice latte saying it’s ready. [laughs] Coffee snob, Chris… So I think if we take that paradigm, what happens in real world, which is all about asynchrony and events, and you model your actual software architecture, whatever problem domain that you’re in, in terms of those real events, and you have services that react to those events to go do something… So you have your whole architecture sort of in this flow of events… So to me, that is an event-driven style of architecture.

[06:05] Yeah, I think all this makes sense. But when people think about event-driven systems, they usually think of large microservices deployments with many machines and services, a lot of different moving parts and complexity. So I think that that’s like a really interesting aspect of it to dive into. Microservices is a very mature domain, Chris, right? So I’d love to chat about how event-driven systems help organize mutations of data in microservice-based systems.

Oh, yeah, I could say something about that. There’s sort of many different levels to this whole question about what’s an event-driven system, but one very specific way of looking at it is you have, say, a microservice architecture; it’s a set of services, right? Requests flow in, and some of those requests are local to a given service; those are just sort of – those are trivial to implement. But the really interesting ones are operations that are distributed across multiple services.

For example - there’s this example I’ve been using sort of ad nauseam for years now… To create an order - so that creates an order entity in the order service - you also have to reserve credit in the customer service, right? So customers have a credit limit. So the command “create order” actually has to perform two updates. And to implement a distributed operation like that, you have to use one of the service collaboration patterns. And in this particular case, the best fit would be to use the saga pattern, which implements a distributed operation as a series of transactions, local ACID transactions in each one of the participating services. And you need to coordinate those transactions using some mechanism. And there’s actually two different coordination mechanisms. One is orchestration, but the other one we’re going to talk about here is choreography. And that’s where you use events. So each transaction actually updates some local business entity, creates an order. And then it would publish an event, saying “order created.” That would then trigger the customer service to reserve credit, which would then publish an event - interestingly, one of two events. Credit reserved, or credit limit exceeded. And then the order service would react to that and either approve the order, or cancel it, or reject it. So that’s an example of a choreography-based saga that’s using events to implement this distributed operation… So that’s one very specific kind of use case for events in a microservice architecture.
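As a rough illustration of the choreography-based saga just described, here is a minimal Go sketch. It assumes a toy, synchronous, in-memory `Bus` in place of a real message broker, and every name in it (`Bus`, `Event`, `CustomerService`, `OrderService`, the event name strings) is hypothetical and not taken from the episode or from any library.

```go
package main

import "fmt"

// Event records something that already happened; names are past tense.
type Event struct {
	Name       string // "OrderCreated", "CreditReserved", "CreditLimitExceeded"
	OrderID    string
	CustomerID string
	Amount     int
}

// Bus is a toy synchronous stand-in for a message broker (an assumption).
type Bus struct {
	handlers map[string][]func(Event)
}

func NewBus() *Bus { return &Bus{handlers: map[string][]func(Event){}} }

func (b *Bus) Subscribe(name string, h func(Event)) {
	b.handlers[name] = append(b.handlers[name], h)
}

func (b *Bus) Publish(e Event) {
	for _, h := range b.handlers[e.Name] {
		h(e)
	}
}

// CustomerService owns credit and reacts to OrderCreated events.
type CustomerService struct {
	bus       *Bus
	available map[string]int
}

func NewCustomerService(bus *Bus, credit map[string]int) *CustomerService {
	s := &CustomerService{bus: bus, available: credit}
	bus.Subscribe("OrderCreated", s.onOrderCreated)
	return s
}

func (s *CustomerService) onOrderCreated(e Event) {
	// Local transaction: update credit, then publish one of two events.
	if s.available[e.CustomerID] < e.Amount {
		e.Name = "CreditLimitExceeded"
	} else {
		s.available[e.CustomerID] -= e.Amount
		e.Name = "CreditReserved"
	}
	s.bus.Publish(e)
}

// OrderService owns orders and reacts to the credit events.
type OrderService struct {
	bus    *Bus
	status map[string]string
}

func NewOrderService(bus *Bus) *OrderService {
	s := &OrderService{bus: bus, status: map[string]string{}}
	bus.Subscribe("CreditReserved", func(e Event) { s.status[e.OrderID] = "APPROVED" })
	bus.Subscribe("CreditLimitExceeded", func(e Event) { s.status[e.OrderID] = "REJECTED" })
	return s
}

// CreateOrder stores a pending order and announces it; it calls nobody directly.
func (s *OrderService) CreateOrder(orderID, customerID string, amount int) {
	s.status[orderID] = "PENDING"
	s.bus.Publish(Event{Name: "OrderCreated", OrderID: orderID, CustomerID: customerID, Amount: amount})
}

func main() {
	bus := NewBus()
	NewCustomerService(bus, map[string]int{"alice": 100})
	orders := NewOrderService(bus)
	orders.CreateOrder("o1", "alice", 80) // fits within the credit limit
	orders.CreateOrder("o2", "alice", 80) // only 20 left, so it is rejected
	fmt.Println(orders.status["o1"], orders.status["o2"])
}
```

Note that nothing in this sketch names the saga as a whole: the flow only exists as the combination of the two services' subscriptions.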

And the nice thing in what Chris, you said, is that the order service and the customer service - they’re completely autonomous. They’re loosely coupled, and they’re just reacting to when things happen. So they’re not like temporally coupled and waiting for the service to say “Hey, I’m done or not done.” So to me, I really love that. I love for services to be autonomous. And I think that this is where events help bring in your microservices to be autonomous.

[09:59] Yeah. Interestingly, the word “coupling” actually in software has many different definitions, right? There’s multiple flavors of it. And you sort of touched on it with like temporal coupling. Another word for that is runtime coupling, which as you point out in that design, those services are decoupled from a runtime perspective. So the order service could actually create the order and send back an HTTP response saying “Hey, I’ve created the order. It’s pending. Here’s the order ID. Check back later to see whether it’s been approved or rejected.” So that means that the order service can actually respond to that request without having to wait for the customer service to respond to it. And that’s really important, because if you have long chains of synchronous calls in a microservice architecture, it’s actually very brittle, and you risk having higher latency and lower availability. And using an asynchronous approach, like events, is one way to improve or decouple your services from a runtime perspective.

Yeah. That’s interesting though, we already got to coupling, and these kinds of concepts… Because I thought it was going to take us a lot longer to get there. But I think it’s interesting to compare choreography and orchestration in terms of coupling, in terms of safety, in terms of testability… Because I’ve actually worked with orchestration-based systems a lot more than choreography-based systems, and I’ve really enjoyed the way that you can see the process, the definition of the process, the flow of the process, as it goes from one service to another service, to another service. So you would be able to see and define that order moving from the customer to – what was your example, Chris?

Oh, it was order and customer service. So on the one hand, that’s sort of a trivial example. But…

But let’s extend the example to be something a little more involved. Let’s say you want to book a trip, and your trip has a hotel, and also a flight, and also a train, or something like that. If you want to sort of like keep your sanity, it’s really helpful to say “Okay, you do this, then you do this, then you do this.” This imperative style of programming is much more familiar and much more debuggable for people than a pure event-based system where there’s choreography, and each service is relying on the next to pick up where it left off.

Yeah… Well, I guess what I would say briefly is there’s this concept, there’s this create order saga, or book trip saga, but with choreography there’s no explicit representation of it in the system. Some services publish some events, then another service has event handlers for it and they react to them. So you can’t look in the code and go “Oh yeah, that’s how the create order saga works”, right? Whereas, as you point out with orchestration, where you have a centralized orchestrator that is actually invoking the participants - whatever, book a hotel, book a flight, book a car - you literally have a class, shall we say, in your system that implements that orchestration logic. And so “Oh yeah, you can see what’s going on.” I mean, it’s explicitly represented in your code, and that’s really valuable, especially when it gets more complex.

[13:57] Just to pull us back for a second, I think, Chris, you’ve spoken about this, or you’ve kind of given the delineation in your examples… But I do want to – for those who maybe this is their first exposure, they’re coming to this podcast saying “What is event-driven architecture?”, could we just state explicitly the difference between orchestration versus choreography? Is it kind of, as you alluded to, Chris, that if you have some sort of - almost like a conductor in an orchestra, a place where you are outlining those explicit handlings, etc. versus the choreography, which is more step by step by step, and less like you see the whole relationship… I don’t know whether, Indu, you could give us a bit of a - if no one understands the difference between orchestration and choreography, how would you kind of describe it to them as someone new to this kind of thinking?

Okay, so the orchestrator is this – you kind of said it right, he’s the conductor; he or she, they know what the sequence is, so they direct the sequence. They send a message to the order service saying “Go and create this. Wait for a response”, and then say “Okay, now you need to send a message to another service.” So they control how the flow goes, and so it is sort of like this central – I mean, in the DDD world we call this a process manager. So there’s this thing in the middle that orchestrates what should happen, reacts to responses, etc.

Versus in the choreography world, it is truly asynchronous, and there is no person in the middle. Things and business processes sort of happen naturally. So that’s like the main difference. So when the order service – so, for example you’re buying a book on Amazon. So Amazon tells you your order was received; you get that. And perhaps the warehouse service is listening to that event, and trying to say “Oh, now I have to make sure to check I have the inventory.” Maybe you bought something that’s perishable. How do I – like, there are so many business rules that go into how should I send this item over. Meanwhile, there’s the billing or the payment side of things that needs to listen and see “Oh, there’s a new order. I better make sure the funds are settled properly.”

So all of these things are reacting independently, working on their own business constraints and rules, and publishing events. So yes, it is sort of difficult… So a lot of people, in my early experience, struggled with this sort of notion, like “What do you mean? This happens here, and this happens here, and together, this forms a business process.” Wouldn’t it be nice if there’s this thing in the middle that said where to go, what to do next?

So I think there are trade-offs. This is where for me systems thinking and that style of design comes into play. Your whole thing, your ecosystem is ultimately a system that’s trying to fulfill the user needs. And so regardless of what autonomous systems and services you have, they all need to communicate. So there needs to be a map of how those interactions work, and how you maintain it… But it’s about trade-offs. And in certain cases, maybe if you have a process manager, this is one thing that’s directing all the traffic. How is your concurrency? Are you going to run into concurrency issues? What is your load like, and how is your process manager working? Versus in the other case, this loosely-coupled system.

So I think you know your system better, and you’ve got to weigh what those trade-offs are; whether one scenario makes sense… And so it depends on the domain problem. Does that simplify a little bit?

[18:02] Yeah, I think it’s interesting to think about some example implementations, because you can sort of imagine a trivial implementation of a choreography-based system could be that you have one database table, and you just add [unintelligible 00:18:15.02] to it, and every service just checks if there’s anything new there, and then reacts to that. That would be probably the simplest to understand version of a choreography-based system. I mean, there are many, many ways to implement, but that might be helpful conceptually. And then an orchestration-based system would be much more like you have one service that just makes a request every time that something needs to happen on another service.
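The "one table that every service checks" idea can be sketched in Go with an in-memory slice standing in for the database table. This is a conceptual toy under assumptions: `EventLog` and `Poller` are hypothetical names, there is no locking or persistence, and a real implementation would store each consumer's offset durably.

```go
package main

import "fmt"

// EventLog is an append-only list, standing in for a single database table.
type EventLog struct {
	rows []string
}

func (l *EventLog) Append(event string) { l.rows = append(l.rows, event) }

// ReadFrom returns every row after offset, plus the offset to resume from.
func (l *EventLog) ReadFrom(offset int) ([]string, int) {
	return l.rows[offset:], len(l.rows)
}

// Poller is one service that periodically checks the log for anything new.
type Poller struct {
	name   string
	offset int      // how far this service has read
	seen   []string // events this service has reacted to
}

func (p *Poller) Poll(l *EventLog) {
	events, next := l.ReadFrom(p.offset)
	for _, e := range events {
		p.seen = append(p.seen, e) // a real service would apply its own rules here
	}
	p.offset = next
}

func main() {
	log := &EventLog{}
	warehouse := &Poller{name: "warehouse"}
	billing := &Poller{name: "billing"}

	log.Append("OrderCreated:o1")
	warehouse.Poll(log) // warehouse is up to date, billing has not looked yet
	log.Append("OrderCreated:o2")
	warehouse.Poll(log) // picks up only o2
	billing.Poll(log)   // picks up o1 and o2 in one go

	fmt.Println(warehouse.name, warehouse.seen)
	fmt.Println(billing.name, billing.seen)
}
```

Each service keeps its own offset, so services read at their own pace and the writer never needs to know who is listening.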

I mean, there are undoubtedly lots of trade-offs… That’s kind of a key point, is there are a massive number of trade-offs, and it’s a giant sort of #itdepends in terms of what the best choice really is.

Could you dig a little deeper though, Chris? It’d be great, of like, what are some of those kind of trade-offs, possible kind of questions… If you’re kind of going into a room, a group of engineers - is an event-driven system the right choice for us? What should we use? What are the kinds of – if you could talk us through the questions to be asked, how would you evaluate those trade-offs…

Well, I think I want to just sort of touch on another concept issue. So to me, in an event-driven system, or shall we say systems that use events - they’re sort of a particular type of asynchronous messaging-based system. Maybe I’ll go up a level… So there’s two types of communication between services. The simplest one, obviously, is just to use REST or HTTP. That’s synchronous. A client, which might be another service, makes an HTTP request, waits for a response to come back. So that’s synchronous.

Then you have asynchronous systems, where services exchange messages. And messages might flow via a message broker, but there are brokerless messaging technologies as well. In particular, if you think about webhooks - that’s actually event delivery using HTTP. And maybe that’s more common outside of an enterprise, where third parties can register event notifications via a webhook mechanism; like whatever happens on a GitHub repo, or like when Twilio delivers an SMS message, that kind of thing. That’s basically event delivery via HTTP.

And then there are multiple types of messages. One type of message is an event. And so that represents something that has happened. And that’s kind of the true event-driven architecture. But then there are other types of messages; specifically, you can have a command message, which is actually a request to do something. And then possibly the recipient of that message sends back a reply message that contains the outcome of whatever it did.

So choreography-based systems, or choreography-based sagas use event messages for coordination, whereas orchestration-based sagas, they can use command/reply messages for communication. But everything is asynchronous here.

That’s a really interesting distinction. I just want to reflect on that for a second… The difference between the two, like the event and the command, is not in the technology, the way that it’s communicated or anything like that. It’s just “Is it something that happened, or is it describing what should happen next?” I hadn’t thought about it that way.

[22:02] Well, actually, a command is like “You do this.” So if we think about my customer and order example, in choreography the order service publishes an order-created event; that triggers the customer service to reserve credit. With an orchestration-based saga, the orchestrator would literally tell the customer service to reserve credit. It’s basically like an RPC or a method invocation that is packaged up as a message that would flow over a message broker. So it’s like, events are “Hey, I did this.” Commands are “Do this.”

So if you’re really pure about this distinction, you could say that an orchestration-based system is not an event-based system, because these are all commands; they’re not events.

Well, it can be a mix of commands and events, because your process manager, because it’s orchestrating, it might listen to an event from another system as well. So to touch on the difference… So commands and events, they both are messages. But when you’re designing them, it’s how you name them. Like Chris said, do something that’s like a direct order, that’s a command… So when you’re designing these messages, you can use that verb style, active verb style to name your messages, the commands. And events are always something that has happened in the past. So you name them in the past tense. And events are immutable statements of truth. So systems react to events that way; versus a command, when you explicitly order some service to do something, you have to expect that it can fail.

So process payment - you say the service should go reserve credit. That’s the example you gave, Chris. Maybe there weren’t funds, so you can’t go and do that reservation. So maybe that would fail. So I guess in this case the orchestrator or the software designer needs to think about what are the ramifications if this command fails, and also have logic to react or take compensating actions to those failures. So that also becomes the responsibility of your orchestrator.
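The naming convention and the "commands can fail" point might be sketched in Go like this. The type names (`ReserveCredit`, `ReserveCreditReply`, `CreditReserved`) and the handler are hypothetical illustrations: the command uses an imperative verb and gets a reply that may report failure, while the event is a past-tense fact that is only produced once the change has actually happened.

```go
package main

import "fmt"

// ReserveCredit is a command: imperative name, a request that may be refused.
type ReserveCredit struct {
	CustomerID string
	Amount     int
}

// ReserveCreditReply carries the outcome back to whoever sent the command.
type ReserveCreditReply struct {
	OK     bool
	Reason string
}

// CreditReserved is an event: past tense, a statement of something that happened.
type CreditReserved struct {
	CustomerID string
	Amount     int
}

// handleReserveCredit processes the command. It always returns a reply, and
// returns an event only when credit was really reserved.
func handleReserveCredit(available map[string]int, cmd ReserveCredit) (ReserveCreditReply, *CreditReserved) {
	if available[cmd.CustomerID] < cmd.Amount {
		return ReserveCreditReply{OK: false, Reason: "insufficient credit"}, nil
	}
	available[cmd.CustomerID] -= cmd.Amount
	return ReserveCreditReply{OK: true}, &CreditReserved{CustomerID: cmd.CustomerID, Amount: cmd.Amount}
}

func main() {
	available := map[string]int{"alice": 100}
	for _, amount := range []int{80, 80} {
		reply, event := handleReserveCredit(available, ReserveCredit{CustomerID: "alice", Amount: amount})
		fmt.Printf("reply=%+v eventPublished=%v\n", reply, event != nil)
	}
}
```

The sender of the command has to inspect the reply and decide on a compensating action when `OK` is false, which is the responsibility the orchestrator takes on.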

And this is also where if you have two autonomous systems - or bounded contexts, in DDD terms - you can’t order another context to go do something. [laughter] So maybe if you have a set of microservices that all, say, belong to flight planning, or ordering, making payments, in that case definitely. You can tell a service within that boundary to go do something, because you’re also looking for failure, and reacting to it. This is also where I feel like if you’re trying to cross from your area of business capability into another area of business capability, then you communicate using events.

Yeah… But I kind of want to argue with that.

Because - I mean, clearly, there are lots of SaaS services, right?

Yeah, true.

…that have a command-based API. I don’t know, like say Twilio. I tell it to send an SMS message. I mean, I can’t invoke Twilio using an event, right? So I kind of reject your notion that you can’t tell a bounded context to do something. I mean, you have to be prepared for it to fail…

…but I can still tell you to do something.

[laughs] Yes, yes, you absolutely can.

You know, “Bring me a cup of coffee, please.” And you can go “No, I don’t want to drive for seven hours”, right? But other than that, I guess I can go “Oh, my coffee is finished.” Right? That’s an event.

I look at it as a heuristic more so than as a rule, right? So in most cases - of course, it depends on the domain that you’re working with… Does it make sense for this orchestrator to go send a command. So it’s sort of this general heuristic I try to use. Of course, if there are cases where you absolutely have to, then sure. But this is something that I’ve used, and has helped me in the past. So again, I think it depends on the context of your problem. And for maybe a simpler scenario, like what you said, Chris - yeah, you don’t have to be so dogmatic. But it’s just a heuristic. Use it if it meets your needs.

Oh, this is gonna be fun… [laughter] Well, here’s a really interesting thing… Because I think you’re arguing that in a complex case, choreography might work better. And I actually think I would argue the exact opposite. So here’s an example. So if you think about this, if you actually dissect my customer and order example, so that you think about a more – I mean, it’s a trivial thing, but you think about a more complete example, where there are numerous events that the order service can publish, that must then cause the customer service to either reserve credit, or update credit, or release credit for a given customer. So what that actually means is the customer service has to be aware of all of the various lifecycle events that can occur within the order service… Which is kind of weird, right? Why do customers need to know about orders, or order lifecycle events?

And then if you contrast that with an orchestration-based approach, the customer service is merely told to reserve or release credit, and it doesn’t know why. The orchestrator knows why, because it’s implementing the operations like create order, edit order, order canceled, order shipped, order paid, so on and so forth. So you could argue that there’s less knowledge – that there’s more knowledge in the orchestrator, but there’s less knowledge in the customer service. They don’t have to know about anything, they just have to provide an API for managing credit… Which is actually kind of why when I publish an event in my system, I can’t actually expect Twilio to know that it has to send an SMS message.

Maybe one way to think about it is that you want one – like, in certain situations you want one system or one service to understand the full context of what’s going on, so that you can put your business logic there. So you can define your business process. Indu, you also mentioned this concept of some sort of like business process executor… Or what did you call it?

[29:52] Process manager… It depends on what your service is doing. Chris, you touched on “This service only does X. It doesn’t know about anything else.” In this case the saga is the one that’s directing and responding to all of the actions. But if you had services – I think about services as a set of services that help implement a business capability for the user, right? So in that case, it’s not a simple thing. There’s a lot of rules, and validations… And so this sort of context would be the context that’s responsible for all those rules. So when an event arrives - in the choreography model it’s just the interesting fact that something happened in the other context. But that might drive a whole set of rules and processes in this context. And so I just don’t like the interconnections or this context having to know state, or additional details… So that’s the stuff that I’m struggling with. How much of information should both these contexts know about each other?

Well, I think the number one thing to remember in software, design or architecture is “It depends.” And it’s almost like this sort of knowledge – there’s certain sort of complexities, or who knows about what, that must exist in a system for it to function. And then you have multiple ways of sort of designing that. And then it ends up being trade-offs, right? Like, should the customer domain know about orders? And you could say “Well, there might be certain advantages to that.” But then there’s downsides. Or do you centralize it? Which has some benefits, but then it may actually have a whole bunch of downsides. So it really does depend, and you kind of have to make these decisions on a sort of case by case.

So the way I think about it, which is - to go back to Angelica’s original question, which is how would I do this… So if we’re looking at implementing a command, if you just narrow it down to – you need to implement a command like create order or cancel order that updates things in multiple services, that needs to be a saga. So then you have to design the saga in sort of an abstract sense, like it’s a series of linear steps. And some of those steps have compensating transactions, which are invoked if a subsequent step fails and the compensating transaction undoes what was done previously… Which is just one of the complexities of sagas. So you end up with a series of steps, some of which have compensating transactions, and then the next step level down is “Okay, do I use orchestration, or do I use choreography?” So you end up with two candidate designs, and it’s like, which of those two designs has the best characteristics in terms of ease of understanding? …which I think is a huge differentiator between orchestration and choreography. And then you have to look at the particular kind of design time coupling there; who knows about what. And then just figure out which of the options is the least worst one.

[33:48] Yeah. And when we get to a certain scale… Like, I don’t know if you guys have noticed this, but as you add third-party dependencies to your system, you are forced to take a very methodical approach to interacting with them, to mock them out, to testing your interactions with them… And you never expect that third-party dependency to do anything different for you. I mean, you can ask for certain features, but you’re never going to say “I sent you an event. Go figure out what to do with it.” So at a certain scale, I think that internal systems benefit from being decoupled as if they are a third party. So being able to completely assume that another system is a black box that you can’t interact with gives you the ability to kind of cut out all of that complexity from your own system. And if you can do that internally, effectively, then it gives you theoretically infinite scalability.

Yeah… I mean, more and more I’m thinking that most software is exploitive, and anything of any complexity ends up being a giant mess, sooner or later. And I feel like one of the key reasons for that is that not enough effort is made to ensure that the system is comprised of easily understood, loosely-coupled parts.

Yeah. And the loosely-coupled part here is key. So that’s sort of what I’m saying - if you could take a part of your system and assume that it’s a third-party company that’s never going to respond to any of your requests for anything… It’s never going to call; you’re going to call it, and it’s like a totally independent system. It really changes the way that you think about building systems.

Yeah… I was thinking, like you said, Chris, there’s trade-offs. And also how you design these things. You can use the orchestrator, you can use the choreography, you can do both. So let’s say that in the process manager – so you’ve shipped off this thing, right? So the process manager [unintelligible 00:36:14.14] events, and you’re good. Now, let’s say business has a new requirement that says “Hey, we want to track our customers who are interested in this product, or customers in this area, that we ship to.” Something, an interesting requirement that comes in. Now, do you go and modify your existing process manager? You could. You could go and add logic after the thing got shipped, go do this other thing. Or in your choreography model, you have an event that says “This got shipped.” So you could have a listener that listens to that event and does that extra bit of logic, or keeping track of customers, or something that was interesting for marketing. And that is completely autonomous. And I think this is where, for me, the power of choreography comes in, because now you have these – you can write your services that sort of align with your business needs. So in one model you would go and change your process manager, making changes. In the other, you just introduced a new service that just consumes this event, does its thing; you don’t have to touch your process manager… So I think there’s a balance, right? There’s a balance in - yeah, your business requirements, your needs, and where does choreography fit better, where does process manager fit better… And just because you have process manager doesn’t mean you can’t use choreography. So it’s all about trade-offs. Yeah.

[38:08] Yeah. You raise a really interesting point. In a sense, what you described is the application of the open/closed principle, which is you want a service or a system to be open to extension and closed to modification. And events are actually a good way of doing that. So maybe the way I view it is perhaps even if the core logic of creating an order or implementing some other operation is best implemented using orchestration, for unanticipated needs, or sort of super-loosely – just other things that you might want to bolt on afterwards, that are not part of the core responsibilities, then publishing events as well… So the order – so even if you had orchestration, the order service and customer service can publish events. And then to do other things that are unrelated to creating an order, or managing orders, say, other services could just listen to those events and do whatever they want. And the order service doesn’t care, and the customer service doesn’t care either.

Yeah, that’s a great point. Actually, at Anchorage Digital we use orchestration a lot, and there have been some cases recently where I thought “Hey, I wish that I just had an event for this. It would be so much easier to just add a little extra thing here that reacts to another thing.”

Yeah. So it’s not choreography, right? The operations can be implemented using orchestration, but to just provide the “hooks” for other interested parties to observe what’s happening - yeah, you can just publish events as well.
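
As a rough illustration of the "hooks" idea discussed above, here is a minimal Go sketch. Everything in it is hypothetical and not from the episode: the `Bus` type is an in-process stand-in for a real message broker, and `OrderShipped`, `ShipOrder` and `CustomersByRegion` are made-up names. The point is only that the new marketing requirement is added as a listener, while the core shipping flow stays untouched.

```go
package main

import "fmt"

// OrderShipped is a hypothetical domain event.
type OrderShipped struct {
	OrderID    string
	CustomerID string
	Region     string
}

// Bus is a minimal in-process event bus (a stand-in for a real broker).
type Bus struct {
	handlers []func(OrderShipped)
}

// Subscribe registers a listener; the publisher never learns who is listening.
func (b *Bus) Subscribe(h func(OrderShipped)) {
	b.handlers = append(b.handlers, h)
}

// Publish delivers the event to every registered listener.
func (b *Bus) Publish(e OrderShipped) {
	for _, h := range b.handlers {
		h(e)
	}
}

// ShipOrder stands in for the core flow, which could just as well be driven
// by an orchestrator. It publishes the event after doing its own work.
func ShipOrder(b *Bus, orderID, customerID, region string) {
	// ... core shipping logic would run here ...
	b.Publish(OrderShipped{OrderID: orderID, CustomerID: customerID, Region: region})
}

// CustomersByRegion is the later "marketing" requirement, added as a listener.
type CustomersByRegion map[string][]string

// OnOrderShipped records which customers received shipments in each region.
func (c CustomersByRegion) OnOrderShipped(e OrderShipped) {
	c[e.Region] = append(c[e.Region], e.CustomerID)
}

func main() {
	bus := &Bus{}
	interest := CustomersByRegion{}
	bus.Subscribe(interest.OnOrderShipped) // new behavior; ShipOrder is unchanged

	ShipOrder(bus, "order-1", "customer-7", "EU")
	fmt.Println(interest["EU"])
}
```

With a real broker the listener would live in its own service, but the shape is the same: the publisher's code does not change when a new consumer appears.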

So to kind of bring us a little bit into the weeds, because I do want to get us a little bit more in the weeds, as we’ve talked through the high level, like what are the pros, the cons, orchestration, choreography, etc., how do you bring this kind of overarching “What do we do? How do we implement this?” from kind of the drawing board, whether it be a Miro, an architectural diagram, to actual code, technology? How do you decide – open source packages might be useful, a language, technology… How do you bring the conversation from lines and squiggles diagram [unintelligible 00:40:40.05] implementation to actual “This is now a thing in the wild”? I don’t know whether maybe there are technologies that you think work really well, there are ways to think about implementation, gotchas… Just kind of opening the floor to whatever comes to mind.

I can start with the most obvious thing… If you are already using a cloud vendor, then they probably have a lot of these tools. So for example, we use Google Cloud. And Google Cloud has a pub/sub system; if you want to do things with events, it’s a great choice. It supports many different paradigms. You can process them sequentially, synchronously, asynchronously, you can have [unintelligible 00:41:23.01] have all these different things from a single hosted service. They even have a workflow system. It allows you to define steps in YAML, and it triggers your services, and so on. So it’s always a good idea to check “Okay, what’s already available? What already exists?” And maybe in your context, you have on-prem deployment, and somebody is already running Kafka… Okay, great. Latch onto that existing system and start building around it. It’s usually much more expensive to try to bring in a new tool if you already have some.

[42:07] Yeah. I mean there are frameworks, technologies that are out there. [unintelligible 00:42:12.21] using Temporal, and I have my own open source framework, Eventuate. And part of it does depend on exactly, like you say, if you’re just using choreography-based sagas, you want to have your services publish events when they update business objects. I mean, you don’t necessarily need much of a framework, apart from one caveat that I’ll get to in a minute; you can just publish events, you could just pick your favorite message broker and publish events to it. But one really interesting thing is you want to make sure that your database updates and message sending are done atomically… Because either you create an order and publish an order created event, do those two things, or you do neither of those things. And if you did one or the other, your system would be in an inconsistent state. And I suspect a lot of applications are kind of susceptible to this vulnerability, where some kind of failure occurs and prevents both of those things from happening.

So one really useful pattern that you should, if you’re doing this by hand, implement at a minimum is the transactional outbox pattern, which is where rather than sending a message to the message broker directly, you have an outbox table in the database that is updated as part of the database transaction that creates or updates the business entity. So you would insert into the order table, and then you would insert into the outbox table. And that because of data – well, assuming you’re using a relational database, that happens atomically. So you’ve got that guarantee.

And then there is a separate process, the outbox server that is pulling messages out of the outbox and sending them to the message broker. And that’s kind of like a key sort of foundational pattern to make sure that your asynchronous architecture is actually sort of resilient. And without that, you risk inconsistencies.
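
The outbox flow described above can be sketched in Go roughly as follows. This is a sketch under loud assumptions, not a definitive implementation: the `DB` and `Tx` types are an in-memory stand-in for a relational database transaction, and `CreateOrder`, `Relay` and the `send` callback are hypothetical names. In a real service the two inserts would be SQL statements inside one database transaction, and the relay would run as its own process.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Message is a row in the hypothetical outbox table.
type Message struct {
	ID      int
	Topic   string
	Payload string
}

// DB is an in-memory stand-in for a relational database with two tables.
type DB struct {
	mu     sync.Mutex
	orders []string
	outbox []Message
	nextID int
}

// Tx stages writes; nothing is visible until the transaction commits.
type Tx struct {
	orders []string
	outbox []Message
}

func (t *Tx) InsertOrder(id string) { t.orders = append(t.orders, id) }

func (t *Tx) InsertOutbox(topic, payload string) {
	t.outbox = append(t.outbox, Message{Topic: topic, Payload: payload})
}

// InTx applies all staged writes if fn succeeds and none of them if it fails.
func (db *DB) InTx(fn func(*Tx) error) error {
	db.mu.Lock()
	defer db.mu.Unlock()
	tx := &Tx{}
	if err := fn(tx); err != nil {
		return err // rollback: staged writes are discarded
	}
	db.orders = append(db.orders, tx.orders...)
	for _, m := range tx.outbox {
		db.nextID++
		m.ID = db.nextID
		db.outbox = append(db.outbox, m)
	}
	return nil
}

// CreateOrder writes the order and its event in one transaction.
func CreateOrder(db *DB, orderID string) error {
	return db.InTx(func(tx *Tx) error {
		if orderID == "" {
			return errors.New("order ID is required")
		}
		tx.InsertOrder(orderID)
		tx.InsertOutbox("orders", "OrderCreated:"+orderID)
		return nil
	})
}

// Relay plays the separate process: it sends pending rows to the broker and
// removes a row only after a successful send, so delivery is at-least-once.
// It holds the lock while sending, which is a simplification for this sketch.
func Relay(db *DB, send func(Message) error) error {
	db.mu.Lock()
	defer db.mu.Unlock()
	for len(db.outbox) > 0 {
		if err := send(db.outbox[0]); err != nil {
			return err // the row stays in the outbox and is retried next run
		}
		db.outbox = db.outbox[1:]
	}
	return nil
}

func main() {
	db := &DB{}
	_ = CreateOrder(db, "order-1")
	_ = Relay(db, func(m Message) error {
		fmt.Println("published", m.Topic, m.Payload)
		return nil
	})
}
```

Because the relay deletes a row only after a successful send, a crash between sending and deleting means the message goes out again, so consumers should be prepared for duplicates.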

Yeah. So when I worked with .NET, writing services, I used NServiceBus. So there’s open source platforms, MassTransit and then NServiceBus etc. that implement the saga patterns; they make it easier to consume messages, and also what Chris talked about, which is really important, the outbox pattern for consistency and data integrity, really. So I have used NServiceBus. I also used to work for Particular Software – I used to be part of the crew implementing these patterns. So that’s really huge. So if you’re not using something out of the box, like NServiceBus or MassTransit, you have to have those things: an outbox pattern, and retries, and handling of transient failures, and things like that. Those are really key.

Yeah. And then if you’re doing orchestration - I mean, essentially, there you’re implementing a state machine that’s keeping track of where in the flow it is, and then sending messages, and responding to replies. That’s much more elaborate, and that’s when you want to use some kind of orchestration framework, whether that’s Eventuate, Temporal, or I guess NServiceBus, you name it, you need to use something.
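
To make the "state machine that keeps track of where in the flow it is" concrete, here is a minimal hand-rolled Go sketch. It is not how Eventuate, Temporal or NServiceBus implement orchestration; the states, the `ReserveCredit`/`ApproveOrder`/`RejectOrder` command names and the service names are assumptions chosen for illustration. A real orchestrator would also persist the saga state and exchange the commands and replies over a message broker.

```go
package main

import "fmt"

// State is where the saga currently is in its flow.
type State string

const (
	ReservingCredit State = "RESERVING_CREDIT"
	ApprovingOrder  State = "APPROVING_ORDER"
	RejectingOrder  State = "REJECTING_ORDER"
	Done            State = "DONE"
)

// Command is a message the orchestrator sends to a participant service.
type Command struct {
	To   string
	Name string
}

// Reply is a participant's answer to the previous command.
type Reply struct {
	Success bool
}

// CreateOrderSaga tracks one order's progress through the flow.
type CreateOrderSaga struct {
	OrderID string
	State   State
}

// Start records the initial state and returns the first command to send.
func (s *CreateOrderSaga) Start() Command {
	s.State = ReservingCredit
	return Command{To: "customer-service", Name: "ReserveCredit"}
}

// Handle advances the state machine when a reply arrives. It returns the next
// command to send, or ok=false when there is nothing more to send.
func (s *CreateOrderSaga) Handle(r Reply) (cmd Command, ok bool) {
	switch s.State {
	case ReservingCredit:
		if r.Success {
			s.State = ApprovingOrder
			return Command{To: "order-service", Name: "ApproveOrder"}, true
		}
		// Credit was refused, so the pending order is rejected instead.
		s.State = RejectingOrder
		return Command{To: "order-service", Name: "RejectOrder"}, true
	case ApprovingOrder, RejectingOrder:
		s.State = Done
	}
	return Command{}, false
}

func main() {
	saga := &CreateOrderSaga{OrderID: "order-1"}
	fmt.Println(saga.Start().Name)
	next, _ := saga.Handle(Reply{Success: true})
	fmt.Println(next.Name, saga.State)
}
```

Even this toy version shows why a framework helps: persistence, retries, timeouts and correlation of replies to saga instances are all missing here.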

Interestingly, there are other sort of foundational patterns as well, like the message broker might deliver messages multiple times, and so you need to implement the idempotent consumer pattern. So there’s a whole bunch of stuff that below the level –
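
A minimal Go sketch of the idempotent consumer pattern mentioned here, assuming every message carries a unique ID assigned by the publisher. The `IdempotentConsumer` type and its in-memory `processed` set are hypothetical; in a real service the set of processed IDs would typically live in the database and be updated in the same transaction as the handler's own writes.

```go
package main

import "fmt"

// Message carries a unique ID assigned by the publisher (an assumption).
type Message struct {
	ID      string
	Payload string
}

// IdempotentConsumer remembers which message IDs it has already handled.
type IdempotentConsumer struct {
	processed map[string]bool
	handle    func(Message) error
}

// NewIdempotentConsumer wraps a handler so duplicates are skipped.
func NewIdempotentConsumer(handle func(Message) error) *IdempotentConsumer {
	return &IdempotentConsumer{processed: map[string]bool{}, handle: handle}
}

// Consume runs the handler at most once per successfully handled message ID.
// A failed attempt is not recorded, so a redelivery can retry it.
func (c *IdempotentConsumer) Consume(m Message) error {
	if c.processed[m.ID] {
		return nil // duplicate delivery: acknowledge and skip
	}
	if err := c.handle(m); err != nil {
		return err
	}
	c.processed[m.ID] = true
	return nil
}

func main() {
	credited := 0
	c := NewIdempotentConsumer(func(m Message) error {
		credited++
		return nil
	})
	msg := Message{ID: "msg-1", Payload: "CreditAccount"}
	_ = c.Consume(msg)
	_ = c.Consume(msg) // the broker redelivers the same message
	fmt.Println("handler ran", credited, "time(s)")
}
```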

[46:13] And messages…

What’s that?

Messages arriving out of order…

Well, yeah, depending on your needs… That’s a really interesting point, because I think for me a really useful design technique is rather than go from sort of “Oh, we’re going to use messaging” to “We’re going to use Kafka” or a particular message broker, I like to use the pattern language from Enterprise Integration Patterns by Gregor Hohpe, where you have a concept of a message channel that’s an abstraction over whatever messaging capabilities the particular message broker… So you kind of design it so that your services are communicating using channels. And then you identify the requirements for a given channel, which might be like latency, throughput, delivery guarantees, whether it needs to support ordering, and so on and so forth. And then you map that to - you know, for each channel you then go and pick the messaging technology that best fits the requirements. And that could be something that runs on the cloud, though the cloud-based ones are kind of weird, like SQS, and stuff. Super-scalable, fully-managed, but maybe high latency. And then there’s Kafka, which has a particular set of guarantees and characteristics… And then there’s lower-latency mechanisms… Heck, you could even use Redis and its in-memory messaging channels, if you want to, or streams. So there’s a whole bunch of options, and different channels could use different kinds of messaging technologies.

And given the amount of choices that there are, what is the kind of, I guess, trade-off risk difficulty associated with changing your implementation? Say you implement your system and you realize “Oh no, I’ve implemented this using orchestration. Actually, choreography is the better choice.” Is it kind of a “Think really carefully about which you choose at the outset”? Or is there flexibility to switch, or to change out? I really want to understand the trade-offs there. How big a deal is deciding which path to take at the beginning of architecting a system?

So I’m gonna throw in another library here called Watermill. I’ve noticed that – I haven’t actually used it very much myself, but I’ve noticed that it, and many others probably, provide an abstraction around events. So you can write your code essentially just thinking about the events, and plug into any system that allows you to deliver events, even in-memory, or a variety of different options.

Oh, what was the name of that?

It’s called Watermill.

Oh. I assume it’s a Golang library.

It’s a Go library. So I’m bringing it to Go a little bit here… If you have something like that, then it does allow you to kind of like switch out implementations. That being said, like you said, Chris, different implementations have different trade-offs: latency, throughput, all that. I think at a low scale it really doesn’t matter which one you choose, and eventually you will realize that maybe you made the right or wrong decision, and then you might have to change that up.

[49:41] Yeah, I think the right or wrong - I think that’s an interesting way to look at it. I think more the businesses, companies realize software is evolving. It should evolve. And you make a decision - it might have been the best decision based on the data at the time and the trade-offs that you made… But businesses grow, and change, and market conditions might affect it… And so once the technical folks, product managers realize that, I think companies should be open to changing that, evolving that. To me, it’s a continuous process. So if we ultimately want to design software that aligns with the business needs, it means constantly changing.

You might have designed your event a certain way, but then maybe in the domain it’s being called differently by the domain expert. So now you know that language. What do you do? So you go and change the schema. Yes, there’s a cost attached to it, but you make that change.

So I think when organizations realize how important this is, and change, and evolution becomes a part of the culture, I think we wouldn’t have to worry too much about making it all right at the very beginning. You make the best decisions based on the trade-offs, and you continuously try to improve and evolve.

Yeah. And actually, I think what’s more likely to change than your sort of like queue messaging system, or queue system, or whatever is the actual structure of your events, the actual business process, how systems interact. That changes much more often than switching out infrastructure. And so that kind of evolution requires a whole other set of patterns that you have to think about. Do you publish multiple events during that transition? Which service do you deploy first or second? When is it safe to make a particular change? And that could add a lot of complexity.
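
One possible answer to "do you publish multiple events during that transition?" is sketched below in Go. It assumes a versioned JSON envelope, which is a common convention but not something prescribed in the episode; the `Envelope` type, the two `OrderShipped` shapes and the field rename are all hypothetical. The publisher emits both versions while consumers are upgraded one at a time, and an upgraded consumer ignores the old version so the pair is processed only once.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope wraps every event with a type name and a schema version.
type Envelope struct {
	Type    string          `json:"type"`
	Version int             `json:"version"`
	Data    json.RawMessage `json:"data"`
}

// OrderShippedV1 is the old shape of the event.
type OrderShippedV1 struct {
	OrderID string `json:"order_id"`
	Dest    string `json:"dest"`
}

// OrderShippedV2 renames a field to match the domain experts' vocabulary.
type OrderShippedV2 struct {
	OrderID         string `json:"order_id"`
	DeliveryAddress string `json:"delivery_address"`
}

// wrap serializes a payload inside a versioned envelope.
func wrap(version int, v any) ([]byte, error) {
	data, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	return json.Marshal(Envelope{Type: "OrderShipped", Version: version, Data: data})
}

// PublishDuringTransition emits both versions so that old and new consumers
// keep working while they are upgraded one at a time.
func PublishDuringTransition(orderID, address string, publish func([]byte)) error {
	v1, err := wrap(1, OrderShippedV1{OrderID: orderID, Dest: address})
	if err != nil {
		return err
	}
	v2, err := wrap(2, OrderShippedV2{OrderID: orderID, DeliveryAddress: address})
	if err != nil {
		return err
	}
	publish(v1)
	publish(v2)
	return nil
}

// UpgradedConsumer handles version 2 and skips everything else, so the
// duplicated pair is processed only once.
func UpgradedConsumer(raw []byte) (string, bool, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return "", false, err
	}
	if env.Type != "OrderShipped" || env.Version != 2 {
		return "", false, nil
	}
	var e OrderShippedV2
	if err := json.Unmarshal(env.Data, &e); err != nil {
		return "", false, err
	}
	return e.DeliveryAddress, true, nil
}

func main() {
	_ = PublishDuringTransition("order-1", "12 Main St", func(raw []byte) {
		addr, handled, _ := UpgradedConsumer(raw)
		fmt.Println(handled, addr)
	})
}
```

Once every consumer understands version 2, the publisher can stop emitting version 1. That ordering of deployments is exactly the kind of question raised above.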

With a framework or a system that’s more opinionated, it might be easier, because the framework or system or workflow engine defines how to make those changes. So if you’re using some YAML-based workflow engine, it will essentially preserve the old version of your workflow, but let those workflows finish executing, and then the next iteration, the next time it runs it will pick up the new version of your workflow.

We’ve been using Temporal – well, let me give a quick description of Temporal. Temporal allows you to essentially write an orchestrator using your language of choice, executing each step of it. They call it an activity. Essentially, calling out to different services to run the different activities. So it keeps track of the state of your workflow. And it has its own versioning system, so you can upgrade your workflow incrementally. The new version of the workflow will pick up the change, the old versions will continue to execute, and so on. So as long as you have a strategy, as long as your team has an understanding of how to do the upgrades, how to make these changes, it works out. You get in trouble if you haven’t thought about the upgrades, how do you actually change these systems. Because it’s very easy to break them if you don’t think about it.
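
The general idea that in-flight workflows finish on the version they started with, while new runs pick up the latest version, can be sketched in Go as below. This is a toy model, not Temporal's API or mechanism and not any YAML-based engine's behavior in detail: the `Engine`, `Definition` and `Run` types are hypothetical and exist only to show version pinning.

```go
package main

import (
	"errors"
	"fmt"
)

// Definition is one version of a workflow: an ordered list of step names.
type Definition []string

// Engine keeps every deployed version so that in-flight runs can finish on
// the version they started with.
type Engine struct {
	versions []Definition
}

// Deploy registers a new version and returns its number (starting at 1).
func (e *Engine) Deploy(d Definition) int {
	e.versions = append(e.versions, d)
	return len(e.versions)
}

// Run is one execution, pinned to a version when it starts.
type Run struct {
	engine  *Engine
	Version int
	next    int
}

// Start begins a run on the latest deployed version.
func (e *Engine) Start() (*Run, error) {
	if len(e.versions) == 0 {
		return nil, errors.New("no workflow version deployed")
	}
	return &Run{engine: e, Version: len(e.versions)}, nil
}

// Step executes the next step of the pinned version; ok is false when done.
func (r *Run) Step() (name string, ok bool) {
	def := r.engine.versions[r.Version-1]
	if r.next >= len(def) {
		return "", false
	}
	name = def[r.next]
	r.next++
	return name, true
}

func main() {
	e := &Engine{}
	e.Deploy(Definition{"reserve", "ship"})
	old, _ := e.Start()
	old.Step() // the old run executes "reserve" on version 1

	// A new version is deployed while the old run is still in flight.
	e.Deploy(Definition{"reserve", "notify", "ship"})
	fresh, _ := e.Start()

	name, _ := old.Step()
	fmt.Println("old run, version", old.Version, "continues with:", name)
	fresh.Step()
	name, _ = fresh.Step()
	fmt.Println("new run, version", fresh.Version, "continues with:", name)
}
```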

So regrettably, we do not have all that much time left, so what I’d like to do is go around and see if any of you have any final thoughts, final seeds, final things that any listeners should think about or look into. We may have to do another episode to continue this conversation, but before that, final thoughts. Indu, what is your final thought?

I think event-driven style of architecture is one way of looking at your problem domain… And it’s not a dogmatic approach; use it where it makes sense and look for the trade-offs.

[53:49] I would say one way or another your backend systems will end up having more than one service. Sooner or later, it happens. And sooner or later, you have multiple storage systems, multiple third parties, all kinds of things to synchronize and coordinate. And it’s important to take into account all of your options when you start doing this. So all the things we’ve talked about, like choreography, orchestration, events, commands - it’s important to consider the options before diving in and starting to do something. It’s very common for people to just sort of pattern-match based on something they’ve seen before, or they know one system and they just use it for everything. But it’s worthwhile to learn all the different ways of building systems, so that you can use the most appropriate thing.

Yeah, I sort of agree with Viktor. The answer to all of these design and architectural problems is that it depends. And the trouble with that is that it requires thinking. And recently I read this really good paper, which is like “Architecture is a series of design decisions.” And kind of just the idea that, okay, you’ve got a problem to solve; you want to clearly define the problem, you want to figure out the criteria that define the goodness of a solution, and then you want to think about possible solutions. You want to evaluate them with respect to each of those criteria, and then you want to pick the best one. And you make a decision, and that results in a modification to your architecture. And you just keep doing that over and over again. And that’s how you build an architecture, that’s how you evolve an architecture.

And so at every stage, you’re actually thinking about what the criteria are, evaluating the various options, and then picking the best or the least worst one. And while at the same time remembering that you want to have systems that are as simple as possible, as loosely-coupled as possible, and that have a high degree of abstraction, hiding high complexity behind stable APIs. And it’s hard, but it’s sort of necessary.

Well, thank you very much. I appreciate your all’s time. And before I let you go, we’re gonna jump into what is arguably my favorite part of the episode, which is unpopular popular opinions.

So, Viktor, what is your unpopular opinion?

Yeah, let’s apply some design patterns to food… My opinion is that burritos are better than lobster. So essentially, a burrito is a container of your entire meal. It’s packaged in a very convenient way, very portable, easy to eat, you know where to start, no work required. You have to unwrap it sometimes. And lobster is kind of on the other end of the extreme, where it’s arguably not food, it’s mostly a shell, and it takes a lot of effort to get anything out of it. And even then, that flavor doesn’t really come from the food, it comes from all the other things around it. So it’s really a bit of a missed opportunity there for having real food.

You know, it’s funny that you mentioned burritos, because I think two or three years ago I gave a talk about design time coupling in software. And the example I used was a food delivery application, where they had to enhance it to support ordering customized menu items. And I used the example of a burrito. And that was the hardest presentation to work on, because every time I looked at the image of the burrito in the slides, I was instantly hungry, and I had to go – and I wanted to just order a burrito. And you have just done that to me now. I want to get a breakfast burrito. [laughter]

[58:19] But when you look at a lobster, does that make you hungry?

No, but a burrito does.

I think you’re hitting on a popular unpopular opinion there, Viktor…

I’m trying to make it popular, as it is the convention with this show. It becomes popular by the time I finish explaining it. But I think that if you ask a lot of people, they will tell you that yes, lobster is expensive, and it’s good because it’s expensive, and it’s good because it takes so much effort to eat.

I think when you ate the lobster, Viktor, you forgot the butter… [laughter]

Maybe my unpopular opinion – because I didn’t realize this had to be humorous.

No, it can be unpopular funny or not.

I just want to say that I really think coffee has two ingredients in it: ground beans and water. [laughter].

No milk?

No sugar?!

Well, unless it’s like a cortado… But no sugar, really. No pumpkin. That’s just wrong. [laughter]

For those of you who didn’t hear, we were talking earlier during soundcheck and I talked about how I had a pumpkin spice latte. And the look of disdain and disgust, along with a sound effect of pure just like “UGH!” from Chris that I got was illustrative to me that I should never mention pumpkin spice lattes in his presence again.

And sort of the croissant part, too. I think croissants are all–

Oh yes, I did have a peach croissant.

…a work of art. And a good croissant is so rare, and you should just eat it with butter and maybe jam.

Angelica, next time I meet Chris I’m going to not tell him what’s in his coffee, but he might –

Pumpkin spice latte.

Do it. [laughter]

Just for you. Just for you.

Thank you. I appreciate that. Well, Chris, would you like to have another unpopular opinion? Or is that your unpopular opinion?

No, and I apologize for re-yucking your yum… Because everyone has their own tastes, and I hate it when someone criticizes mine, so I’m being a hypocrite.

I don’t mind. If you don’t drink them, there’s more in the world for me.

True, true.

That’s great. So Indu, do you want to give us your unpopular opinion?

Yeah, I’m gonna not have coffee jokes, or lobster jokes… But I think my unpopular opinion is as techies, we love technology, we love solving, knowing what the solution is before we actually take the time to explore the problem better. And as we work in complex domains, I think that’s more of a skill that we all need to hone. So yeah, I think we love technology more than actually trying to figure out what we’re trying to solve for.

[1:01:14] Yeah, and maybe this is my unpopular opinion which builds on Indu’s thing, actually. This is a thought that occurred to me the other day, is right now there’s a fixation on developer platforms. And interestingly, it’s become this thing with the Team Topologies book, which - to me, the focus of Team Topologies is about people. But there’s just – I know they have platform teams, and that’s definitely a good thing, but I just keep seeing platforms, platforms, platforms. And maybe one of the motivations for that is like “Well, it’s platforms. That’s technology. It’s nice, tangible stuff. And we’re engineers, and we like to deal with technology.” And then at the same time, you can be a vendor, and you love – you need things to sell. So platforms - that’s a thing you can sell. But I sometimes feel that in some organizations people are going to do platform engineering as a substitute for actually solving messy human problems. Like, having the right organizational structure, having the right development process, having actually something that is truly agile, not just sort of fake agile, right? You know, autonomous teams, and so on. So I feel like there’s some really negative reasons behind this fixation on platforms.

I mean, computers do exactly what you tell them. People don’t do that, unfortunately.

Yeah, people are messy. But software is built by people, and you have to get the people parts right, the right organizational structure, process, and so on, if you want to actually deliver good software. And it doesn’t matter how much technology you throw at the problem. If those people problems are not solved, it’s just going to be a mess.

Yeah, that’s very unfortunate when you start solving a problem and you realize that it’s not a technical problem, and you have to get into “Okay, how are people talking to each other? You guys want to have a meeting about this? Okay, let’s do a talk. Let’s do a training.” It’s a lot of effort. It’s more effort than a few lines of code.

Well, on that very intriguing thought on the importance of interacting with your colleagues, which I guess everyone does… [laughs] We unfortunately are going to have to leave it there for now. But thank you so much for all coming on. Really appreciate your being such intriguing, inquisitive guests, and I hope to see you all again soon.

This was great. Thank you.

Thank you. Thank you for having me here.

Thanks for hosting.

Our pleasure.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
