Distributed systems are hard. Building a distributed messaging system for these systems to communicate is even harder. In this episode, we unpack some of the challenges of building distributed messaging systems (like NATS), including how Go makes that easy and/or hard as applicable.
Derek Collison: I think that's a great question… And like I said, I took a stance for probably almost 20 years of "Oh no, just use whatever you're gonna use." I really wouldn't even engage with people, because it would be a two-hour conversation. I would still probably lose the argument at the end, so to speak.
What's happening now is that things are kind of coming back to it. So just to level-set - one of the things that I care deeply about with messaging systems is a couple things… And it's not about messaging, especially not about the message broker from the '90s type stuff; we need to not think of it like that. We need to think of it as a ubiquitous technology, kind of like network elements.
I was around when there was a single cable, and if you ever kicked the terminator cap off the end, it took the whole network down… And I saw the birth of hubs, and switches, and smart switches, and top of rack, and all the crazy network elements that we have… And just as well, a modern messaging system needs to evolve similarly. But to start out with - the first thing it does is it says "I am gonna do addressing and discovery not based on an IP and port, but on something else." You can call it a channel, a topic, a subject - I really don't care what you call it.
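To make that concrete, here's a minimal sketch of subject-based addressing using the official Go client, github.com/nats-io/nats.go. The subject name "greetings" and the local default-port server are assumptions for illustration, not details from the episode:

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a local server on the default port (nats://127.0.0.1:4222).
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// Discovery is by subject, not by address: anyone interested in
	// "greetings" just subscribes to it.
	nc.Subscribe("greetings", func(m *nats.Msg) {
		fmt.Printf("received: %s\n", string(m.Data))
	})

	// The publisher addresses the subject, not a host: no IP, no port,
	// no load balancer, no DNS entry for the other side.
	nc.Publish("greetings", []byte("hello"))
	nc.Flush()

	time.Sleep(100 * time.Millisecond) // let the async handler fire before exiting
}
```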
Now, again, for a while, people were like "Why? This doesn't make any sense." But in today's cloud-native world, I would argue what everyone is doing today makes no sense. We struggled so hard to change servers from pets to cattle, and yet we're still saying "Oh, I wanna talk to Johnny's service, so I need to know an IP and port." Now, I know what the audience is probably thinking, and I'll walk through an example of what this really looks like in practice.
The other thing too is that when messaging systems do that abstraction, a lot of people call it pub/sub. And we still call it pub/sub, but again, we've gone away from that, because it's kind of got a bad rep. But what I mean by pub/sub is that the technology can understand multiple messaging patterns - one to one, one to N, M to N, and then one to one of N, meaning I can automatically say "Hey, I wanna send a message and only one of you in this set will receive it." That's kind of what NATS does at the very basic levels…
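That last pattern - one to one of N - maps to NATS queue groups. Here's a sketch under assumed names (the subject "jobs" and the queue "workers" are made up for the example):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Three subscribers share the queue group "workers", so each message
	// on "jobs" is delivered to exactly one of them (one to one of N).
	// A plain Subscribe on "jobs" would give every subscriber its own
	// copy instead (one to N).
	for i := 0; i < 3; i++ {
		id := i
		nc.QueueSubscribe("jobs", "workers", func(m *nats.Msg) {
			log.Printf("worker %d got %s", id, string(m.Data))
		})
	}

	for n := 0; n < 9; n++ {
		nc.Publish("jobs", []byte(fmt.Sprintf("job-%d", n)))
	}
	nc.Flush()

	time.Sleep(100 * time.Millisecond) // let handlers drain before exit
}
```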
And folks always ask, they go "Oh, I just don't need any of that stuff", and I say "Okay, a couple things… One, decoupling is good." Pets versus cattle was legit; let's make sure our connected technologies follow the same path, and don't say "No, no, this one's special", whether it's an individual server, or a load balancer, or a special host… It just makes no sense in a modern world. So we push down on those things and we say "Okay, got it."
[16:18] The last piece of advice I always give people from the '90s is "Never assume what a message…" - and a message could be a request, it could be data events, all kinds of stuff. But "never assume what the message is gonna be used for tomorrow." So everyone kind of looks at me and says "What does that mean?" and I say "Okay, I'll give you a really simple example." When we talk about NATS these days - with a very fortunate, growing user base, and all kinds of crazy interest from customers and clients - what we see is that modern architectures, distributed architectures are built using connected patterns, and there's really only two. There's lots of nuances to it, but there's two. It's either a stream or it's a service.
The service is "I ask a question, I get an answer", and a stream is "I'm just sending a piece of data, and then I don't care." And to level-set, distributed systems, even up to a couple of years ago, were dominated - 98%+ of everything was a service interaction. I'm not saying it has to be synchronous, but everything was "I'm asking a question and getting an answer." HTTP is "I ask a question and I get an answer" type stuff. So I said "Fine, on day one you know who's gonna be answering that question." So you code it up so that I'm gonna send a question to Mat, and Mat's gonna respond back with an answer. And you're doing it on your laptop, and you use HTTP, or gRPC, or whatever, and you're like "That's all I need." I go "Great."
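In NATS terms, that day-one service looks roughly like the sketch below. The subject names are assumptions; the responder plays Mat's role and joins a queue group so that scaling later is just running more copies:

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// The responder. Joining the queue group "mats" means each question
	// is answered by exactly one member, however many copies are running.
	nc.QueueSubscribe("questions.mat", "mats", func(m *nats.Msg) {
		m.Respond([]byte("42"))
	})

	// The service interaction: ask a question, wait up to 2s for an answer.
	reply, err := nc.Request("questions.mat", []byte("what is the answer?"), 2*time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Println("got:", string(reply.Data))

	// A stream, by contrast, is fire-and-forget: publish and move on.
	nc.Publish("events.answered", []byte("mat answered a question"))
	nc.Flush()
}
```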
Let's not even get to the point of anyone else being interested in the message. Just to start with, "Okay, let's go to production." Well, we need more than one Mat. Oh, crap. Now we need a load balancer. Well, now we need to put stuff in and do DNS… And that's fine; production can handle that. I don't have to worry about that. So then they do health checks, and then they have to figure out rerouting, and all this stuff… And all these big companies have playbooks on exactly how they do it, and they all look very similar, but they're all slightly different.
Now let's say someone says "Hey, for compliance we need to be able to watch all requests and we need to record them." Now all of a sudden you're gonna have to put in a logger, and you don't want the logger in the path of the request-response - which is a massive anti-pattern I see being proliferated these days. It's like "Oh, no… Put a messaging system in between it, and store it on disk, and try to retry, and stuff like that, in line with the microservice, in line with the service interaction." I'm hyped up on Red Bull, but that's the dumbest thing I've ever heard. It's like Google saying "We're gonna write all the log entries before we return your search results." It's just foolish; no one would ever do that. But there's a need to say "Hey, I need to be able to write these things down, and someone else is gonna look for anomaly detection, or any type of policy enforcement, whatever it is."
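One way to sketch the non-foolish version, continuing the assumed subject names from above: respond first, then publish the audit record as a fire-and-forget event for whoever wants to consume it later.

```go
package main

import (
	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	nc.Subscribe("questions.mat", func(m *nats.Msg) {
		// Respond first: the caller never waits on logging.
		m.Respond([]byte("42"))
		// Then emit the audit record. Publish is fire-and-forget, so a
		// slow (or absent) logger adds zero latency to the request path.
		nc.Publish("audit.questions", m.Data)
	})

	select {} // serve until killed
}
```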
So look at that system when you're getting to the end of the day, and now let's say we actually wanna spread it out from East Coast to West Coast, to Europe, and you need Anycast, and DNS, and all kinds of crazy stuff and coordination of state on the backend. These systems become really complicated, and they're trying to get around the fact that everything is just point-to-point; it is naturally a one-to-one conversation. Whereas with a messaging system, you write it, you run it in that server, let's say, or whatever - but the NATS server is extremely lightweight. It can run on a Raspberry Pi, or in your top-of-rack switch; you can run it in your home router; you can plug and play and put these things together in any arbitrarily complex topology that spans the globe… And that's another discussion, on distributed systems that aren't really distributed.
So you set it up on your laptop and you run the NATS server. Its Docker image has been downloaded 150 million times… It just runs. For example, a subsystem of GE doing nuclear stuff runs our server for two years at a time, with no monitoring, no anything. And when they come in to change things inside of that nuclear reactor type stuff, they shut it down and figure out if they wanna upgrade it or not.
[19:59] So it's lightweight, it's always there, it's ubiquitous, it just works. So now all of a sudden you write the same program - you're sending a request, getting a response, doing it on your laptop… I would argue you'll take the same amount of time, or possibly less, but it's on the same level. You do have to run the server.
But now when you go to the next level of production - I need more Mats. Well, just run more Mats - I mean, Mats, not NATS. "Oh, well do I have to put in a deployment framework like Kubernetes and a service mesh?" No. I don't care how you run them. Run them on bare metal, in a VM, in a container, in Kubernetes… It does not matter; run them anywhere. The system automatically reacts.
By the way, you haven't configured anything on a NATS server yet, ever. Now, all of a sudden you're like "Okay, well what happens if I wanna do compliance and I need to watch all the requests coming in?" Just start a listener for it. It's got all of those patterns built in, so there's nothing changing between me who's asking the request and Mat who's giving the response; we don't have to change. As a matter of fact, we don't even have to be brought down and restarted; we're just running along, and people can come up and bring up anomaly detection, logging… all kinds of stuff.
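Such a listener can be sketched as a plain wildcard subscriber: it gets its own copy of every request while the requester and responder keep running untouched. The "questions.>" subject layout is an assumption carried over from the earlier sketches:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// "questions.>" matches every subject under "questions.". This
	// subscriber receives a copy of each request, in addition to the
	// queue-group member that actually answers it - no change, no
	// restart, no redeploy for the requester or the responder.
	nc.Subscribe("questions.>", func(m *nats.Msg) {
		log.Printf("audit: %s %q", m.Subject, string(m.Data))
	})

	select {} // run alongside the live system
}
```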
So as you keep making these systems more production-ready and complex, you realize that messaging gives a huge win over the other approaches. Now, the other approaches are kind of known - people know how to put up load balancers, and know how to do logging, and know how to do all this stuff… But when you see something running on a service mesh and you haven't even sent a single request yet and you're spending $6,000 a month… And I can show you - we can do 60,000 requests a second, with all of the server's latency tracking, all transparently, doing it for real, and it runs on a Raspberry Pi… That also translates to OpEx savings, which is a big deal.
NATS has always been known for how fast it is, but most people tell me "But we don't need to go that fast. We don't need to have a server doing 80 to 100 million messages a second." And I go "I know", but if you think about it, if you take that same efficiency, apply it to your workload and put it in the cloud, you can save 80% on your OpEx budget.
So do people need messaging systems to build stuff? Of course not, because everything for the most part is built on essentially HTTP - which, again, is an interesting one to me; unpopular opinion… But I know why we did it that way, and we don't have a reason to do it that way anymore, yet we're stuck with it, right?
The notion of client-server or request/response in the old days was that the requester and the responder were usually inside the same network firewall… Not firewall specifically, but essentially we're inside the company. And everyone started to say "Hey, I want the requesters, whatever those things are, to be able to walk outside of the corporate network." So all of a sudden people started doing this, and they go "We can't get through the firewall", and the firewall people, or the old DB people - they go "No." It doesn't matter what you ask them, they say no.
So people kind of went "Wait a minute… Port 80 is always open. We can piggyback off that and circumvent this whole thing. So we can just do request-response on HTTP or HTTPS, and it works." And it's true, and I remember doing some of those tricks myself. We're not in that world anymore; it makes no sense whatsoever to build a modern distributed system off of technology that existed for something totally different and was a workaround for security and firewall systems.
Break: [23:23]