Distributed systems are hard. Building a distributed messaging system for these systems to communicate is even harder. In this episode, we unpack some of the challenges of building distributed messaging systems (like NATS), including how Go makes that easy and/or hard as applicable.
Derek Collison: I think that’s a great question… And like I said, I took a stance for probably almost 20 years of “Oh no, just use whatever you’re gonna use.” I really wouldn’t even engage with people, because it would be a two-hour conversation. I would still probably lose the argument at the end, so to speak.
What’s happening now is that things are kind of coming back to it. So just to level-set - one of the things that I care deeply about with messaging systems is a couple things… And it’s not about messaging, especially not about the message broker from the ’90s type stuff; we need to not think of it like that. We need to think of it as a ubiquitous technology, kind of like network elements.
I was around when there was a single cable, and if you ever kicked the terminator cap off the end, it took the whole network down… And I saw the birth of hubs, and switches, and smart switches, and top of rack, and all the crazy network elements that we have… And just as well, a modern system needs to be doing that similarly. But to start out with - the first thing it does is it says “I am gonna do addressing and discovery not based on an IP and port, but on something else.” You can call it a channel, a topic, a subject - I really don’t care what you call it.
Now, again, for a while, people were like “Why? This doesn’t make any sense.” But in today’s cloud-native world, I would argue what everyone is doing today makes no sense. We struggled so hard to change servers from pets to cattle, and yet we’re still saying “Oh, I wanna talk to Johnny’s service, so I need to know the notion of an IP and port.” Now, I know what the audience is probably thinking, and I’ll walk through an example of what this really looks like in practice.
The other thing too is that when messaging systems do that abstraction, a lot of people call it pub/sub. And we still call it pub/sub, but again, we’ve gone away from that, because it’s kind of got a bad rep. But what I mean by pub/sub is that the technology can understand multiple messaging patterns - one-to-one, one-to-N, M-to-N, and then one-to-one-of-N, meaning I can automatically say “Hey, I wanna send a message and only one of you in this set will receive it.” That’s kind of what NATS does at the very basic levels…
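To make those patterns concrete, here is a minimal sketch using the nats.go Go client - the subject and queue group names are invented for illustration, and it assumes a NATS server is running locally on the default port:

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Addressing is by subject, not by any particular host's IP and port.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	// One-to-N: every plain subscriber on a subject receives a copy.
	nc.Subscribe("orders.created", func(m *nats.Msg) {
		fmt.Printf("fan-out copy: %s\n", m.Data)
	})

	// One-to-one-of-N: subscribers sharing a queue group name split the
	// traffic; each message goes to exactly one member of the group.
	nc.QueueSubscribe("orders.created", "billing", func(m *nats.Msg) {
		fmt.Printf("one of N got: %s\n", m.Data)
	})

	nc.Publish("orders.created", []byte("order #42"))
	nc.Flush()                         // push the publish out to the server
	time.Sleep(100 * time.Millisecond) // give the async handlers a moment
}
```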
And folks always ask, and they go “Oh, I just don’t need any of that stuff”, and I say “Okay, a couple things… One, decoupling is good.” Pets versus cattle was legit; let’s make sure our connected technologies follow the same path, and don’t say “No, no, this one’s special”, whether it’s an individual server, or a load balancer, or a special host… It just makes no sense in a modern world. So we push down on those things and we say “Okay, got it.”
[16:18] The last piece of advice I always give people, from the ’90s on, is “Never assume what a message…” - and a message could be a request, it could be data events, all kinds of stuff. But “never assume what the message is gonna be used for tomorrow.” So everyone kind of looks at me and says “What does that mean?” and I say “Okay, I’ll give you a really simple example.” What we see with NATS these days - with a very fortunate, growing user base, and all kinds of crazy interest from customers and clients - is that modern architectures, distributed architectures, are built using connected patterns, and there’s really only two. There’s lots of nuance to it, but there’s two. It’s either a stream or it’s a service.
A service is “I ask a question, I get an answer”, and a stream is “I’m just sending a piece of data, and then I don’t care.” And to level-set, distributed systems, even up to a couple of years ago, were dominated - 98%+ of everything was a service interaction. I’m not saying it has to be synchronous, but everything was “I’m asking a question and getting an answer.” HTTP is “I ask a question and I get an answer” type stuff. So I said “Fine, on day one you know who’s gonna be answering that question.” So you code it up so that I’m gonna send a question to Mat, and Mat’s gonna respond back with an answer. And you’re doing it on your laptop, and you use HTTP, or gRPC, or whatever, and you’re like “That’s all I need.” I go “Great.”
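In NATS terms, that service interaction is just a request on a subject and a single reply. Here is a minimal sketch with the nats.go client, where the “greet” subject and the payloads are made up for illustration:

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	// The responder ("Mat") answers questions on a subject; callers never
	// need its IP and port, a DNS entry, or a load balancer in front of it.
	nc.Subscribe("greet", func(m *nats.Msg) {
		m.Respond([]byte("Hello, " + string(m.Data)))
	})

	// The requester asks the question and waits for one answer.
	reply, err := nc.Request("greet", []byte("Derek"), 2*time.Second)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(string(reply.Data))
}
```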
Let’s not even get to the point of anyone else being interested in the message. Just to start with, “Okay, let’s go to production.” Well, we need more than one Mat. Oh, crap. Now we need a load balancer. Well, now we need to put stuff in, and do DNS… And that’s fine; production can handle that. I don’t have to worry about that. So then they do health checks, and then they have to figure out rerouting, and all this stuff… And all these big companies have playbooks on exactly how they do it, and they all look very similar, but they’re all slightly different.
Now let’s say someone says “Hey, for compliance we need to be able to watch all requests, and we need to record them.” Now all of a sudden you’re gonna have to put in a logger, and you don’t want the logger in line with a request-response, which is a massive anti-pattern I see being proliferated these days. It’s like “Oh, no… Put a messaging system in between it, and store it on disk, and try to retry, and stuff like that”, in line with the microservice, in line with the service interaction. I’m hyped up on Red Bull, but that’s the dumbest thing I’ve ever heard. It’s like Google saying “We’re gonna write all the log entries before we return your search results.” It’s just foolish; no one would ever do that. But there’s a need to say “Hey, I need to be able to write these things down, and someone else is gonna look for anomaly detection, or do any type of policy enforcement, whatever it is.”
So look at that system when you’re getting to the end of the day, and now let’s say we actually wanna spread it out from East Coast to West Coast, to Europe, and you need Anycast, and DNS, and all kinds of crazy stuff and coordination of state on the backend. These systems become really complicated, and they’re trying to get around the fact that everything is just point-to-point; it is naturally a one-to-one conversation. Whereas with a messaging system, you write it, you run it in that server, let’s say, or whatever - but the NATS server is extremely lightweight. It can run on a Raspberry Pi, in your top-of-rack switch; you can run it in your home router. You can plug and play and put these things together in any arbitrarily complex topology that spans the globe… And that’s another discussion, on distributed systems that aren’t really distributed.
So you set it up on your laptop and you run the NATS server. Its Docker image has been downloaded 150 million times… It just runs. For example, a subsystem of GE doing nuclear stuff runs our server for two years at a time, with no monitoring, no anything. And when they come in to change things inside of that nuclear reactor type stuff, they shut it down and figure out if they wanna upgrade it or not.
[19:59] So it’s lightweight, it’s always there, it’s ubiquitous, it just works. So now all of a sudden you write the same program, you’re sending a request, getting a response, doing it on your laptop… I would argue you’ll take the same amount of time, or possibly less, but it’s on the same level. You do have to run the server.
But now when you go to the next level of production - I need more Mats. Well, just run more Mats - I mean, Mats, not NATS. “Oh, well do I have to put in a deployment framework like Kubernetes, and a service mesh?” No. I don’t care how you run them. Run them on bare metal, in a VM, in a container, in Kubernetes… It does not matter; run them anywhere. The system automatically reacts.
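A plausible sketch of that “run more Mats” step with the nats.go client: each copy of the responder joins the same queue group (the “greet” and “greeters” names are illustrative), and NATS delivers each request to exactly one member, wherever the copies run:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Start this binary once or fifty times - bare metal, VM, container,
	// Kubernetes - and each request lands on exactly one group member.
	nc.QueueSubscribe("greet", "greeters", func(m *nats.Msg) {
		m.Respond([]byte("Hello, " + string(m.Data)))
	})

	select {} // keep serving until the process is stopped
}
```

There is no load balancer, DNS entry, or health-check playbook implied here; membership in the queue group is the discovery mechanism.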
By the way, you haven’t configured anything on a NATS server yet, ever. Now, all of a sudden you’re like “Okay, well what happens if I wanna do compliance and I need to watch all the requests coming in?” Just start a listener for it. It’s got all of those patterns built in, so there’s nothing changing between me, who’s asking the request, and Mat, who’s giving the response; we don’t have to change. As a matter of fact, we don’t even have to be brought down and restarted; we’re just running along, and people can come up and bring up anomaly detection, logging… All kinds of stuff.
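As a hedged sketch of “just start a listener”: an audit process can subscribe alongside the running services - the “>” full wildcard below observes every subject visible to this connection - without the requester or the responder changing or restarting:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// A plain subscriber observes the traffic without sitting in the
	// request path; compliance logging, anomaly detection, and similar
	// consumers can come and go independently of the services themselves.
	nc.Subscribe(">", func(m *nats.Msg) {
		log.Printf("audit: subject=%s payload=%s", m.Subject, m.Data)
	})

	select {}
}
```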
So as you keep making these systems more production-ready and complex, you realize that messaging gives a huge win over the other approaches. Now, the other approaches are well known. People know how to put up load balancers, and know how to do logging, and know how to do all this stuff… But when you see something running on a service mesh and you haven’t even sent a single request yet and you’re spending $6,000 a month… And I can show you, we can do 60,000 requests a second, with all of the server-side latency tracking, all transparently, doing it for real, and it runs on a Raspberry Pi… That also translates to OpEx savings, which is a big deal.
NATS has always been known for how fast it is, but most people tell me “But we don’t need to go that fast. We don’t need to have a server doing 80 to 100 million messages a second.” And I go “I know”, but if you think about it, if you take that same thing for your workload and put it in the cloud, you can save 80% on your OpEx budget.
So do people need messaging systems to build stuff? Of course not, because everything, for the most part, is built on essentially HTTP… Which, again, is an interesting one to me - unpopular opinion… But I know why we did it that way, and we don’t have a reason to do it that way anymore, yet we’re stuck with it, right?
The notion of client-server or request/response in the old days was that the requester and the responder were usually inside the same network firewall… Not firewall specifically, but essentially we’re inside the company. And everyone started to say “Hey, I want the requesters, whatever those things are, to be able to walk outside of the corporate network.” So all of a sudden people started doing this, and they go “We can’t get through the firewall”, and the firewall people, or the old DB people - they go “No.” It doesn’t matter what you ask them, they say no.
So people kind of went “Wait a minute… Port 80 is always open. We can piggyback off that and circumvent this whole thing. So we can just do request-response on HTTP or HTTPS, and it works.” And it’s true, and I remember doing some of those tricks myself. We’re not in that world anymore; it makes no sense whatsoever to build a modern distributed system on technology that existed for something totally different and was a workaround for security and firewall systems.
Break: [23:23]