Distributed systems are hard. Building a distributed messaging system for these systems to communicate is even harder. In this episode, we unpack some of the challenges of building distributed messaging systems (like NATS), including how Go makes that easy and/or hard as applicable.
DigitalOcean – DigitalOcean’s developer cloud makes it simple to launch in the cloud and scale up as you grow. They have an intuitive control panel, predictable pricing, team accounts, worldwide availability with a 99.99% uptime SLA, and 24/7/365 world-class support to back that up. Get your $100 credit at do.co/changelog.
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com.
Hello, and welcome back to Go Time. To those of you tuning in week after week to support us, thank you. And thanks to those of you joining us for the first time. Be sure to go back and listen to our previous episodes; lots of goodies in there.
Did you know you can listen to this show live, as it’s recorded, every Tuesday, 3 PM Eastern? Yup, that’s right. Head on over to GoTime.fm for details. So a shout-out to our live listeners currently, and on the Go Time channel on Gopher Slack. Your participation surely makes for a better show.
I’m Johnny Boursiquot, and joining me today are Jon Calhoun and Mat Ryer, whom you’ve come to know and hopefully love, as regular hosts on the show. Jon, Mat, do you wish to be loved?
It’d be nice.
Are you still working on that, Mat?
Here’s the key - you have kids, and they’ll love you unconditionally. At least for a little while. When they get older, I can’t guarantee anything.
Yeah, it is challenging, as somebody who has a teenager, and some little ones… I can definitely see the drift happening. [laughter] In real time… Yeah, it’s a surreal experience. So our special guest today, not to be forgotten, is Derek Collison.
Glad to be here.
Good stuff. So you are best-known for your work on NATS, a well-liked distributed messaging system written in Go. So yeah, definitely good to have you here. The topic for today’s show is going to be the challenges of building distributed messaging systems, and we’re hoping that you will be able to shed a lot of light on the topic for us.
Before we start, I must say that when I was first getting into Go, and really starting to sort of say “Okay, you know what - Go seems like the next best thing for me, for my career, the next technology I should jump into, the next thing I should learn”, I went to the first GopherCon back in 2014… And yours was probably the first or second talk of GopherCon. You blew me away with that talk. You were talking about NATS, and how Go was a good fit for these kinds of high-performance systems… And from then on I was hooked. I was like “Okay, this is really cool stuff.”
[04:12] You were showing all these little benchmarks, and everything else… I’m like “Okay, there is a lot to this.” There’s a lot to the language itself, and the technology you were working on was demonstrating that… Because we know NATS - and you’ll probably get into this - was originally written in Ruby, and you switched to Go for a number of reasons, performance probably being one of the top ones… But basically this showed the power of the language and the kinds of things you could do with it… So why don’t you introduce us to NATS, and then we’ll get into – this is not a show purely about NATS, but it’s certainly a vehicle for proving out some of the concepts and things you’ve learned… But why don’t you introduce us to the concept of distributed messaging? What is it, why do you want it, why should I as a developer care about it?
Well, it’s a great question… I’ve been doing this now for over 25 years. My career started with being thrown into distributed systems by accident. When I was coming out of university, we were still in the scale-up era, not scale-out; scale-out was actually kind of a bad word… And it happened that my first job out of school was at the Applied Physics Lab of Johns Hopkins University on the East Coast - near where you are, Johnny, and where I was originally… And by happenstance I got selected by the second-best physicists at the lab, not the first.
The way it worked back then was that the top physicists got almost all of the supercomputer time… So I was working on advanced visualization for large datasets, and that had been my passion in school. My bosses came to me and said “Hey, we don’t have a lot of supercomputer time, but we have these 12 Sun SPARC pizza boxes - make those go as fast as the Connection Machine.” I was like “What?!” So I went down this path of trying to wire these boxes together, and figure out how they could coordinate, and break the work up… All of the things that I think a lot of listeners today probably take for granted; it’s just the way we do things. But back then, it was like “Oh, really? I just can’t get a faster computer?”
And then when I went to California in late ‘91, I ran into the exact same problem at a healthcare startup. We had so many doctors wanting to watch the federal trial data I was generating that it was crashing the server non-stop… So I had to figure out how to scale that out, because I had no budget to buy a faster machine. And the little NeXT box, even with its 68040 chip - if anyone remembers those - just couldn’t handle it.
So eventually, I think I got smart and said “Oh, maybe I’m supposed to be doing something like this”, instead of the stuff I thought I was gonna do. So I joined a startup called Teknekron Software Systems, which became TIBCO. They were trying to revolutionize the way finance worked on Wall Street, specifically around the notion that the existing pattern was starting to break down. What I mean by that - and it’s a very long answer to your short question - is that the way stock distribution worked on Wall Street back then, before TIBCO came along, was like a telephone call. If I was gonna update you on the stock of IBM, I would dial you up, tell you what it was, hang up; then I would go to the next person, dial them up…
And when the number of updates was low and the number of people who cared was low - and by the way, this was people looking at terminals, not machines yet - it wasn’t that big of a deal. But they already got the sense of “Uh-oh, this is gonna be a problem when Derek is the last one at the end of the line, and it takes three seconds to update everyone. His data is now older than someone else’s.” So whoever is at the front of the line has an advantage, right?
[07:51] So the way we pitched this to Wall Street wasn’t low-level technology, and unicast, multicast, Pub/Sub type stuff, but we simply said “We wanna change the paradigm from a telephone call to a radio broadcast.” Everyone just tunes in on the station and they all get the updates at the same time. So Pub/Sub was kind of born out of that driving problem space, that opportunity… It’s existed for a long time, and finance verticals, other verticals were very into it. Of course, then we had multicast, and multicast was supposed to solve everyone’s problems, but it didn’t… And what’s happening now is that people are starting to come back to the tech, but not necessarily the old ways of thinking about it, but to solve newer problems in modern distributed systems in cloud-native stuff.
Messaging is simply - think of it as a connecting technology, glue for distributed systems. There’s lots of those today, and you usually have things that are very specific - I’m talking to a database (MySQL or Postgres, or Redis as a KV, or an object store) - and then there’s some that are generic. What’s interesting, though, is that some are generic but they’ve actually layered on top of point-to-point communication, which defeats the whole purpose of looking at a technology that allows multiple patterns.
So for me, most people are using the absolute wrong technology to connect their distributed systems. They’re using one-to-one, point-to-point request/reply. And they build a whole bunch of stuff around it to get around the fact that that’s not a good pattern for building distributed systems. So messaging systems start coming into the picture again. NATS itself - I built it, and I couldn’t have cared less if anyone used it; it was solving a problem when I was architecting Cloud Foundry at VMware, which became Pivotal, and is now back at VMware… It was kind of a stab for me at trying to develop an enterprise platform-as-a-service when Heroku and Google App Engine were all the rage… And I’ve just been building systems like that for decades; that’s just how I build them, with messaging systems. But I realize that most people don’t. They’re like HTTP, or today gRPC, or maybe you see Kafka, or Pulsar-type stuff.
I just build them always like that, and a lot of the systems I built in the late ‘90s are still running, 25 years later, in very mission-critical situations… So usually, I don’t get yelled at or kicked out of a room if I walk back in and it’s still running there… But I realize that most people didn’t care or wanna worry about that. Scale out was still fairly new, it was just starting to take hold, but these days I think people are coming back to it on their own. They’re saying “This just doesn’t feel right”, you know what I mean? So that’s kind of where the history of why messaging systems I think popped up, how NATS came to be… But again, we never planned on anyone to ever use it, to be honest with you, when we first built it.
So it’s a nice surprise that now you’re at – what, version 2? It went GA a little while ago?
Yeah, I don’t know how many releases we’ve done… Obviously, what you were talking about, Johnny, with GopherCon - by the way, that was great; I got to speak right after Rob Pike, and they asked me specifically to not just point out the things that I liked, but also the things that I was struggling with.
If you go back and look at that talk, it’s me talking about a lot of the performance struggles that I had early on… But to be honest with you, it is one of the best bets we ever made. We started on it when Go was at 0.56, and it actually wasn’t performance, believe it or not. The reason I wanted to move away from Ruby, which is a language I still like, was that deploying production systems with the dependency management piece had become so painful for us - trying to do Cloud Foundry and VMforce, and later some other initiatives - that I said “I need something that does not do that.”
So the only priorities I had were producing static binaries, and - believe it or not, very early on with Go - the fact that it had real stacks… Which means if you’re a programmer and you know what you’re doing, you have the ability to relieve the pressure on the GC… Because Go’s GC was very primitive at the beginning. But it didn’t matter; I saw it right away. I was like “I can use a stack, for real.” Work that I had done at TIBCO took me three months in C to figure out how to transparently transition from a stack to a heap pointer automatically… We got that for free in Go, right? Stick it on the stack; if it outgrows it, you just auto-promote it. So those were the only two things that made up my mind, and the rest is history.
[12:34] I probably have more familiarity with messaging systems than perhaps Mat or Jon on this call, so I’m trying to leave a little bit of room for them to jump in and ask questions… So we’ve gotten the use case for why you went down that road and created this technology - this fan-out approach, where you say “Broadcast this, and allow those that are interested to pick up on that broadcast.” That way everybody has the same timing in terms of what they get notified of, and when.
From a developer standpoint, I hear that, and I’m like “Okay, well that’s a way of decoupling. That’s a way of basically saying “Hey, I don’t have to have this component in my architecture be so tightly bound to this other one that if something happens to it - maybe it goes down, or maybe it’s running in an impaired state of some kind - I’m no longer in control of my own environment (destiny, if you will).” So how much does messaging, or how much should messaging play a role, especially in this sort of [unintelligible 00:13:36.13] nanoservice world, with a whole bunch of distributed pieces running everywhere? How much of a role do you think that should be playing when you’re considering an architecture like this?
I think that’s a great question… And like I said, I took a stance for probably almost 20 years, of “Oh no, just use whatever you’re gonna use.” I really wouldn’t even engage with people, because it would be a two-hour conversation. I would still probably lose the argument at the end, so to speak.
What’s happening now is that things are kind of coming back to it. So just to level-set - one of the things that I care deeply about with messaging systems is a couple things… And it’s not about messaging, especially not about the message broker from the ’90s type stuff; we need to not think of it like that. We need to think of it as a ubiquitous technology, kind of like network elements.
I was around when there was a single cable, and if you ever kicked the terminator cap off the end, it took the whole network down… And I saw the birth of hubs, and switches, and smart switches, and top-of-rack, and all the crazy network elements that we have… And a modern system needs to be doing something similar. But to start out with - the first thing it does is say “I am gonna do addressing and discovery based not on an IP and port, but on something else.” You can call it a channel, a topic, a subject - I really don’t care what you call it.
Now, again, for a while people were like “Why? This doesn’t make any sense.” But in today’s cloud-native world, I would argue what everyone is doing today makes no sense. We struggled so hard to change servers from pets to cattle, and yet we’re still saying “Oh, I wanna talk to Johnny’s service, so I need to know an IP and port.” Now, I know what the audience is probably thinking, and I’ll walk through an example of what this really looks like in practice.
The other thing is that when messaging systems do that abstraction, a lot of people call it pub/sub. And we still call it pub/sub - though we’ve moved away from the term, because it’s got a bit of a bad rep. But what I mean by pub/sub is that the technology can understand multiple messaging patterns - one-to-one, one-to-N, M-to-N, and then one-to-one-of-N, meaning I can say “Hey, I wanna send a message and only one of you in this set will receive it.” That’s what NATS does at the very basic level…
And folks always ask and they go “Oh, I just don’t need any of that stuff”, and I say “Okay, a couple things… One, decoupling is good.” Pets versus cattle was legit; let’s make sure our connected technologies follow the same path, and don’t say “No, no, this one’s special”, whether it’s an individual server, or a load balancer, or a special host… It just makes no sense in a modern world. So we push down on those things and we say “Okay, got it.”
[16:18] The last piece of advice I always give people, from the ‘90s, is “Never assume what a message…” - and a message could be a request, it could be data, events, all kinds of stuff - “…never assume what the message is gonna be used for tomorrow.” Everyone kind of looks at me and says “What does that mean?” and I say “Okay, I’ll give you a really simple example.” What we talk about with NATS these days - with a very fortunate, growing user base, and all kinds of crazy interest from customers and clients - is that modern distributed architectures are built using connected patterns, and there are really only two. There are lots of nuances to it, but there are two: it’s either a stream or it’s a service.
The service is: I ask a question, I get an answer. A stream is: I’m just sending a piece of data, and then I don’t care. And to level-set - distributed systems, even up to a couple of years ago, were dominated by service interactions; 98%+ of everything was a service interaction. I’m not saying it has to be synchronous, but everything was “I’m asking a question and getting an answer.” HTTP is “I ask a question and I get an answer” type stuff. So I say “Fine, on day one you know who’s gonna be answering that question.” So you code it up so that I’m gonna send a question to Mat, and Mat’s gonna respond back with an answer. And you’re doing it on your laptop, and you use HTTP, or gRPC, or whatever, and you’re like “That’s all I need.” I go “Great.”
Let’s not even get to the point of anyone else being interested in the message. Just to start with, “Okay, let’s go to production.” Well, we need more than one Mat. Oh, crap. Now we need a load balancer. Well, now we need to put stuff in, and do DNS… And that’s fine; production can handle that. I don’t have to worry about that. So then they do health checks, and then they have to figure out rerouting, and all this stuff… And all these big companies have playbooks on exactly how they do it, and they all look very similar, but they’re all slightly different.
Now let’s say someone says “Hey, for compliance we need to be able to watch all requests, and we need to record them.” Now all of a sudden you’re gonna have to put in a logger, and you don’t want the logger in the path of the request/response - which is a massive anti-pattern I see being proliferated these days. It’s like “Oh no, put a messaging system in between, and store it on disk, and try to retry, and stuff like that”, in line with the microservice, in line with the service interaction. I’m hyped up on Red Bull, but that’s the dumbest thing I’ve ever heard. It’s like Google saying “We’re gonna write all the log entries before we return your search results.” It’s just foolish; no one would ever do that. But there’s a need to say “Hey, I need to be able to write these things down, and someone else is gonna look at them for anomaly detection, or any type of policy enforcement, whatever it is.”
So look at that system when you’re getting to the end of the day, and now let’s say we actually wanna spread it out from the East Coast to the West Coast, to Europe, and you need Anycast, and DNS, and all kinds of crazy stuff, and coordination of state on the backend. These systems become really complicated, and they’re all trying to get around the fact that everything is just point-to-point; it is naturally a one-to-one conversation. Whereas with a messaging system, you write it, you run it against that server, let’s say - and the NATS server is extremely lightweight. It can run on a Raspberry Pi, in your top-of-rack switch, in your home router; you can plug and play and put these things together in any arbitrarily complex topology that spans the globe… And that’s another discussion, on distributed systems that aren’t really distributed.
So you set it up on your laptop and you run the NATS server. Its Docker image has been downloaded 150 million times… It just runs. For example, a subsystem of GE doing nuclear stuff runs our server for two years at a time, with no monitoring, no anything. And when they come in to change things inside of that nuclear reactor type stuff, they shut it down and figure out if they wanna upgrade it or not.
[19:59] So it’s lightweight, it’s always there, it’s ubiquitous, it just works. So now all of a sudden you write the same program, you’re sending a request, getting a response, doing it on your laptop… I would argue you’ll take the same amount of time, or possibly less, but it’s on the same level. You do have to run the server.
But now when you go to the next level of production - I need more Mats. Well, just run more Mats – I mean, Mats, not NATS. “Oh, well do I have to put in a deployment framework like Kubernetes, and a service mesh?” No. I don’t care how you run them. Run them on bare metal, in a VM, in a container, in Kubernetes… It does not matter, run them anywhere. The system automatically reacts.
By the way, you haven’t configured anything on a NATS server yet - ever. Now, all of a sudden you’re like “Okay, well what happens if I wanna do compliance and I need to watch all the requests coming in?” Just start a listener for it. It’s got all of those patterns built in, so there’s nothing changing between me, who’s sending the request, and Mat, who’s giving the response; neither of us has to change. As a matter of fact, we don’t even have to be brought down and restarted; we’re just running along, and people can come up and bring up anomaly detection, logging… All kinds of stuff.
So as you keep making these systems more production-ready and complex, you realize that messaging gives a huge win over the alternatives. Now, the alternatives are known quantities. People know how to put up load balancers, and know how to do logging, and all this stuff… But when you see something running on a service mesh and you haven’t even sent a single request yet and you’re spending $6,000 a month… And I can show you - we can do 60,000 requests a second, with all of the servers doing latency tracking, all transparently, doing it for real, and it runs on a Raspberry Pi… That also translates to OpEx savings, which is a big deal.
NATS has always been known for how fast it is, but most people tell me “But we don’t need to go that fast. We don’t need to have a server doing 80 to 100 million messages a second.” And I go “I know”, but if you think about it, if you take that same thing for your workload and put it in the cloud, you can save 80% on your OpEx budget.
So do people need messaging systems to build stuff? Of course not, because everything for the most part is built on essentially HTTP - which, again, is an interesting one to me; unpopular opinion… But I know why we did it that way, and we don’t have a reason to do it that way anymore, yet we’re stuck with it, right?
The notion of client-server or request/response in the old days was the requester and the responder were usually inside the same network firewall… Not firewall specifically, but essentially we’re inside the company. And everyone started to say “Hey, I want the requesters, whatever those things are, to be able to walk outside of the corporate network.” So all of a sudden people started doing this, they go “We can’t get through the firewall”, and the firewall people, or the old DB people, they go “No.” It doesn’t matter what you ask them, they say no.
So people kind of went “Wait a minute… Port 80 is always open. We can piggyback off that and circumvent this whole thing. So we can just do request/response on HTTP or HTTPS, and it works.” And it’s true - I remember doing some of those tricks myself. But we’re not in that world anymore; it makes no sense whatsoever to build a modern distributed system on technology that existed for something totally different and was a workaround for security and firewall systems.
I wanna ask you some questions from a slightly different perspective… So a lot of the things I build are very small, and when you talk about, for instance, logging something before you send a response, I am guilty of (in my early days) building a service that literally logged every request to a SQL database before responding… And now it seems ridiculous, but at the time the request load was so low that it really didn’t actually matter that much. It was just a simple solution to fix it, and we could quickly browse requests and actually track things, and it made support way easier at the time.
So if I’m somebody like that, where I’m starting off in a smaller project, or working on my own, or doing something like that, where do you see people get into messaging systems? What problems do you commonly see them tackle, that they would want to look at it? And what are your suggestions for ways to get introduced to them? Because obviously, a lot of people aren’t gonna jump into these really complex scenarios.
Yeah, that’s totally fair… The biggest one we’ve seen early on with people starting to get interested is simple load balancing. Even though a request might be handled by a single instance, they wanted more than one; they were immediately thrown into utilizing load balancers from a cloud provider, or setting one up themselves… From there, if they’re aware of NATS at all - or another messaging system, but NATS especially, in terms of being just drop-dead simple, with no configuration - they get load balancing for any number of groups that they want to create.
So you can have production, dev, tests… The system essentially dynamically responds to the fact that you say “Hey, I’m interested in this request, and I wanna be a part of this group. I wanna be part of the Go Time group”, and the system just automatically responds. So immediately, they have less moving pieces and less OpEx time in terms of time spent trying to make sure the system is up, monitoring all the other pieces…
So we’ve seen that, but – I mean, that’s totally a fair question. It’s interesting - I’ve shown people how easy it is to do request/response when they’re doing HTTP, and we can do it just as quickly even running a server, because with Docker now it’s just so easy to say “docker run nats”, and then it’s like “Oh, that was it?” and it’s like “Yeah, you have a server now, so you can do anything you want on your platform.”
But you’re right, until people start feeling pain, they’re not gonna be looking. Or they wanna find a solution that they don’t think exists, and if they see that it’s enabled by a different technology, then that’ll draw them to it as well.
Does it have advantages beyond the production side? Does it have advantages for software design itself? Because of course, if you think we’re gonna suddenly treat these things slightly differently, we’re gonna be communicating through this message queue, that has some impact on how you then think about the design of your application.
[27:58] Yes, it can, but again, this is one where I think people still equate messaging systems with a single broker, and queuing, and back from the ‘90s type stuff… NATS is extremely good at routing and framing, and that’s it. But what is interesting now is that NATS is extremely lightweight, so a server can run anywhere. It can run on your phone, or whatever… And you can Lego-brick these together into any topology that you want. So if you think of it that way and you don’t think “Oh, I’ve gotta go through this queue”, where it’s “I’m just routing a request to the appropriate responder and getting the response back”, even just with that simple model, you’ve removed a lot of moving pieces in the transition from dev to production.
The other thing that’s really interesting to us - and again, this is where microservices have driven a lot of this, I think… All of a sudden with microservices it’s like “Wow, I’m doing point-to-point” - which you can argue for or against; I’m against, but I understand why people say it… But what people are interestingly enough struggling with is addressing, discovery, and security. Everyone has their own security. And when someone raises their hand in the org and says “We can create a single org-wide security model” - it’s usually painful; it’s hard to do that, and it affects the developers themselves, and you don’t have that really clean decoupling. NATS tries to preserve that.
The program that I write on day one, that goes into production, literally has one thing that changes, which is “What are my credentials?” So you say “NATS connect” and you give it a URL to a system. We have a global system that runs all over the world - every major cloud provider, every major geo. All you need is a single URL and it just works. We find the closest server, we do all the right stuff for you. But the only thing that changes from dev to prod is “Hey, we need credentials to prove who we are.” And now, all of a sudden, you have a consistent identity, authentication, and authorization system that has no private keys or passwords ever shared with the system itself, and it just works. In other words, that pain point of “Oh crap, now we’ve gotta get secure, and locked down” also goes away.
A lot of times, when you’re playing around, hobbying out, doing stuff, you don’t see that - especially for enterprise, or playing around with services that are usually exposed over the web. Where we have seen people immediately jump on something like this is IoT, and the IoT landscape, where people are like “Crap! We don’t really have anything, and the thing we have doesn’t have a cohesive server backend.” In other words, there’s a lot of decisions. You’ve got an MQTT client on your device, but you don’t know what to do with it.
We haven’t landed it yet, but we’ve committed to it, promised it, and we’ve been coding on it… This notion that “Hey, you can just take those apps and connect them to a NATS server, too.” And this is a global topology, meaning your IoT stuff will work anywhere in the world. You can sell your gadget, your software to anyone, wherever they are in the world, and they’ll get a good experience, so to speak.
That is really cool, isn’t it? If you think about that.
What’s interesting is – I mean, I’ll be frank with you… We see a lot of folks – we’re starting to hit this weird bow wave which is, to be honest with you, making us a little uncomfortable as an ecosystem… Because we were always under the radar, didn’t have a lot of attention, type stuff, which was nicer than we thought… But it’s still good.
But what we’re seeing now is everyone coming to our front door with either of two mindsets. One is “Holy smokes! Kafka is too complex, too costly, too blah-blah-blah. Help me!” The other is this pattern where they say “We’ve got centralized things that can do request/response; we can ask this central thing about certain things… But we have all these remotes. And we want the remotes to be somewhat autonomous. In other words, we want them to be able to communicate amongst themselves, and to seamlessly and securely communicate with a central service…” They might be generating telemetry data - so this notion of streaming, that’s getting collected…
[31:52] And what’s interesting is that these use cases are all over the board from the person walking in the front door saying “Hey, this is the problem we’re trying to solve.” But when you look at it as “You have a centralized thing and you have these remotes” - well, remotes can be anything. It all of a sudden starts lighting up as the same pattern - they just want a consistent communication system, with multiple operators… You can run your own server, I can run my own server, but the system still works. It’s almost like putting a cell tower in your backyard and having great cell service when you’re at the house - but as soon as you leave, it just transparently works. It connects to Verizon, or AT&T, or whatever.
So when we present that to them - whether the endpoint is a Bose headset, or a Peloton bike, or a factory, or a plant, or a telephone pole, or whatever - we’re like “It’s the same pattern.” We have some centralized service that provides services, and it collects data. The remote might actually throw off data as well - telemetry, or sensor-type data… Let’s say it’s an airport: it can operate on its own, and it’s got its own servers; the airport staff monitors those servers and runs them, but the security model is cohesive with all the backend service providers, the airlines, whatever that is.
So those patterns have appeared quite a bit, and not only in people who know all the buzzword bingo of Kubernetes and service mesh and cloud-native. It’s more traditional systems that are trying to do more. They’re trying to expose more data at these remotes, throwing off more data, doing things with it locally, uploading it or centralizing it… And we see that pattern repeat and present itself six, seven, eight, ten times a week, non-stop these days.
Yeah, so it feels like you’ve actually identified a real good abstraction for a lot of this stuff… Because that’s the big problem whenever we try and solve problems like this in a generic way - you build it for one case, and it’s perfect, and it doesn’t quite fit with the next case, but close enough; a few bits of configuration get you through it. And then the third case - it really doesn’t fit at all, but it’s too late…
So yeah, that does sound right. Do you have lots of examples then? Are there lots of examples where NATS in particular is used in IoT?
The IoT stuff is new and up-and-coming. Everyone is looking heavily at it. Right now, most of them are bridging across MQTT directly, usually on-device or on-controller… So plants, factories - we’re seeing a lot of that stuff. And they’re desperately waiting for us to put together the native connectivity, so that they can have one cohesive type of system to manage, not a bunch of silos that they’re trying to glue together.
And what’s needed – there’s one other thing that NATS did about two-and-a-half years ago, that we felt was kind of important, maybe, for certain use cases… And what’s happened is it’s exploded into one of the most targeted things. Well, there’s two of them that we did. One was that we made it truly multi-tenant.
Earlier in the podcast we were talking about distributed systems, and I would argue most distributed systems aren’t distributed if you really wanna stretch them. So pick your favorite open source technology that you consider a distributed system, and tell me how well it does if you have pieces on the West Coast, East Coast, Europe, and Asia-Pacific. Most of those projects will say “Yeah, don’t do that. Just run it in one region.” So it’s a distributed system, but it’s a one-region distributed system. Which is totally fine, by the way, because NATS was the same way about two years ago, before we dipped into “We need this ability to span the whole globe if we want to” type stuff. But we knew it had to be multi-tenanted, and it had to be - I’m dating myself, but I call it Pepsi & Coke secure, meaning they’ll both use cell towers and cell service, but for certain software they’re saying “If Coke is on here, there’s no way we’re gonna use it” type of stuff. So it had to understand that.
[35:52] So we created the notion of accounts. Think of it like a sandbox or container for messaging. All the users in that account can see each other, no matter where they connect in the world, but they cannot by default see anybody else’s account. But we introduced the notion of “Isolated by default, but secure sharing.”
So remember, we’re not talking pub/sub here, we’re talking streams and services. You can say “I wanna export one of these”, and then other people can import it. It’s like a Facebook thing - both of you have to agree that you wanna do it. And you can make it public, but most people say “No, you need permission from me. I will sign off on a token” - which is public/private key cryptography, so no private key is ever in the system, meaning you don’t have to trust the operator of the server to do the right thing to allow them to cross these boundaries.
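As a rough sketch of what this looks like in a NATS 2.0 server config file, accounts can declare exports and imports along these lines (the account names, users, and subjects here are invented for illustration; check the NATS accounts documentation for the exact syntax):

```
accounts {
  ACME: {
    users: [ { user: acme, password: changeme } ]
    # Offer a request/response service to other accounts.
    exports: [ { service: "acme.api.requests" } ]
  }
  PARTNER: {
    users: [ { user: partner, password: changeme } ]
    # Pull ACME's service into this account's own subject space.
    imports: [
      { service: { account: ACME, subject: "acme.api.requests" }, to: "api" }
    ]
  }
}
```

Users in PARTNER are isolated from ACME’s subjects by default; only the explicitly imported service crosses the account boundary, and PARTNER chooses where it appears in its own subject space via `to:`.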
So that was a huge win, and there’s even a smaller example… A little off in the weeds, so we won’t go too far down it, but there’s this notion that people who do know messaging systems and have been using them are like “Wow, we used to have two-week design sessions on the subject space, or the topic space. How many tokens, and what’s where, and who can step on each other’s feet.” And when we did this, everyone realized that accounts were so lightweight that they’re like throw-aways. So organizations are putting a single account on every microservice that they build, where their imports are their dependencies, and their exports are their API.
One of the other things too is it’s your sandbox, it’s your world… So when you import something, no one else can tell you what to do with your subject space. You tell the system where you want it to show up. So I could release a service that just says “Send me a request on ‘request’, and I’ll send you a response”, and you can import it and say “I want it on derek.coolservice.request” or whatever, and the system transparently does that. But the kicker was - and we did this with our own system, and people have started really getting onto this - it’s “Yes, you can put it wherever you want.” So I could export something like - NGS is our global system - ngs.usage.star. And it’s a service, meaning you send me a request - a service interaction - and I send you a response. And I’m expecting ngs.usage.something. Star is a wildcard in our terminology.
So Johnny comes in and says “Yup, I wanna sign up.” I go “Great. Here’s a secure token that you can be allowed to import this.” But what you can import is ngs.usage.johnny, and that’s it. So you cannot send to ngs.usage.derek, or ngs.usage.mat, or whoever. But again, Johnny controls his own sandbox, so what he says is he goes “Great! I’m just gonna stick it at ngs.usage”. So now all of a sudden what happens is that everybody’s experience is “Hey, if you want usage, you just send a request that looks like 1h for one hour, or 24 hours, or whatever, to the same subject”, and you get a response - JSON, all bytes and messages sent and received. But the backend knows that it’s guaranteed to receive messages only from people it’s authorized, and it’s guaranteed that that last token is who you are. So they have a secure context built in.
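The “star” Derek mentions follows NATS subject-matching rules: subjects are dot-separated tokens, `*` matches exactly one token, and `>` matches one or more trailing tokens. Here’s a small Go sketch of that matching logic (illustrative only, not the actual NATS server implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// matchSubject sketches NATS-style subject matching: subjects are
// dot-separated tokens, "*" matches exactly one token, and ">" matches
// one or more trailing tokens.
func matchSubject(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// ">" must be last and match at least one remaining token.
			return i < len(s)
		}
		if i >= len(s) {
			return false
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	return len(p) == len(s)
}

func main() {
	fmt.Println(matchSubject("ngs.usage.*", "ngs.usage.johnny")) // true
	fmt.Println(matchSubject("ngs.usage.*", "ngs.usage.derek.x")) // false: "*" is one token
	fmt.Println(matchSubject("ngs.>", "ngs.usage.johnny"))        // true
}
```

This is why constraining Johnny’s import to `ngs.usage.johnny` works: his token authorizes exactly one concrete subject under the exported `ngs.usage.*`, so the backend can trust the last token as his identity.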
So now you can build a system where that secure, authoritative context - the submitted context of who’s doing the request - is something that you don’t have to think about. You don’t have to build lots and lots of stuff on top to get it. I know that’s a little geeky, but we’ve seen people where the light bulb goes off and they go “Holy smokes!” And then I can deploy these responders anywhere I want.
The other big thing that we did - which again, is subtle and I haven’t seen anyone else do it, but I wanted to talk more about the abstract - as we’ve seen for many years, and I’ve seen this for a lot of my career as cloud and SaaS took off, that it was an or-conversation. You could run your own servers, or you could use the cloud service. So we thought really hard about “Hey, how can we make this an and-conversation?”
What we did was we said “Hey, we have all these different network topologies”, which are just the way servers talk to each other; so they can form small clusters, and then you can put clusters of clusters together into super-clusters, and all kinds of fun stuff… And they all use different topologies. Why don’t we create one that allows you to extend a super-cluster at will, like a hub-and-spoke.
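In NATS this hub-and-spoke extension is what leaf nodes provide: a local server connects out to a cluster or super-cluster and transparently extends it, while keeping its own (possibly different) operator and security model. A minimal sketch of the two sides of the configuration might look like this (the hostname and port are placeholders):

```
# Hub server (e.g. part of a cluster or super-cluster),
# accepting incoming leaf node connections:
leafnodes {
  port: 7422
}

# Spoke server (e.g. on-premise or at the edge),
# dialing out to extend the hub:
leafnodes {
  remotes: [
    { url: "nats-leaf://hub.example.com:7422" }
  ]
}
```

Clients of the spoke server just connect locally; subjects flow across the leaf connection so the spoke behaves like part of the larger system, which is the “run your own server *and* use the utility” model Derek describes.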
[40:06] So when we did that – and by the way, you can mix and match operators and security models, meaning you can use a shared SaaS utility inside of a big Fortune 50 company or a global utility, but you can also run your own servers and have the best of both worlds. Those two things - that account isolation with secure sharing, and then the ability to mix and match a utility model, if you guys have dealt with enterprise companies (and I know you have), we see a lot of this. “Hey, we had a problem and we picked this technology and we did a POC. Then we did four POCs. Then someone raised the hand in one of the meetings and said “Hey, instead of having four silos or six silos, let’s create a utility that everyone can use.” But you either have to use it or you don’t, but we’re gonna mandate that you do it.” And there’s two things that happen.
One is the effort fails spectacularly, because the code is not actually multi-tenant. It doesn’t really understand it. And it can’t stretch. It can’t actually service people on the East Coast, West Coast, Europe etc. Or people go “No, I’m not gonna do that. I’m running my own servers anyway.” We heard that so much. Even in the previous company, Apcera, I was like “Man, if we can give them the ability to have the best of both worlds, that could be interesting.” And we did it because we knew we were gonna do MQTT, so we have to honor their security model - which I believe is still username/password or client certs at the highest level they have… But it would have to mix and match into a global NATS system that might be using our forward private key/public key scenario. Those two things have really resonated with a lot of folks trying to solve these problems. I wish I could have predicted it, but I’m not that smart; I didn’t. But we’re excited about it.
Well, you talk about predicting things that people are gonna need… NATS 2.0 is generally available, as it says on your website. What was the change? Why did you need the breaking major version update? What changed so significantly, and how did that come about?
The way NATS 2.0 came about was I tried to think through the notion of “Is there an opportunity to create a company where NATS might play a role?” Prior to 2.0, NATS was rock-solid, performant… Again, it was like a lot of distributed systems today - you couldn’t stretch it. It liked to be close to its neighbors, have good throughput, good RTT and so on. And it wasn’t a company, by the way. A lot of startups aren’t companies, they’re just features of their systems.
So I really wanted to think hard about that problem, regardless of whether NATS was a fit. And what I came up with - and we’ll see if I’m right or wrong; probably wrong, but that’s okay - was the notion of the internet, that defining moment in ’94-‘95, and then in the early 2000’s the realization of the global cellular network, and what those two platforms of hyper-connectivity provided. We haven’t had that event for digital systems, services or devices. We pick a different technology, and even if we use the same technology in the same company, it’s siloed up the yin-yang; you’ve got 40,000 RabbitMQ servers running, or whatever that is… So I said “What if we try to create the first secure digital dial tone for digital systems, services and devices?” That all they do is connect, and it’s kind of like connecting to the cell tower; you’re not even aware of it. It just kind of works, and it’s all in the background. But we have the ability to connect anything that’s out there.
So that’s where we started, and NATS was not up for that. It needed three major things that we identified right at the get-go, and then there’s the fourth one we just talked about - that hub-and-spoke extension mixing the utility SaaS and private owner models. But the first three were pretty simple. The security model had to be really forward-looking - there’s lots of fun, interesting math and all kinds of cool stuff underneath the covers, but the easy version is the system should never have private keys or passwords, period. It’s really simple. So if someone roots all of our NATS servers and steals them all, they don’t have anything. So we did that.
[44:08] The second was multi-tenancy, which we talked about. That was a big one. Multi-tenancy is not something you can slap on top. By the way, security isn’t either, but most people slap both of them on at the end. It just doesn’t work. Multi-tenancy has to go all the way down to the root of the codebase, and be built in from the ground up.
And then the last one was global topology - the ability to have a network topology span lossy networks, very high RTT, low signal-to-noise-ratio types of things. You can’t use the same topology to talk between servers that you expect to be really close and buddy-buddy as to one all the way in China, let’s say. So those three main things had to be done.
Now, what’s nice is that as you get older, you remember lots of stuff - meaning we knew we could never break anyone that was using NATS before 2.0. So it’s totally backward-compatible. Any config that works with NATS 1.0 will work with NATS 2.0. Everything works. Which was hard, but we thought it was important. But the major version was the signal that this is something different. This is not a message broker or a queue. It is a ubiquitous routing and framing technology that can run anywhere and can do any type of pattern, but the major ones are services and streams.
So with this approach now, is it a fair comparison for somebody to be like “Well, I want some sort of a broker system, so I’m gonna consider maybe SQS, or RabbitMQ, and I’m gonna also toss NATS in there”? Are they even solving the same kinds of problems anymore, or do you think it’s a fair comparison?
No, I think they’re basically solving similar problems… But again, you run out of runway. So with silos - not true multi-tenant, the security angle, all that stuff. We’ve seen a lot of legacy messaging tech coming to us with a pain point somewhere in that realm. “We’re tired of managing all these silos. Every time we wanna connect two things, we’ve gotta figure out how to glue these two separate systems together, just to take advantage of something where we should have just been able to flip a switch and it just works” type stuff. But we don’t see a lot of people come and say “We wanna mix and match SQS, or Google Pub/Sub.” I mean, we’ve seen it a little bit, but usually they go all-in on NATS after they talk with us. Now, where we have seen interop is with MQSeries and Kafka. People want to run NATS, and they still wanna run Kafka - whether it’s an existing investment, or something new - they just want it there, and it’s not going away. We wanna protect that investment, so we’ve done a lot of those integrations.
That is really cool.
So the company that you talked about was built around the technology… How does that work then? This is a tech podcast, but I know a lot of the listeners are also quite interested in the commercial aspects as well, things like this.
Yeah, that’s a great question, and hopefully the listeners will enjoy the next four hours of dialogue on this subject… [laughter]
If they listen to it at two times speed, though, that’s only two hours, so…
Yeah… Frankly, open source and smaller companies that are trying to make a viable business directly off of open source is a challenge. The whole industry is going through some really serious pains, because a lot of the open source is being funded through indirect revenue channels… Like Google. They don’t have to charge you to license Kubernetes, or do anything like that. They have their reasons for doing it, but they’re making their money elsewhere… Which means that it drives a consumer bias that it should be free. And that’s a challenge. When I saw that with my last company, [unintelligible 00:49:55.24] it’s a huge challenge.
So part of that “What do we wanna build a company to do?” was equally weighted with “How are we gonna make it a viable business?” So for me - again, a very unpopular opinion, I’m sure - I personally don’t believe in open core. I think it’s freemium enterprise repackaged, and it’s gonna fail, just like freemium enterprise did. I could be wrong, but my bet was that there’s really only three ways to make money off of open source: run it as a service; bundle it with hardware, because a physical thing totally changes the consumer’s bias, and they have no problem paying for it… And then the last one is augment it with a service.
Some people kind of push back and say “Well, that’s open core.” But the distinction is it’s kind of like your phone and an AT&T contract. You’re augmenting your phone with that, and it makes it better, and it makes it actually work - you know, you need a telephone, or a cellular plan, or whatever… But I do draw that distinction. So what we did was we looked at “How do we take those three rules (which could be wrong, but that’s my bet), and where do we go?” Because running NATS as a service, as a silo - this is a big deal - it was a no-op. We tried it at Apcera; no one signed up. And they were so nice to us, and they said “Derek, we literally run this on a Docker container, and it’s been running for three years now. We don’t even monitor it. We don’t even care. So why are we gonna pay you to run a singleton?”
So we thought to ourselves, how do we make it so that “the sum is greater than the parts” type stuff? One is that we can create a global network - all cloud providers, all major geos… Which you can do yourself; there’s nothing that’s not open source there. But it’s cost-prohibitive, even if you just wanna use two sites, say one in Europe and one in the U.S. So we did that, obviously.
[51:45] There’s always the notion of on-premise recurring support. That’s kind of the marquee that you want. NRE, consulting, training, education - it’s usually one-to-one, so you don’t get a market multiplier there whatsoever. Recurring support - if it actually is clean, you can get your 10x+ kicker, for those who know all the market value accelerators, and things like that.
So we care deeply about that. We do do NRE training and consulting, but we know it’s not a huge source of revenue. It’s a huge source of customer experience and satisfaction, but not revenue.
So we have the SaaS model with NGS, we have on-premise recurring support for our stuff, and whether it’s good or bad, a lot of people coming to NATS now are like “Wow, that’s so cool! It took me two seconds to write the NATS app and run it against the demo servers”, which have always been free and available and such like that. “I wanna do something hard.” It’s just natural, engineers are just like that; they’re like “That was too easy, I wanna do something hard.”
So NATS now allows you to set up some crazy complex topology, with some crazy security rules… So they try to do that, and now all of a sudden because of that complexity people go “Oh, we wanna get support.” I don’t like that; I like things that are simple and just work, but we have noticed that.
The other thing that’s interesting from our perspective is that we don’t believe NATS is just a connective technology. It is, but how do you value it as a user? You know, I’ve been doing this so long, and my bias is “Everything is just a message.” So whether you’re using a database driver, or whatever, you’re just sending messages back and forth. So what happens if we say “Everything is just a NATS message”? What I mean by that is you get everything that NATS does, you connect, distributed queuing, load balancing, circuit-breaking, self-healing… It puts itself back together, by the way, without any help from any platform technology type stuff, really…
But what happens if we said “Hey, you know that export and import, those streams and services, and you can export one and someone can import it?” What happens if the system just has a service that you can import, that says “It’s a KV service”, and now you can do zero-trust, secure key-value set and get from anywhere in the world, with any application. Hm. Okay, now all of a sudden NATS can do simple state storage and retrieval. It doesn’t solve all the apps, it doesn’t solve world hunger, but hey, okay… Now what happens if it can do object storage? Very large objects, very efficiently. The system dynamically moves things around, it understands where requests are coming from… And again, because we don’t care, we can just move Mat and run them wherever, without anything special coordinating. That’s very possible.
And then going further, it’s like, well, what happens if there’s a GraphQL service, and I’m just sending requests over NATS to a GraphQL service, but all the security works, all the authorization/authentication is built in, it’s what I know, it just kind of works? So those would be premium services that we could charge for. So you get – I call it basic cable, the dial tone, and then you can get the premium channels a la carte if you want to.
And then the last piece of the business model is – some of those may be very compelling; let’s say anomaly detection, or some advanced analytical statistics on traffic patterns, and stuff like that, that a company might say “We really want to use that service, but we can’t use yours. We have to run it in our own data center, our own VPS”, whatever that is. Then that’s software license revenue.
The way we envision the company succeeding is the first bow wave, the first 2-4 years will be recurring support as the major revenue driver. Then, as we land – we landed web and mobile, we’re about to land MQTT for IoT… Those two will drive more of the NGS stuff, direct. Or they create a leaf node that they connected with and use NGS to talk across the world, which we have a couple of folks doing already.
That’s kind of the whole business model in a nutshell. We’ll see how it works out, but it’s a challenge, for sure. The consumer bias that it should be free is the hardest thing for any OSS developer to fight against.
[55:45] I feel like you run into that sort of thing even beyond open source software. It’s kind of a weird thing, I guess, in the software world… To give an example, people write tutorials and books that teach things, and in most environments people expect to go out and have to buy a book. But in the programming world, it’s very common to assume “Well, somebody will just make this free.” And that happens for all sorts of things.
There is some upside to trying to help people access things, and trying to make it accessible to the world, but then there’s also, like you said, that flipside of it’s very hard to support, unless you get to like a Google scale, or some sort of scale where it’s possible. Because prior to then it can be very challenging.
I think you raise a great point, and one of the things that I debate with folks quite a bit is the notion of how people frame support. So people go “Oh, I’m paying you because of me. And if something’s wrong, I want you to help me”, which I totally get. But if you look at healthcare systems - the Western world mostly does the same thing; I pay for healthcare when I’m sick. But if you look at Asia-Pacific countries, it’s the opposite. You pay for insurance or you pay a doctor when you’re well. You don’t pay them when you’re sick.
I’ve had this debate - and to be honest with you, I lose most of the time - with customers saying “We’ve been running this for two years. We love it, it’s been running for two years, we haven’t had to touch it, we don’t even monitor it… Why would we wanna pay for support?” And I say “You wanna pay for support so that as we keep making it better, it keeps doing this thing where you never have an issue to deal with, or they’re very few.” But to be honest with you, it falls on deaf ears most of the time. It’s very challenging. We have a ton of production usage, in the tens of thousands of users, and a minuscule percentage of people that want to pay for support.
Stop writing such good software. [laughs]
Yeah, and you know what - I dealt with that earlier in my career, where people were like “You need to make it longer.” It wasn’t a book I wrote, it was a manual. “It has to be 200-some pages so we can charge more.” Or “It needs to be more complex; they need us to get it up and running.” And I just resisted that. I said “That just doesn’t feel right.” It should be simple and approachable, both from an application standpoint and an OpEx standpoint. But the current state of the world is that if you actually nail all of those, you will plummet your voluntary support contracts, for sure.
Some of this is challenging, because – you even said the open core model is… I think that one’s hard to get right, because it doesn’t work for a lot of software. An example in the Go world is Caddy server, which a lot of people have probably used; when they tried to transition to a paid model, I think they struggled, because the core thing that a lot of people really wanted was already there, and it was hard to charge for something they were used to getting free.
But then, you come from the – I think you said the Rails world. Did you mention Rails earlier? Or Ruby?
Yeah, more Ruby, but I definitely understood the Rails community, and all that.
Okay, so in the Rails world there was something called Sidekiq, which is a background job processing type thing. Whenever that came out, the core of what a free user would need was actually there for the free users… They had a nice separation between what enterprise users would actually want and pay for, so as a result it has worked well for them. But with most open source it’s very hard to find that distinction, and it causes issues where it’s really hard to make the business model work, because either nobody pays you, or you basically make software that the free users can’t get any value out of.
Going back to the metaphor you had, with the phone, and you can buy an AT&T plan - it’s almost like you sell them the phone and say “You have to pay me for the battery though”, and they’re like “Well, that’s not very useful now. That’s not augmenting it. That’s making it functional.”
You’re absolutely right. And what was interesting - and you could see this a little bit with the Kubernetes ecosystem - is a lot of people would knee-jerk around a new technology taking off, and approaching let’s say an open core model saying “Well, we’re gonna add monitoring and management tooling. That makes it easier to ______”, whatever. And I had a friendly bet with a bunch of folks that I have known for years, to say “Here’s why I don’t think the open core model will work…” Because these technologies now, when they become very ubiquitous – Kubernetes is not easy, but you can get through it, and there’s a large ecosystem, so there’s a ton of people working on it…
[01:00:20.24] The first thing that ecosystem is gonna do today - which it used to not do, because the barrier to entry was too high for someone to sit down and write a monitoring and management system… As I said, these ecosystems spike so fast and so quickly that anything that looks like an easy, tangential business opportunity around an open core model gets sucked in.
Everyone complained on the first versions of Kubernetes about a lack of monitoring, management tooling, dashboards etc. and I think it took them less than two releases to have a skeleton version of that… And then of course, no one now would start a business on a dashboard for Kubernetes type stuff.
So you have to think “Hey, how do I make what I’m offering way more valuable as a service?” And again, that’s even a challenge, because if you really do nail the experience for a single use case, you might struggle with “Where’s the value with you running it for me, versus me just going ‘docker run’, or whatever type stuff?”
But I think this notion of this macro trend, that I believe the opportunity to enable this hyper connectivity opportunity for all the digital system services and devices is huge, if it’s done right… But it has to be very approachable, very easy, open source, good governance model, good OSS licensing… I’ve gone through all the phases of closed open source, and licenses and governance bodies.
And you can mix and match models - meaning I could use Google Cloud’s version in Google Cloud, and it runs great. But if I want to run my own server, I could, and it all just kind of works seamlessly. I do believe that has value that people would want to pay for.
Now, the basic dial tone that we talked about, just for the listeners - that is gonna be a low-margin volume play. The premium services - we can get better margins, we think. And of course, on-premise, the recurring support - we’ve got multiple tiers there… But for folks who are thinking about starting a company - I don’t wanna ever dissuade someone from starting a company, because I think it’s great; I think it’s amazing, but it’s very hard, it’s very lonely… I do encourage you to think really hard about “Am I building a company, or a technology feature?” If you’re building a company, do I have a really thought-out vision for what the business model looks like, or do I just say “Oh, I’ll figure it out once I get a whole bunch of eyeballs, and millions of users”? That doesn’t work as much anymore.
Yeah, that is so true, the perception problem. That is a challenge for people. We had a similar thing - we had this technology that was extracting metadata from video content using machine learning… So machine learning models would look at the video frames, and then actually be able to describe what’s going on in a video; and of course, make that searchable, and all those things you can imagine once you’ve done that… So we were thinking maybe that would be charged per gigabyte, or something. We tested it, and people were saying “Well, to store on Amazon it’s only 2 cents/gigabyte”, or something. And we’re like “Well, yeah, but that’s just storing. This is using machines.” And they’re like “No, it’s way more expensive than Amazon…” And it was like “Okay…” Common sense - we almost should just not assume that there’s common sense around, from my point of view… Do you know what I mean?
Yeah, and I’m an angel investor and I consult with lots of smaller companies, and a lot of companies struggle with “How do I turn it into a business?” I said “Think about a way where my experience with your software becomes better.” If you have a system that collects data from everyone, keeps the privacy concerns in place - that’s a big, big deal, of course… But it keeps all of that stuff at bay. Essentially, it makes my use of the product better because of that.
[01:04:09.29] I was fortunate enough to work at Google from 2003 to 2010 or so, and I remember some of the – obviously, Google has a lot of extremely bright people that think of very, very elegant, complex solutions to very hard problems. But what I liked to see a lot within Google was extremely simple solutions to complex problems. So spam was a huge deal when Gmail came out. It was just awful. But Gmail became so popular, and we had lots of usage on it, even in the early days. And if we just put a little button that said “Hey, I don’t like this message. It’s spam”, it would see all the signals and say “Wow, in the last five seconds a thousand people clicked on the same message and said it was spam”, and we could automatically mark it and move it off. So that power of collecting data and using it to optimize individuals’ experiences is a model that I’ve talked to a lot of startups about, and said “Can you encapsulate what you’re trying to do with your software where the service is augmenting - it’s not open core, you’re augmenting with it - and your experience gets tremendously better because of it, and it’s something they can’t recreate?”
For your case, Mat, I agree with you. They’re just saying, “Wait a minute, I can store a gig in my own data for way cheaper than you’re doing it.” But if you said “Hey, can you collect all of the spam signals from 40 million people for Gmail?”, they can’t do that. They’re just like “I can’t do that.” So then what happens is they go “I know I can’t do that on my own. Does it really help me that much that I’m willing to pay for it?” And that’s always the trade-off.
It’s funny, because that is almost exactly the way it went. So that was really funny you said that.
It is indeed time for our Unpopular Opinions…
I know you dropped a lot of gems throughout the show here, but I’m wondering if you have a solid, solid unpopular opinion for us…
Well, in terms of the gems - remember, it’s advice and it’s free, so you get what you pay for. Your mileage may vary. I’m usually never short on unpopular opinions, so here are my two that are probably applicable to this. One is that most systems that you think are distributed aren’t. Pick your favorite open source project, and I’m telling you it’s distributed as long as it’s all close together. If you try to stretch it, it’s not distributed anymore. And the other one is that I really do feel that us using HTTP to connect modern distributed systems - where you have to have sidecars, and proxies, and load balancers, and everything under the sun - is just madness. I cannot believe people keep doing it and go “Yup, that’s the way we should do it.”
I understand how we got here; I can’t understand how we haven’t reached a tipping point where we go “Hey, we don’t have to do it like that anymore, because everything’s modernized, so we can actually do something real now” type stuff. So those are my two.
Yeah, that is a pretty good one. There are examples in real life like that. I find real life is basically like someone’s legacy code, and we’re just born into it, and then we’re like “Why would that be the way it is?” But yeah, I get what you mean. I like that. You’re right, HTTP is kind of crazy… But it works, doesn’t it? It *just* works. It mostly works… And so it wins.
Most of the time.
[01:07:52.12] It works if you have a large team that can watch everything and keep the lights on. So if you don’t have to deal with it, then for you it’s like “Well, this works the same.” But if all of a sudden you go “Yeah, it just works, but man, I don’t wanna be spending $8,000/month on my little system, so now I’m gonna do it myself”, then all of a sudden you start to realize why we have all the… And this whole notion of – I guess a third, smaller one, but this whole notion of “Architect everything with sidecars.” That also drives me nuts. It’s like, “I’ll just add another sidecar to it.”
Yeah, it doesn’t stay simple for very long, does it, when you have to tackle things like that, tackle problems like that.
No, it doesn’t. And to be honest with you, we’re in a weird global situation, as we all know, and my hopes and thoughts are with all the listeners; I hope you’re safe, and healthy, and all that kind of stuff. But when you’re looking at a company and you’re trying to figure out how to drive revenue - we talked a lot about different pieces, but I’ll tell you, at the end of the day you’re either selling a vitamin or an aspirin. And when times get tough, people stop buying vitamins.
So if you can figure out a pain point and make it easier for folks, that always is easier than everything else. So for NATS, to be honest with you, it’s two-fold. One is OpEx spend is too high; too many moving pieces, or it’s just too expensive to put it on Google Pub/Sub if we’re trying to do two million messages a second type stuff… So it’s like “Great. We can cut your OpEx. No big deal.” Or it’s that pattern we talked about early in the show, of “I’ve got lots of remote thingies that all need to communicate - East/West, North/South, those central things - and I don’t wanna have to worry so much about security”; it’s just one ubiquitous communication layer.
So those two pain points are what’s mostly driving us as a business. Maybe not NATS as a project and an open source technology, but us as a business; it’s solving those pain points.
Great. Very interesting. Thank you very much. Thanks for all the insights into the commercial side as well. We often don’t explore that on this show.
That is true.
Happy to share it, although I’m mostly probably wrong, so…
Fair enough. Like HTTP. [laughter]
And the hard part there is I think almost every open source project has been mostly wrong when trying to figure out how to build a business around it. Even ones – I’m thinking of like CoreOS… I don’t know how well they did, but they had to be acquired, and I assume that if they had a better alternative, they wouldn’t have done that… So you see ones like that, and I’m like “CoreOS seemed like it was doing very well, and… Unfortunately, no.”
Yeah, and it’s interesting… Building a software company - you can get a lot of mileage out of thinking of it more as a psychology problem than as a technical problem or a go-to-market strategy. I try to frame everything now as “What is the psychology of the consumer?”
If you’re building a kernel - except if you’re Microsoft, and even they may have to let go of that shortly - the consumer bias is “Oh, those are always free.” Even though there’s probably multiple hundreds of millions of dollars of expertise and investment in these, the consumer bias is it should be free. So always ask yourself, “What does my consumer look like, and what is their bias around what I’m trying to offer them?” If you reluctantly say “Oh crap, they’re gonna think it has to be free”, you might wanna rethink what you’re trying to do.
Can we just all start going into stores and thinking “This should all be free”? Will that work, if we all do it? [laughter]
That’s where I said OSS bundling with hardware is one of my three models, because the consumer bias around physical things is that you have to pay for them. I think we were talking earlier about a book. Well, I have to buy the book, right? Even if it’s on Audible, I have to buy the book.
By the way, with our IoT strategy, which is slowly developing and we’re launching it, the notion of saying “Hey, there’s some functionality, but it’s built onto this little teeny thing that you buy” - even if it only costs you like $8 or $30, the consumer bias is it’s not zero, which is the biggest thing. So the hardest thing is to go from zero to non-zero with purchasing and consumer bias.
Thank you guys for the time. I really appreciate it and I enjoyed it. Hopefully, the listeners got something out of it, but… I appreciate the invite.
Yeah, it’s been great having you. We’ve learned a ton about distributed messaging systems, and the excellent work you’re doing with NATS… I definitely wanna go try it now. I’m used to some of the other ones we’ve mentioned earlier, and I’m sure some of our listeners are definitely gonna be trying it out as well. It sounds very, very cool. Thank you so much for being on the show.
Before we wrap up, I do wanna mention that for those of you who are supposed to go to conferences or speak at conferences, meet with friends at conferences, and obviously the state of the world right now prevents that - we have a lot of our conferences moving online, and virtual, which is great, because it still allows us to maintain a tight-knit community.
I believe the very next conference coming up might be GoGet Community. GoGetCommunity.com is where you wanna go to check out the next virtual conference coming up. Mat is gonna MC as well, along with Mark Bates, and yours truly is gonna play a small part in there… So yeah, definitely check that out.
If you have suggestions for the show, if you have your own unpopular opinions you wanna throw at us, that’s fine - we’ll take them in stride, and try to keep coming up with great show topics for you. Again, Derek, thank you so much for being on the show. So long, everybody. Have a good one.
If you guys have two minutes, I’ll give you one last story to top all of it. [unintelligible 01:15:22.09] We were doing TIBCO, so we had all of the large financial institutions; every single one - Goldman, Lehman, everybody. And we were partnered with Sun, so you always had to run our software on Sun in the financial [unintelligible 01:15:34.28] And the CEO came in to me one day – I’m in Palo Alto, and he’s got a suit… The guy walks in, he hands me the suit, and I go…
We are live, just so you know.
Okay. I didn’t want something that shouldn’t be live going out…
That’s fine. So I said “What’s this for?” and he goes “You have to fly to New York. There’s a problem.” So I fly to New York and I come into an unnamed, large financial institution, and they go “Your software sucks.” And I go “Okay, how is that?” and they said “Well…” - and I can probably say this now, since it’s a former company that no longer exists… They go “We bought this multi-million-dollar Sun box, and we run your stuff on it, and it’s terrible. We can run it faster on a desktop box.” So I sat in the room and they were really not happy. They didn’t like it that they had to wait six hours for me to arrive, but that’s how fast planes travel… [laughter]
So I’m literally in this old-school server room, with people coming in and out; it’s freezing cold, you’re sitting in there, typing, or whatever, with CDs, and stuff… And it took me probably four hours to figure out what was going on. And it wasn’t us. But that never helps anyone, usually. So the person came in and I said “It’s not us.” They said “It is you. It’s your software. It sucks.” I go “It’s not us.” I said “It’s the operating system.” And they said “No way.”
Long story short, the CEO of that company called the CEO of Sun and said “Hey, I need someone here in six hours.” So I get to sit around for six hours. They said “You can’t leave”, so I couldn’t leave. I got to go to the bathroom and that was it… And this person comes in and just unloads on me. The same thing probably happened to him; they walk in with a suit… “Here, put this suit on, go to the airport” type of stuff.
I said “I promise it’s you.” Or the kernel, sorry. Not you. Blame the problem, not the person. And he was yelling at me and yelling at me. So in that six hours I had to wait for him to show up, I wrote a program. And all the program did was it said “Hey, find out where the interrupt handler is, and then on every other core run a busy loop, meaning you totally take out all the other cores.” So our software was running, it was running really bad, and I said “Watch this” and I go “Click!” and all of a sudden our rates went up. Not big, but they went up pretty good. And he’s like “Oh…!” And I control-C-ed my app, and then – of course, he goes “What did you do?” I said “I just pegged all 20 CPUs except for” – or I can’t remember how many they had; they had a lot. It was the most expensive Sunbox you could buy.
So long story short, they had this weird thing where they were affinitizing network interrupts and scheduling us on the same core, so we were just sitting there, waiting for each other non-stop, all the time. But once you did that, the OS goes “Crap, I can’t do that” and it’s gotta move him somewhere else.
And we had a good laugh at the end, but it was a tense 12-14 hours. So I remembered that, and said “Hey, even if it’s not your problem, show up, own it like it is your problem. And when it’s not your problem, remember it could have been, so be nice.”
Right. That’s a good lesson.
Yeah, that’s great. Actually, it’s a shame that one wasn’t in the show.
I know, yeah. That would have been a good one. Maybe we can splice it back in.
Our transcripts are open source on GitHub. Improvements are welcome. 💚