Mihai and Ashley join Jon to discuss data streaming: what it is, why it is being used, and the common mistakes developers make when setting it up. They also discuss some of the tools in the ecosystem, including Benthos, a tool created by Ashley Jeffs to make the plumbing part of data streaming easier to get right.
Sponsors
Teleport – Teleport Access Plane lets you access any computing resource anywhere. Engineers and security teams can unify access to SSH servers, Kubernetes clusters, web applications, and databases across all environments. Try Teleport today in the cloud, self-hosted, or open source at goteleport.com
LaunchDarkly – Ship fast. Rest easy. Deploy code at any time, even if a feature isn’t ready to be released to your users. Wrap code in feature flags to get the safety to test new features and infrastructure in prod without impacting the wrong end users.
Equinix Metal – If you want the choice and control of hardware…with low overhead…and the developer experience of the cloud – you need to check out Equinix Metal. Deploy in minutes across 18 global locations, from Silicon Valley to Sydney. Visit metal.equinix.com/justaddmetal and receive $100 credit to play.
Notes & Links
- Benthos - a data streaming tool created by guest Ashley Jeffs.
- Materialize - a tool for querying data streams with SQL.
Transcript
Hello everyone, and welcome to Go Time. Today we are joined by Ashley Jeffs and Mihai Todor to talk about data streaming and Benthos. Ashley, how are you?
I’m good, thanks for having me on.
No problem. Mihai, how are you?
I’m not too bad. Thanks for doing this.
Alright, so Ash described himself as an open source developer, with projects typically written in Go and fronted by unappealing mascots. His main focus is the declarative stream processor Benthos. Ash, what is the mascot for Benthos? Am I allowed to ask that?
Yeah, you may ask that… It is a blobfish, but I normally just refer to it as a blob, because it’s not – I think it’s deviated quite a lot from what a blobfish actually looks like… So I just call it the blob normally, and I call Benthos users blobs.
Okay. Mihai described himself as a seasoned software engineer focused on cloud computing, scalability and open source Go projects. In his spare time he is trying to study bioinformatics and is helping people organize community events such as the C and C++ Dublin meetup group. So why not a Go meetup group, Mihai?
Yeah, we had one in Dublin, but for some reason there weren’t that many people joining it, sadly…
So you had to go back to C?
Well, then everything went online, and then there were a bunch of established groups that were sufficiently popular and acquiring all the speakers, and such…
It sounds painful. Alright, so today we’re talking about data streaming, and I wanna just start sort of at the beginning and just talk a little bit about what data streaming is, why it might be useful, and then we can sort of dive into a little bit more stuff, like tools we could use for it and use cases. So does anybody wanna take that question, what is data streaming?
[04:01] I can probably give that one a try… So it’s probably interesting to put it in terms of what’s different about data streaming from event sourcing and event-based systems, since you’ve already had an episode on that… In event sourcing systems or infrastructure that’s kind of built around events you’re essentially using these message queue systems, and passing instructions asynchronously… And that’s what they are. It’s like a message or record, whatever you wanna call it, but it’s kind of an instruction of something to do in order to have your platform operate… Whereas data streaming is a completely different sort of paradigm that’s ended up joining with event sourcing around the same tooling now… But it’s pretty much just – instead of sending instructions around, you’re sending data that is sort of an asset to your company; it’s kind of a product, it’s important to you as an asset that you wanna keep around long-term, usually, rather than just something that’s kind of temporal, that you flick around the platform a little bit and then it sort of fizzles out once everything’s dealt with.
So the tooling used to be what we’d call batch processing-based, where you’d kind of have like a data store somewhere that’s sort of the permanent destination of all this information… And you would kind of populate it on a schedule; so you’d pull data from somewhere, dump it in this thing, and you would then use that thing to query the data and do interesting stuff with it. And what’s kind of been happening gradually over the last ten years or so is tooling has built up around that to enable a more real-time kind of streaming architecture with the same kind of data. So the data is still treated like an important asset, it’s still a thing that you wanna keep long-term, but the way that you’re moving around a platform looks a lot more like event-sourcing systems now than it ever did before. But the kind of difference is in how important that data is to you as a company. It’s something that you wanna keep around potentially for maybe months, maybe indefinitely you wanna keep this data around.
So you might be flushing it in real time through a platform and doing interesting stuff with it in real time, but it’s very important that the data isn’t just lost or eventually it gets persisted somewhere, maybe multiple times, with different versions of it, that kind of thing.
So can we put this into something concrete, like what is an example of where you’ve seen data streaming being used?
Do you wanna take that, Mihai?
Yeah, I can go… So one use case - let me give a bit of background. I was working at a company called Nitro, and we were ingesting a lot of click data. And I saw this thing called Benthos, and I was like “Man, I really want to use this somewhere.” And that was an interesting use case, where - you know, we get all these people who have a desktop app and they click on various things in the app and you send all those click events to the server… But you don’t really want them to end up being stuck there for a while, and then you do some batch processing on them. You might want to have real data analytics, where you look at them in real time and you update a bunch of graphs, or… You know, you can do many things with them.
Maybe a more interesting use case was at another company where I was dealing with lots of audio data in real time, where you have a big call center where people are talking with customers, and you want to give them some hints on the screen. “Hey, you’re speaking too fast. You’re speaking too slow. Hey, you’re overlapping.” And there was a system in place that could analyze the audio and using some machine learning it could predict in real time this event happening. And then those events would have to be sent to the operator in the call center, and that was another use case for data streaming. This is a place where I ended up using NATS to kind of receive and send those events along. I could talk more about that a bit later.
So these tend to be cases where actually getting some – like, looking at the data and doing something with it in real time is much more important than “Oh, we’ve got the data and an hour later we processed it and realized this person was making a mistake that we could have tried to correct in real time.”
[08:13] Right. It’s also a matter of making sure the events get delivered, so you might want something like at least once delivery, and you want to make sure if the event does end up sent multiple times, you want to have some sort of idempotency, so you don’t confuse the users… And you do want them to be reliable, so making sure that things are up and running…
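To make the idempotency point concrete, here is a minimal Go sketch. It assumes every event carries a unique ID set by the producer; the `Event` and `Deduper` names are hypothetical and do not come from any tool mentioned in this episode. A real service would also need to bound or persist the set of seen IDs.

```go
package main

import "fmt"

// Event is a hypothetical message with a producer-assigned unique ID.
type Event struct {
	ID   string
	Hint string
}

// Deduper remembers which event IDs have already been handled.
type Deduper struct {
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// Handle runs fn at most once per event ID and reports whether fn ran.
func (d *Deduper) Handle(e Event, fn func(Event)) bool {
	if _, dup := d.seen[e.ID]; dup {
		return false // redelivery of something we already handled
	}
	d.seen[e.ID] = struct{}{}
	fn(e)
	return true
}

func main() {
	d := NewDeduper()
	shown := 0
	show := func(e Event) {
		shown++
		fmt.Println("hint:", e.Hint)
	}
	d.Handle(Event{ID: "evt-1", Hint: "slow down"}, show)
	d.Handle(Event{ID: "evt-1", Hint: "slow down"}, show) // duplicate delivery, ignored
	fmt.Println("hints shown:", shown)
}
```

With this in place, an at least once source can redeliver the same event without the operator seeing the same hint twice.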
These cases that I described kind of lend themselves to a situation where you end up – or you need from the use case to actually send those events in real time… But there are many places where people just have a bunch of events from the platform that are generated while this thing is running, and they simply want to replay them back in the same order, or they want to maybe run some analytics on past data, and it’s useful to potentially not send the whole thing at once to some system and make it crunch a lot of data at a single timeframe, when it could be very intensive, it might require a lot of resources… It might be nice to spread it out during a longer timeframe.
That definitely makes sense. So Ash, you were talking about event sourcing, and we’ve talked a little bit about that in a past episode… So how would you differentiate data streaming from event sourcing? Because some of the things sounded similar, in the sense that – like, with event sourcing I think in my mind a lot of the time all those events are still persisted… That’s one of the advantages at least in some of those systems, is that you can replay them… So what would the difference between that and data streaming be in your mind?
To be honest, there’s a lot of overlap, and the tools are pretty much the same nowadays as well. To be honest, it’s kind of difficult to work on something and say whether it’s data streaming or event sourcing anymore… But in my mind, the way that I kind of partition them is if you’ve got a system - say you’re processing a stream of orders from a website; in an event sourcing kind of architecture you would maybe be passing those events around, so that services can immediately act on those orders and do something… So charge the customer’s bank account or something, and then trigger some sort of delivery… And then once that’s all done, you might keep it persisted somewhere to replay later, or maybe feed that into some test system or something… But operationally, you’re kind of done with that message. It’s over. It’s got a lifetime. Whereas in a data engineering context, you might be processing the same feed; it could be orders on a website. But what you might wanna do is something like analytics on top. So you might be interested in “Okay, over the space of an hour, how many orders do we get from people who own two dogs, versus people who own a cat?” And maybe you’re gonna use that to drive things like cat campaigns, and stuff.
So you wanna have some sort of analytics built on top of that data, and you’re treating the data like it’s an asset, so it’s an important feature of your company to have this thing lying around. And then when you’ve maybe done some immediate analytics, some streaming analytics to infer some important business data, you might then wanna just put it in an Elasticsearch index, or an S3 bucket, and you’ll keep it long-term, because maybe in the future you wanna look back on those orders and work out “What do psychopaths buy at the weekend?”, that kind of thing. Just some random query that your marketing team has. It’s not always analytics… There’s a lot of analytics around this stuff.
Okay. You guys had mentioned idempotency, and I know that comes up a lot in programming. For anybody who is new to programming, I feel like that’s something they eventually have to learn to handle, because it seems like almost all modern architectures are getting to the point where you have to be able to handle the same requests coming in multiple times, and like you said, not charging this person’s credit card multiple times, because that would be pretty bad… Are there other mistakes or things like that that people can make when they’re using a data streaming system?
[12:04] I’m resisting the temptation to just roll my eyes infinitely… In queue systems we tend to have two main delivery guarantees with those. At least once, and at most once. At most once is kind of similar to just writing over a Go channel, where you’re pushing something through it but you don’t really care if it’s been delivered. And that works in a process, because you do know it’s been delivered, whereas in networking you have no idea… Whereas most systems are built on at least once delivery guarantees. There’s exactly once, which Daniel in the event sourcing one said very well, it’s basically snake oil, and I completely agree with that… But then at least once systems are very rarely at least once. I’ll kind of explain what it means.
At least once as a paradigm - you’re basically saying that in the event of failures, so any sort of networking problem, you will err on the side of delivering a message multiple times rather than zero times. So the way that that’s normally implemented is with some sort of acknowledgment system. So I send a message over a network, and I expect something to receive it, something then receives it and then it doesn’t end there; they send me a message back to say “Yes, I’ve received this message.” And all modern queue systems are pretty much built on something like that. Kafka and most queue systems that call themselves, sorta, streaming work slightly differently, in that the acknowledgment is you kind of checkpointing where you are in this sort of logical queue. Because the queue is there permanently; it’s not as if a message disappears, so you kind of remember where you are in that queue at any given point, so if you restart, you go back to where you were.
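The checkpointing style of acknowledgment can be sketched in a few lines of Go. This is a toy model, not any real client API: the log is a plain slice, and `poll` and `replayAfterCrash` are hypothetical helpers. It shows that when a consumer restarts from its last committed offset, anything read but not yet committed is read again.

```go
package main

import "fmt"

// poll returns up to n records starting at offset, plus the offset that
// follows them. Reading never removes anything from the log.
func poll(log []string, offset, n int) ([]string, int) {
	if offset > len(log) {
		offset = len(log)
	}
	end := offset + n
	if end > len(log) {
		end = len(log)
	}
	return log[offset:end], end
}

// replayAfterCrash reads two batches, commits only the first, then
// "restarts" from the committed offset and returns what is read again.
func replayAfterCrash(log []string) []string {
	committed := 0

	_, next := poll(log, committed, 2)
	committed = next // first batch fully handled, so checkpoint it

	_, _ = poll(log, committed, 2) // second batch read, crash before commit

	again, _ := poll(log, committed, 2) // restart resumes at the checkpoint
	return again
}

func main() {
	log := []string{"r0", "r1", "r2", "r3", "r4"}
	fmt.Println(replayAfterCrash(log)) // prints [r2 r3]
}
```

The second batch shows up twice rather than zero times, which is the at least once trade-off described above.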
And if you follow acknowledgments, so if you’re consuming from an at least once source, so you’ve got a queue system that uses acknowledgments - it could be RabbitMQ, it could be Kafka, it could be NATS… And then you’re writing your data out, so say you have like a middle component in a pipeline - if you’re writing data out onto another at least once queue system, it would make sense for you to call yourself at least once. It’s kind of a misleading term, because you’re using at least once on this end, you’re using at least once on that end, you’re using this acknowledgment system… If it’s not at least once, then what is it? But the reality is at least once I think is kind of misleading, because that doesn’t mean you’re not lossy. It doesn’t mean that there aren’t circumstances where you might drop data, and you can – you know, there are ways of architecting your service to act that way, but that’s not how most services that use at least once sources or sinks work. A lot of them will behave in a way that under certain circumstances they will lose data, but I guess one of the issues is any system that follows these rules generally, even if they are potentially lossy, will look exactly once in normal operation. It’s not until you hit edge cases where maybe all the services downstream have stopped, so you can’t send the data anywhere, and you run out of memory, and stuff like that. Or maybe it’s crashes, disk corruption, stuff like that. I mean, I could talk for hours about what those edge cases are… It depends how much you wanna dig in.
I mean, that’s up to you… I definitely think you’re absolutely right, in the sense that – like, especially when we’re testing code or just writing locally, we have this idea that we can go to our browser and visit some page that we’re building and be like “Oh yeah, it works.” Like, when there’s one user, and the database is local, and there’s no latency, and all these things… And then when you push it to production and actually have real memory limits and high usage and things like that - that’s when all the mistakes come out… So I’d say it’s challenging there… So how would you recommend people avoid some of those mistakes of becoming lossy?
I’d say that the most common one is the idea that the acknowledgment happens when you’ve received the message. So you’re reading from an at least once queue system, and you’ve got a message, logically, you can see it, you’ve written your code so that you’ve got this payload… And then you acknowledge it. That seems like quite a sensible thing to do, is you’ve received it, so you acknowledge receipt of the message, but the reality is you’ve received the message but you haven’t finished with it yet. You’re not done with that payload.
[16:14] So if you acknowledge the message and then immediately your service crashes, that data is gone for good, unless some poor operations person waking up at 3 AM knows “Oh, I need to go and chase that data up and make sure we can recover it in time.”
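The failure described here can be reproduced with a small simulation. This is a sketch under stated assumptions: `queue` is a toy in-memory stand-in for an at least once broker (anything unacknowledged is handed out again on the next run), and the crash is simulated by returning early. None of these names belong to a real library.

```go
package main

import "fmt"

// queue is a toy stand-in for an at-least-once broker: any message that
// has not been acknowledged is handed out again on the next run.
type queue struct {
	msgs  []string
	acked []bool
}

// run consumes every unacknowledged message once. If crashOn >= 0 the
// consumer "crashes" while handling that index, before writing it out.
func run(q *queue, out *[]string, ackOnReceipt bool, crashOn int) {
	for i, m := range q.msgs {
		if q.acked[i] {
			continue
		}
		if ackOnReceipt {
			q.acked[i] = true // acknowledged before the work is done
		}
		if i == crashOn {
			return // simulated crash: the message was never written out
		}
		*out = append(*out, m) // the "downstream write"
		q.acked[i] = true      // safe place to acknowledge
	}
}

// deliver runs a consumer that crashes on the second message, then restarts it.
func deliver(ackOnReceipt bool) []string {
	q := &queue{msgs: []string{"a", "b", "c"}, acked: make([]bool, 3)}
	var out []string
	run(q, &out, ackOnReceipt, 1)  // first attempt crashes on "b"
	run(q, &out, ackOnReceipt, -1) // restart, no crash this time
	return out
}

func main() {
	fmt.Println("ack on receipt:   ", deliver(true))  // prints [a c]
	fmt.Println("ack after writing:", deliver(false)) // prints [a b c]
}
```

Acknowledging on receipt silently loses "b"; acknowledging only after the downstream write means "b" is redelivered after the restart.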
A really common one is say you’re reading from Kafka, you’re committing offsets and messages as you’re consuming them… Most client libraries will make it their business to make it easy for you to have auto-acknowledgments, so as you consume messages, as you receive them, the offset gets marked, and then it gets committed on some sort of cycle. And you do stuff with that message; you could be doing some business logic and then maybe you hit some other services, or something… And then finally, if you want to preserve that data for some other services downstream, you might write that data onto some other queue system. It’s a very common pipeline approach of daisy-chaining all these services. And if your service is doing auto-acknowledgments, that’s gonna look like an exactly once system most of the time. And then if one day an operations person wakes up from an alert at 3 AM and they’re looking at their graphs and they see that the service has restarted like 50 times in the last hour, they’re gonna freak out if they see that the data going into that service is more than the data that came out. If you get the aggregate of, say, like an hour or something… Handling those kinds of edge cases is basically baked into Benthos; it’s what I call kind of like operational simplicity, where it tries to plug all those holes as best it can. I mean, it’s like saying you’ve got a perfectly secure system. You can’t absolutely guarantee that. All you can do is follow what you think are best practices.
So the idea of operational simplicity is you try and follow at least once delivery guarantees as much as you can, and try and plug those gaps. And the idea is that if somebody wakes up at 3 AM because of a restart, or a service crash, or a disk corruption or something, they’re not worried about “Oh –” They can focus on fixing the problem and then not have to worry afterwards about “Oh, now I’ve gotta chase up however much data we might have lost in the last hour or so…” But it is what it is.
So when you’re setting this up – to make sure I understand this correctly. Essentially, you’re saying when the data comes in, you need to process it before acknowledging you’ve received it, so that that at least once delivery is actually held true throughout the whole system. The best analogy I can think of in like another system that wouldn’t be data streaming might be like if you wrote code that was supposed to write to a database, and somebody passed that data in, and you immediately sent back a message saying “Yeah, it was good”, and then spun up a goroutine that went off to actually try to write the data - it’s very likely that could have an error or something… So is that kind of the same – not exactly the same, but a similar analogy?
Pretty much. I suppose the reason why I kind of feel like these terms are sort of misleading is because it does look – when you’re looking at the protocols and stuff from a beginner’s approach, it looks as though you’re supposed to acknowledge messages when you get it. That’s what the protocol looks like. It’s a very intuitive thing to think you get something, you give something back, and now I’m gonna continue my journey processing this thing, and then I’ll send it on downstream.
But if you wanted to do it properly, which - I’m not saying everybody does need to do it properly; I’m just saying that people should probably be aware of the fact that you ought to do this thing if you definitely don’t want any lost messages… What you should do is you should read a message and then wait to acknowledge it until it’s gone all the way through your service and has reached some destination. That could be that you’ve intentionally dropped it, because you’ve enacted on that message and you’re finished with it, or it could be you’re passing it on somewhere else, but you don’t acknowledge it until it’s gone somewhere.
[19:59] The kind of exception to that is if you’ve got a sort of buffer in your service; you might have like a disk-persisted buffer. That’s how a lot of log aggregators work; Logstash and such tend to have like a buffer which is used to temporarily store the data while you’re processing it. That gives you a bit of resiliency. And that’s true, but then if you’re bothering to set up a Kafka with redundancy and disks replicated all over the place, why then have a single disk as a point of failure that decides whether you lose messages or not? So I don’t think disk buffers really have a place in modern architectures, unless you don’t care.
Break: [20:38]
So for designing them this way, so that you actually act on the data before you acknowledge it, I guess coming from a naive perspective, I would think if you have really long processes of some sort, where you need to take a lot of time to act on the data - could that present other issues, where like Kafka’s trying to send the same message to multiple people, thinking it was never received?
With Kafka – there’s different problems with that. With RabbitMQ - yeah, exactly. Stuff like RabbitMQ, or I think maybe NATS as well, but a lot of queue systems, especially the cloud services, have a lot of mechanisms where if you take too long to acknowledge something, it will assume it’s lost and it’ll requeue it. So if you, say, have 30 seconds to process something and you don’t acknowledge it for 30 seconds - because that’s just how long it takes your service to run - then you have a problem, because then you’re essentially increasing the size of your queue potentially indefinitely, because you’re just not following what they consider to be best practices, which is to extend… What you should be doing is you should be extending your lease, I suppose you could call it. They’ve all got their own terms for this sort of stuff, but any system that will automatically requeue a message usually has some mechanism for temporarily saying “Hey, I’m still working on this, by the way.” Or at least you’d expect it to have a kind of like finger-in-the-air guess as to the maximum amount of time that you’re likely to be processing a message, and then you just configure it to not requeue things… But that’s less ideal, because that doesn’t take into account things like back pressure.
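The "I'm still working on this" idea can be sketched in Go as a background heartbeat that runs for as long as the handler does. This is a minimal sketch: `withLease` and `slowJob` are hypothetical names, and `extend` stands in for whatever call a given queue system offers for extending its requeue deadline, which differs between systems.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// withLease runs work and, until it returns, calls extend every interval.
// extend stands in for whatever "still working" call a given queue offers.
func withLease(work func() error, extend func(), interval time.Duration) error {
	done := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				extend()
			case <-done:
				return
			}
		}
	}()

	err := work()
	close(done)
	wg.Wait() // no extend calls happen after this returns
	return err
}

// slowJob simulates a handler that outlives a short lease and reports
// how many times the lease was extended while it ran.
func slowJob() (extensions int, err error) {
	err = withLease(
		func() error { time.Sleep(100 * time.Millisecond); return nil },
		func() { extensions++ },
		10*time.Millisecond,
	)
	return extensions, err
}

func main() {
	n, err := slowJob()
	fmt.Println("extensions:", n, "err:", err)
}
```

The message is still acknowledged only once `work` has returned; the heartbeat just stops the queue from assuming the message was lost in the meantime.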
But yeah, there’s definitely issues with doing it. There’s lots of weird things. But then the err is on the side of duplicated messages and everything kind of grinding to a halt, rather than data being lost silently… So in my opinion, that’s the better option, but it depends on your system.
I assume your system grinding to a halt is also way more obvious than a couple of messages getting dropped and nobody realizing it until they go back later and look.
Yeah. So Benthos loves grinding to a halt. That’s its default, basically. If something doesn’t look right, I will stop. You will have to tell me what to do. So if you’ve got a message that you just can’t send, maybe it’s too big and you just can’t send it to Kafka, then it will just wait; it’ll say “Okay, I can wait all day. Come and tell me what to do.”
So we’ve talked about a couple different tools… For somebody who’s coming into this and they don’t really know a lot of the tools, can you sort of explain what some of them are and why they might be used? Like, we’ve talked about NATS and Kafka and RabbitMQ, and you’ve mentioned Benthos… But how do they all work together?
[24:00] So those are the queue systems; you’ve got Kafka, NATS, RabbitMQ, all that stuff… There’s lots, and they’ve all got their specific use cases and operational complexities to factor in. Then you’ve got stuff - what happens on top. So there’s things like Spark is probably – I think if you’re talking about data streaming, Spark is gonna come up. And then there’s similar systems to that, like Flink I believe is pretty much the same thing… And cool stuff like Materialize, ksqlDB… Those are tools that will essentially solve a data engineering problem on top of that stream, and it’s usually around some sort of like aggregation in real time of the data.
So you imagine - with a dataset you used to have a fairly static collection of messages, and you were used to doing queries like “How many of these people are happy, proportionately?” And that’s a fairly simple task, because you’ve got static data, so you could put it in a database, or whatever. But when you’ve got a streaming dataset, something’s coming in in real time, it’s a lot more complicated now to give you an answer if you want it in real time… And that’s the whole point of a lot of these people setting these tools up. So what it ends up being is a system that kind of sits on top of a queue system, and you essentially give it some sort of aggregated question to answer. So it could be like a rolling count of how many people made purchases in the last hour, or something, versus leaving the website… Stuff like that.
They’re tools that are programmed, like you build them, but the way that I see it - they’re kind of similar to machine learning tools, where it’s not as if you’re writing a real program; you’re kind of using code to describe what sort of aggregations you want, and then it’s clever enough to go in and do that. And it’s also the hard problems, like distributed processing, which means – you know, it’s in-memory processing, but it’s on a dataset that’s so big you can’t process it on one machine, so to scale it, you have to do a sort of sharding of the data, and all this crazy stuff. And you’ve gotta do windowing… So if you’re getting back some sort of aggregated number, then it’s gotta be with respect to some sort of measure of time.
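To show what "with respect to some measure of time" means for a rolling count, here is a single-process Go sketch. It assumes events arrive in timestamp order and fit in memory, which is exactly what the distributed tools above do not assume; `RollingCounter` and `purchasesPerHour` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// RollingCounter counts events seen in the last window of time.
// It assumes events arrive in timestamp order.
type RollingCounter struct {
	window time.Duration
	times  []time.Time
}

// Add records an event at t and returns how many events fall in (t-window, t].
func (r *RollingCounter) Add(t time.Time) int {
	r.times = append(r.times, t)
	cutoff := t.Add(-r.window)
	i := 0
	for i < len(r.times) && !r.times[i].After(cutoff) {
		i++ // skip events that have aged out of the window
	}
	r.times = r.times[i:]
	return len(r.times)
}

// purchasesPerHour feeds purchase times, given as minutes after a fixed
// start, through a one-hour rolling counter.
func purchasesPerHour(minutes []int) []int {
	start := time.Date(2021, 1, 1, 0, 0, 0, 0, time.UTC)
	rc := &RollingCounter{window: time.Hour}
	counts := make([]int, 0, len(minutes))
	for _, m := range minutes {
		counts = append(counts, rc.Add(start.Add(time.Duration(m)*time.Minute)))
	}
	return counts
}

func main() {
	// Purchases at 0, 20, 50, 70 and 130 minutes after the start.
	fmt.Println(purchasesPerHour([]int{0, 20, 50, 70, 130})) // prints [1 2 3 3 1]
}
```

The answer changes with every new event, which is why the same question is harder on a stream than on a static dataset.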
Those are the tools that do very, very cool, complicated things. Shout-out to Materialize as well. I’m checking them out at the moment. That’s a system that’s built on Postgres, so imagine you’ve got rolling Postgres queries on a stream of data, which is pretty cool… But then the other side of data engineering is making that data what you want it to be. So I would describe it as plumbing. So you’ve got Spark and stuff, which are making useful calculations on the data, and then your data team probably also wants to do things like take a – maybe you’ve got like a comment on an article, and that’s coming in as a stream of data… And what you wanna do is you want to make that data more useful by adding information on. We call that hydration. So maybe based on the ID of the article that it’s commenting on, you might wanna go and grab the article, and maybe you wanna pull stuff out, like what is the article about? Who does it mention? Things like that. And then pull in user information as well, how many dogs do they own, how many cats do they own…
What you end up with is this much bigger piece of data, but it’s much more useful. So when you put that in an index, or some sort of data store, it’s just better. That used to be something that was done in a batched way, so every day maybe you’d kick off a process that does that… And now we’ve kind of got all these tools that let you do that in real time. So if your data volume is so big that you can’t do it in a batched way, you can do it as this continuous stream of data.
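Hydration, as described above, boils down to a lookup-and-merge step applied to each message. The Go sketch below assumes the article and user data are available as in-memory maps; in a real pipeline they would more likely be calls to other services or caches. All type and field names are hypothetical.

```go
package main

import "fmt"

type Comment struct {
	ArticleID string
	UserID    string
	Text      string
}

type Article struct {
	Topic    string
	Mentions []string
}

type User struct {
	Dogs int
	Cats int
}

// Hydrated is the original comment plus the looked-up context.
type Hydrated struct {
	Comment Comment
	Article Article
	User    User
}

// hydrate enriches a comment using two lookup tables and fails loudly
// when a reference cannot be resolved.
func hydrate(c Comment, articles map[string]Article, users map[string]User) (Hydrated, error) {
	a, ok := articles[c.ArticleID]
	if !ok {
		return Hydrated{}, fmt.Errorf("unknown article %q", c.ArticleID)
	}
	u, ok := users[c.UserID]
	if !ok {
		return Hydrated{}, fmt.Errorf("unknown user %q", c.UserID)
	}
	return Hydrated{Comment: c, Article: a, User: u}, nil
}

func main() {
	articles := map[string]Article{"art-1": {Topic: "pets", Mentions: []string{"blobfish"}}}
	users := map[string]User{"usr-1": {Dogs: 2, Cats: 0}}

	h, err := hydrate(Comment{ArticleID: "art-1", UserID: "usr-1", Text: "nice"}, articles, users)
	fmt.Printf("%+v %v\n", h, err)
}
```

The output record is bigger than the input, but everything a later query needs travels with it.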
So the stuff that I kind of specialize in is the tooling that plumbs all those different services together, so you can read from multiple streams, you can multiplex them out to different destinations, and on the way you can hit all these different services and mask the data and enrich it with all this different stuff… And then the core premise is that it’s YAML programming, it’s not a language that you have to compile yourself, which means you can give it to somebody that isn’t perhaps as specialized in code, or does code, but they just don’t wanna do that kind of coding, because it’s kind of boring, all this plumbing work… So they would rather just deploy a tool that kind of deals with that stuff for them.
[28:13] And there’s a lot of repetitive tasks, a lot of CRUD apps and things built on top of this stuff… So that’s kind of why I ended up building this tool, is it’s just sort of a general solution to that kind of stuff.
So I’ve gotta ask, don’t you feel guilty about introducing more YAML to the world?
I love YAML. Oh my God, I love YAML. I could gobble up YAML all day. I have to say, Cue is on my radar. I’m loving what I’m seeing from Cue, and I think that’s probably gonna – although I’ve built a lot of tooling into the app, into the program, so you can… You know, it lints files, and it’s got a solid schema, and all that kind of stuff… So hopefully it’s not horrible working with YAML with Benthos, but there’s a lot of stuff that you could possibly solve as well on top, eventually.
I’m just kidding, because you see all the people using Kubernetes and everything talking about how they’re basically YAML developers at this point.
I see a lot of YAML hate, and I see a lot of YAML love… But yeah, there’s definitely a lot of YAML programmers now. I mean, I’m a YAML programmer… I’m not afraid to say that. I love a bit of YAML. Hopefully there’s something better eventually.
Actually, I used Jsonnet a bit recently, and it’s not too bad… I guess Cue is more interesting, but with Jsonnet there’s already a bunch of stuff in the open source world you can use; and it compiles, it’s easy to kind of get code generated from it… It helps with maintenance and drying up your code. You just don’t repeat yourself so much.
Alright, so I guess my next question would be - let’s say I’m interested in trying out Benthos and setting up a data stream… What are some ways that in a common application somebody might actually be able to take advantage of it? I don’t know if there are any that you can think of… Or is this something where like you need to be in a large enough setup for it to be useful?
Oh, no… I mean, I’m using it for all kinds of stuff, and not really at work. I kind of said it at the beginning - what we’re seeing at this point is that a lot of event sourcing tools are becoming very similar to data engineering tools, and they’re all kind of crossing over… But Benthos is super-general, and because of the types of problems that it solves, you can end up using it for all kinds of stuff.
Some of the stuff I’m using it for is I run it and it hits the Homebrew, Docker Hub and GitHub APIs to get the download data from Benthos… And then it pulls that down, and then it can send it to me, however I want; you can send it to Discord… Discord is an output now. I built a Discord bot with Benthos, and there’s a little cookbook on the website so you can build your own… Or you can do it on the Discord and interact with it yourself. But yeah, you can do all kinds of stuff. You can hit HTTP APIs. Say you hit an HTTP API, or maybe you consume tweets, and every message that you receive, maybe you hit some other APIs; maybe you put it into a database, or something, and then you can mutate it in all these ways. It’s got a mapping language, it’s got all this kind of general-purpose tooling for manipulating data. It’s totally agnostic to what you’re using, so you could be sending images around, it doesn’t really care. Or it could be JSON documents. And then you send it to somewhere… You could be using it to just populate Grafana dashboards, so I use it for that as well. I just pump out Prometheus metrics for various things… But yeah, if somebody wants to play around with it, there’s lots of ways of using it for very boring tasks… Because that’s what it’s for, really. It’s for very boring, basic things.
So when you mentioned the Discord bot, I assume it’s gonna be something set up where Benthos handles the stuff of getting the messages from Discord and sort of streaming them to you, and you essentially just have to make whatever reaction or do whatever you wanna do regarding that, or how would that look?
I don’t wanna give out too much of the special sauce, because I don’t wanna ruin people’s interaction with the bot. It’s magical, as it currently exists. But basically, it’s reading a continuous – so the input is just polling the Discord API for messages. And then what I’ve got is in the Benthos YAML config format I’ve basically got a load of – it’s basically a switch case expressed in Benthos land, where you can have these little mapping queries to dig into what the message contents are. So it’s got some pre-canned responses, so you can do things like /joke, and it’ll tell an awful joke, and /roast and it will roast you…
[32:16] [laughs] Don’t do it.
And then there’s some special responses it has, particular commands… But then the other one is – it also reads from a separate channel that’s only visible to me, and I can type messages in that channel and it will echo it into the general chat. So it basically acts as my voice. It is a stream. You can think of that deployment as a stream of data, because it’s reading from a Discord channel as a continuous stream of data… And then it’s writing a stream out which is spewing messages out into the general channel.
So it fits the paradigm. It’s not data engineering, I don’t think, by most people’s standards… But you can use the same tool for this stuff, because at the end of the day, all you’re doing is moving data around, and manipulating it in some ways… So it just kind of fits in a lot of the use cases.
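As a rough illustration of the "switch case" idea described above, here is a minimal Go sketch. It is not the Benthos YAML config and not the actual bot: the `Message` type, the `privateChannel` name, and the placeholder replies are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical stand-in for a chat message read from the input stream.
type Message struct {
	Channel string
	Text    string
}

// privateChannel is an assumed name for the channel only the bot owner can see.
const privateChannel = "private"

// respond decides what, if anything, the bot writes to the general channel.
// The boolean is false when the message should be ignored.
func respond(m Message) (string, bool) {
	// Anything typed in the private channel is echoed as-is.
	if m.Channel == privateChannel {
		return m.Text, true
	}
	// Otherwise, dispatch on the command, like a switch case over the contents.
	switch strings.TrimSpace(m.Text) {
	case "/joke":
		return "Insert an awful joke here.", true
	case "/roast":
		return "Insert a roast here.", true
	default:
		return "", false
	}
}

func main() {
	incoming := []Message{
		{Channel: "general", Text: "/joke"},
		{Channel: "general", Text: "just chatting"},
		{Channel: privateChannel, Text: "Hello from the blob."},
	}
	for _, m := range incoming {
		if reply, ok := respond(m); ok {
			fmt.Println(reply)
		}
	}
}
```

The point is only the shape: one stream of messages in, a per-message decision, and one stream of replies out.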
So you’re saying if I wanna pretend like my company was really big, and I had like ten support agents, I could just set it up so I could just have my own channel privately and have them respond like they’re a different person?
Don’t give away too much on the stream… But yeah, you can do stuff like that. You can DDOS people. I’m pretty sure I’ve DDOS-ed people accidentally with it.
“Accidentally” being the keyword there… Disclaimer…
[laughs]
Totally an accident.
Can we cut that bit? [laughs]
It’s one mean bot… It’s very, very mean to you. [laughter]
It sounds pretty –
But actually speaking of very mundane tasks - you know, imagine even large file transfers. If somebody has a whole bunch of data in a legacy data source and they want to put it in the cloud, or do something with it, like gigabytes/terabytes of data, you can use Benthos for that and it works pretty well. The thing to keep in mind is that it shouldn’t aim to put 50 gigs in memory and then transfer it somewhere else. That just doesn’t work. So what we end up doing is chunking it. Right now we’re just using an arbitrary chunk size of, say, 50 megs, or whatever… But it kind of has downsides as well, because if it’s binary data and you just chunk it like that, then you can’t really do much with it while it’s in flight… Whereas if it’s some sort of structured data, like let’s say JSON or CSV or whatever text format, then you can also potentially profit from the fact that it’s in-flight, you’re transferring it, but you’re also modifying it. Maybe enriching it from some other third-party source, or modifying various issues in it that might help people who are working with it later on to do their work better, or just supporting tasks that are very much needed in big companies.
I’m always surprised and scared by some of the things people are doing with it.
Yeah. Right now a bit of my work is adding adapters and such, so we can plug into various legacy sources and have those stream to other places…
Something to mention - it’s got a super-cool plugin API, so you can write your components in Go pretty much using the same API as the native ones do.
Break: [35:15]
Plugins are always like an interesting topic, because I feel like in different ways people have sort of implemented them differently… At one point we’ve talked with Mark Bates about how he did plugins for (I think) Buffalo, and I think he’d gone through like two generations of it to try to figure out what made the most sense, and I think it ultimately became like a single-method interface that you just had to implement, or something…
It’s very similar.
Okay.
But yeah, you have to bake it in. There’s no dynamic plugins to load at runtime; you have to compile your own version.
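To make "bake it in" concrete, here is a minimal Go sketch of compile-time plugin registration with a single-method interface. It does not reproduce Benthos's actual plugin API: `Processor`, `Register`, `registry`, and the `upper` plugin are all hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Processor is a hypothetical single-method plugin interface.
type Processor interface {
	Process(msg string) (string, error)
}

// registry maps plugin names to constructors. It is filled in at compile
// time by init functions, so nothing is loaded dynamically at runtime.
var registry = map[string]func() Processor{}

// Register adds a plugin constructor under the given name.
func Register(name string, ctor func() Processor) {
	registry[name] = ctor
}

// upper is an example plugin that upper-cases every message.
type upper struct{}

func (upper) Process(msg string) (string, error) {
	return strings.ToUpper(msg), nil
}

// In a real project this init would live in the plugin's own package, and
// you would import that package when compiling your own binary.
func init() {
	Register("upper", func() Processor { return upper{} })
}

// run looks up a plugin by name, as a config file might refer to it.
func run(name, msg string) (string, error) {
	ctor, ok := registry[name]
	if !ok {
		return "", fmt.Errorf("unknown plugin %q", name)
	}
	return ctor().Process(msg)
}

func main() {
	out, err := run("upper", "hello blobs")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

The trade-off is the one described above: plugins get the same interface as built-in components, but adding one means compiling your own version of the binary.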
Okay. Is there anything else about Benthos that we should know at this point? I’m coming up with a blank as to what to ask you next, and I feel like you would know… What are some interesting things people would like to know about it if they’re gonna give it a shot, or try it out?
Just look at, literally – the thing about it is it’s pretty much just… It’s not gonna be as dynamic as just writing code. If you wanna do something, then do it. I’m not a big user of frameworks, and stuff… In my opinion, if I was gonna write a bespoke service for reading from a particular thing, I would probably just use the direct client libraries, because I just feel like that’s usually the Go way, and it’s what I prefer… But I think that when you see what the config looks like for certain things, you imagine you can read from three different queue systems, and do some mapping on the documents conditionally, and then you can write it out to several different places, multiplexed by the contents of the message… And that’s like 20 lines of config. All you have to do, for most people, is show them what the config looks like and they’ll know whether or not it’s something that they’re interested in… Because I think if you don’t have to deal with a lot of this stuff often, it’s probably more fun to just write the code and not use something like this… Whereas if you’re sick of solving the same problem over again – like, we’ve written a RabbitMQ consumer that writes to Kafka and removes the field foo, and then we’ve got another service that reads from NATS, and then it writes out to an S3 bucket, and it tar gzips the files… You know, if you’ve written that same app a million times and you’re getting fed up with it, then it’s for those people. It’s people who use that kind of – people are in that space, and they’re sick of working on the boring junk that they’re being asked to do by their data science department, and they just wanna work on fun stuff.
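For comparison with the "20 lines of config" claim, here is roughly what the hand-written version of that pipeline looks like in Go: several inputs merged, a conditional mapping that removes the field foo, and outputs chosen by message contents. This is a sketch under assumptions: in-memory channels stand in for the queue systems, and the `type` field and output names are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a decoded document, e.g. a JSON object.
type Message map[string]any

// source is a stand-in for a queue consumer: it emits fixed messages and closes.
func source(msgs ...Message) <-chan Message {
	ch := make(chan Message, len(msgs))
	for _, m := range msgs {
		ch <- m
	}
	close(ch)
	return ch
}

// merge fans several inputs into a single stream.
func merge(inputs ...<-chan Message) <-chan Message {
	out := make(chan Message)
	var wg sync.WaitGroup
	for _, in := range inputs {
		wg.Add(1)
		go func(in <-chan Message) {
			defer wg.Done()
			for m := range in {
				out <- m
			}
		}(in)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// mapMessage applies a conditional mapping: only "order" documents lose "foo".
func mapMessage(m Message) Message {
	if m["type"] == "order" {
		delete(m, "foo")
	}
	return m
}

// route picks an output name based on the contents of the message.
func route(m Message) string {
	switch m["type"] {
	case "order":
		return "orders"
	case "click":
		return "clicks"
	default:
		return "unrouted"
	}
}

// runPipeline wires the stages together and collects messages per output.
func runPipeline(inputs ...<-chan Message) map[string][]Message {
	outputs := map[string][]Message{}
	for m := range merge(inputs...) {
		m = mapMessage(m)
		name := route(m)
		outputs[name] = append(outputs[name], m)
	}
	return outputs
}

func main() {
	a := source(Message{"type": "order", "foo": 1, "id": "a1"})
	b := source(Message{"type": "click", "foo": 2})
	c := source(Message{"id": "c1"})
	for name, msgs := range runPipeline(a, b, c) {
		fmt.Println(name, len(msgs))
	}
}
```

Even this toy version leaves out real clients, retries, acknowledgements, and backpressure, which is the repeated plumbing the conversation is about.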
Okay. So basically if they’re in a position where it would make a lot of sense, they’ll realize it, because they’ll be like “I’m so tired of doing this that I wanna throw my computer out the window.”
But a thing is as well - I feel like a lot of people just don’t realize that there’s tools out there that can make that stuff easier. I think that’s the problem we’re kind of in now, because it’s sort of organically grown. I made it out of a kind of defensive position of “I wanted that tool for my own purposes.” And then the more people use it, and then it kind of grows organically enough… Not really marketed that heavily beyond just putting really whacky stuff out on Twitter. So I’ve not done a massive job of putting it out there. But I felt like a lot of people just don’t realize there’s tools like that, and they take a while to trust it as well… Because the main selling point of this product is that it looks after your operations people, and prevents them from having massive panic attacks at 3 AM. They just have a minor panic attack. And it takes a while to get people to trust that…
[39:20] Yeah, trust is important. Like, I’m talking to a bunch of bio-informaticians and they’re not aware of this, of course; they have their own ways of doing things… And just getting them “Hey, there’s this nice Bloblang DSL that makes a lot of this work that you’re doing very easy.” Imagine there’s a bunch of JSON APIs out there that have a huge volume of data, and they might want to extract stuff from them. Maybe query something for a certain gene, or do some sort of cross-genome analysis, or whatever… In those cases - yeah, you can write your own bespoke tool, sure. Or you can use something that is more traditional, like Spark, or whatever… But you can certainly explore other tools as well, and that’s something I’m trying to promote, and hoping to see some users there as well. Also, this is open source and free, and by far this is one of the most responsive projects out there… Sending a PR, getting somebody to review it in a few hours - it’s not that common.
It’s like that for now. Until I get bored. [laughter]
Alright, Ash, do you wanna start with your unpopular opinion that might upset everybody, or the other one that’s more related to Benthos?
I’ll go with the one that is gonna make everybody shut this stream down immediately… So people who vote on Twitter polls are losers, and they should get out more. Nobody cares about your opinion, it doesn’t matter.
That’s harsh. I don’t even know where we go with the discussion for that one.
Harsh but true.
I feel like hiding under the table now.
Do you vote in the polls?
Sometimes… [laughs]
Has anybody ever cared about your vote? Nope.
Sometimes you just wanna see the answer. See the aggregate data.
Sometimes you just wanna see the answer… Yeah.
I just like to ruin their stats, throwing a curveball in the…
Can I just clarify? Because I think there’s a lot of people upset out there who need not be… If you’re voting in Twitter polls because you want to see the answer, you’re not a loser. You probably do have a life, and you don’t need to get out more.
I don’t know if that necessarily helps that much… [laughter] Alright, your second unpopular opinion. Can you share that one?
Yeah, so as an open source author, I think I’ll probably get to say this, but I kind of feel like there’s been a lot of noise… Well, not noise. Meaningful articles and meaningful stuff out there, people who kind of write open source projects on the fact that businesses kind of rely on them, and they don’t contribute back, and it’s not a particularly healthy ecosystem… Personally, I think open source is a form of business, and I think that a lot of the discussions around open source ethics and how we should model things I feel like doesn’t kind of follow that logic…
I feel like as software engineers we tend to see businesses as rapid-growth funded, massively huge corporations that are designed to dominate the planet… But the reality is most businesses are very small, maybe just one-person, two-person organizations. Maybe they’re a dry cleaners, maybe they’re a pie shop… And I feel like open source is very similar to that, where you’re not necessarily making money directly at this point; you’re just trying to grow something because you enjoy working on it… And there is a prospect at some point of maybe making money from it, and turning it into a living, and growing it, but I feel like we don’t really treat open source developers like that. We tend to treat them like they’re already this big corporation that I can use as much as I want and not give anything back… Or we see them as kind of like charity cases.
GitHub Sponsors is great, I feel like that’s a great avenue, but I don’t feel like it’s the only one. I don’t think we should be treating everybody who’s working in open source as if they’re a charity case and that we need to rescue them necessarily. I feel like we should be seeing them as people who have lifestyle businesses, and maybe it’s gonna grow, maybe it doesn’t… I feel like there’s a lot of projects that are very akin to maybe a niche burger joint in San Francisco, and then overnight they’re a social media sensation and they’re feeding the entirety of San Francisco, and they just can’t cope with it, and the business that they’d really enjoyed working on is no longer enjoyable… But we wouldn’t say the solution to that is to throw donations at them. The solution is – I don’t know what the solution is. I see open source as kind of like a Business Lite.
[44:03] I mean, you could easily look at it as open source is by default an unprofitable business where they’re just selling everything for free. If you view it that way, it’s – I mean, I completely agree with you. I think most people who build open source projects have this idea of - at least in their mind, they wanna do something to make it profitable enough that they can at least continue to work on it as long as they’ll enjoy it… But at the same time, I completely agree with you that the community doesn’t always like to acknowledge that.
I remember when Caddy changed their licensing, there was like a huge outcry over that. The code was still open source and everything, but they were already talking about forking it and everything else… And I completely get why they were doing it. They were like “We have this big, popular thing and we don’t make any money off of it. Or not enough to sustain, really.” So in that sense, it is kind of weird that we rely on it so heavily, but then get so weird when people try to treat it remotely like a business.
I guess I’m also – I’m on your side on that one, just because I’ve done a little bit of open source stuff… And of course, I teach programming, and I’ve kind of had to base it around a similar model of like some things are free, and then some things are paid… And I feel like that’s one model that works for open source, but it definitely doesn’t apply to all open source. I don’t know if you can do it with Benthos or not, I really don’t know, but there’s some open source projects where it’s really hard to take away and make something paid without completely ruining the product.
Yeah, yeah. Okay, maybe go with my first unpopular opinion, because that one maybe doesn’t seem that unpopular.
Well, that one - I think it’s gonna depend, because I think people who have written open source will probably resonate with it, and people who have not written open source might be like “No, they’re not businesses. They’re supposed to give it all away for free as like a humanitarian act, or something.” I don’t know, I have mixed feelings there with some people… Because people generally tend to expect the world for free.
But I feel like as well there’s a lot of open source maintainers who don’t like the idea that it’s kind of a business because they don’t like the idea of these super-funded mega-corporations and exponential growth trajectories… But that’s just a small business. You wouldn’t go to the dry cleaners down the road that’s like a two-person family-owned business and expect them to get VC funding and then hire a hundred people in a year. They’re doing it because they enjoy that. That’s their thing. They enjoy the process, and their goal is to make a living; it’s not to grow and dominate the planet with their superior dry cleaning.
I mean, it’s probably hard in the sense that the dry cleaning example - they can choose not to open up new locations… Whereas like open source - you can suddenly have a million users and not know how to handle it… Because Mihai, you mentioned it - Benthos PRs are answered really quickly, but at the same time there’s a lot of projects where people get overwhelmed and people don’t structure PRs correctly, and after a while it’s just like “I’m spending 10 hours a day just trying to correct things when people are submitting bad reports”, and things like that; then it gets exhausting, and there’s not really a filter there… Whereas with the physical location, you can limit it to be like “We just have one. We aren’t gonna open a second.”
Uhhh a burger stand if somebody tweets… If somebody gets a good tweet, a burger stand that was perfectly happy one day could be miserable the next, with like a two-hour queue and angry customers, running out of condiments.
You’re right, but there’s also a cap.
Yeah.
At the end of the day, they can say “We’re open from these hours, and when we run out of burgers, we’re out of burgers, and that’s life.” Whereas open source - you could have bug reports pretty much infinitely, as long as there’s people on the internet to find it and file reports.
I guess it’s like Benthos, right? If you hit it too much, it grinds to a halt, because it just doesn’t scale horizontally that much to have a proper center of competence of maintainers who can process all of those at the same time… So I guess things will start stalling at a certain point if the volume increases.
But also, if my local family-owned dry cleaners shut shop one day because it was too much for them, they’re getting a brick through their window. [laughter]
[48:12] You’re, um, very kind.
[laughs] I need my dry cleaning.
So – I don’t think I own clothes that need to be dry-cleaned, but I’ve also worked from home for like 7-8 years, so… I have like one suit, and my wife had wedding season with her friends, and I’m like “I’m wearing the suit all the time”, and then the next year I don’t wear it at all… So during the wedding season I’m like “Maybe I should get a second suit”, and then once it died out, I’m like “No, I don’t need a second suit.”
Yeah, but when you do need it, you need a dry cleaners to get all the dust out.
That’s true. But aside from that, I don’t go dry cleaning very often. But okay, so if you’re looking at these open source projects as a business, can I ask how does that change the way you look at Benthos?
Yeah, my first goal with Benthos was to just get to use it at the place where I currently worked, or where I worked at the time. And the idea that I could solve this problem that wasn’t really being acknowledged at the time, and it was open source, I’ll put my time into building this solution now, since we’re not gonna be doing that as part of our job… But then one day, if they do adopt it, I’ll get dividends by being paid to work on my little thing, and then if I change organizations later, I can bring it with me.
So that bit - I suppose it’s not entirely like a business, but it kind of is, because you’re sort of starting a side company hoping that your company is gonna start using them, basically. And then all I’ve really wanted was to be able to work on it to some extent. And then once that happened, I was like “Okay… More. I want more of that. I wanna have more time to work on this thing, and also still be getting paid.” And then it’s pretty much just like stepping stones, stepping stones, doing that. It’s not always immediately obvious what the next step is, but you’re pretty much building a – if I was braver, then I probably would just quit my job and work on it full-time, with no pay… But also, I’m not like that. I like to have a living and take things slowly. For me it’s been like a gradual thing. But you can consider it as sort of like, I guess, moonlighting a separate company; it’s just not obvious where the money can come from.
Did it cause you to think about it in the sense of like “How am I going to make money?” Did it cause you to think about putting certain features behind like a pay gate, or something like that? Or was it a little bit different?
Obviously, this business model is for open source where you’re able to scale it and grow it massively in a short space of time. So with stuff like that, with that kind of growth, you need to have something specific to make that money from, you need some sort of channel… Whereas I think if you’re happy just being on your own and making potentially just enough to get by yourself, then you don’t need to do any of that stuff; you can just do support.
It’s not the best way of funding an open source project with this support model, because it does kind of put you at odds with the project’s goal. The project’s goal is to be easy, and support is kind of the opposite of that. But if you’re only interested in keeping yourself going, then there’s not really any conflict, because I can make it incredibly easy. There’s gonna be a few people who still want some extra stuff on top, or help with it.
So I’ve obviously had to think over time about how I would fund it going forwards if I wanted to expand it… And obviously, I’m in that mindset now, because that’s kind of what the next steps are. But to be honest, I’m quite happy just carrying on like this. I mean, if somebody told me “This is your life now” - basically, I’m essentially just a consultant around the project, and somebody said “You’re never gonna do anything other than this your entire life”, I’d be like “Alright. Okay. I’d like to retire at some point, but it’s fine.”
[52:00] When you mention support, I think there’s at least some of the models I’ve seen that seem to sort of go with the support, but not quite exactly support… What’s coming to mind is Tailwind, which - I don’t know if you’ve ever used it, it’s a UI CSS framework. But they have like a paid version, which is like pre-built components, and I guess one of the ways I could see people supporting is kind of like – you have it on the website, actually, for some of them; the cookbooks, of like “Here are different ways to build some certain things”, and some pre-designed ones, which to me is well beyond like “Here’s how you use it and here’s the basics of getting started.” You could actually have pre-built things like that, of like “Here’s a really common setup we see, and here’s some code that already works for it all, you’re welcome to use it, you just have to pay for it”, or something.
So I guess why I was bringing that all up to say was I view that as kind of a support model, but it’s not really like a “making it harder to use” one, it’s more of a “We’ll do some work for you and provide it for you, just to make your life easier.”
Yeah. I could definitely get by with stuff like that. I mean, I wouldn’t bet a thousand-person company built around that model… You could do it, but I just don’t think it’s like the most ideal setup for that kind of business. But for one person, you don’t need to go that far. I just need to get by. I just need to be able to buy my magazines, and get my dry cleaning every week, and I’m happy.
There’s always gonna be a need for some sort of custom adapters, and such. There’s always plenty of legacy sources, legacy whatever, or existing things that - you know, it might be nice to have everything in Benthos, and then Benthos is like the kitchen sink of all the sources and things… But some of them just don’t belong there. Like, I don’t know, maybe somebody wants a Sybase adaptor… Like, who cares about Sybase?
I’m not putting that in… But yeah, it’s infrastructure as well. Lots of people essentially build a business around it, and then if they want support for it, they’re willing to pay up… Because even if they don’t really need an awful lot, it’s more they wanna support whatever I’m doing, basically. “We want you to carry on doing that thing indefinitely, please, because our business is pretty reliant on it.” But yeah, I’m not struggling, basically.
Yeah, that makes sense. It definitely seems like people are coming around more to intentionally trying to support the open source code they use… And I’d hope that continues, because it seems like something that if it went the other direction, that open source would very quickly just sort of fizzle out as something that’s not really doable.
Alright, I was gonna ask Mihai if he wanted to share an unpopular opinion…
Well, I didn’t have anything specifically related to Benthos, but I’m always happy to rant about the fact that I just don’t accept being grilled in interviews anymore… Like technical – “Go to the whiteboard. Do an algorithm.” That’s just not happening to me. I had plenty of that in the past, it always ended up like some sort of miserable failure, or I just came out of it very unsatisfied… And I took it on myself to just have this very public stance on it and just say “No, I’m not doing it.” If anyone contacts me and says “Hey, we want to hire you.” “Okay, what does the interview look like? Does it have this algorithmic test? Bye…”
I guess my first question would be do you think you could have done that before you were like a seasoned developer?
No, it’s certainly a privilege. Of course, I can do it right now, because I know I can find a job easily without having to put myself through that… But yeah, I definitely sympathize with people who are just starting and they definitely can’t avoid this. I’m hoping that more companies are gonna realize that it’s like a leaky bucket, where you’re gonna get maybe a few people, but there’s gonna be those few people who just don’t fit well to this interviewing style. You’re gonna get some people who just get very nervous, and they’re gonna end up failing miserably, although they are probably gonna do a good job as a developer. That’s my take on it at least.
How does that affect your process for finding – if you’re looking for work or a job? Is it just asking recruiters or whoever you’re talking to what the interview process is like, or do you actually actively avoid certain companies?
[56:02] Yeah, I definitely avoid top FANG companies. They have protected that mindset of interviewing people and they don’t want to change it… And that’s fine, I don’t want to work for them. Or I don’t anymore. I tried, and that didn’t work out well. But you know, it takes maybe a different mindset; you have to keep a very open mind and say “Hey, I’m willing to do other things that are maybe just as good to evaluate my skills.” Some people are gonna be happy to let you code something in your free time, or they might say “Hey, maybe you can go to this open source project and contribute something”, and based on that contribution they’re gonna be happy. Or they might just be very happy to talk about some architecture, and have you design that on the board in front of them… And that’s usually, at least for me, way better than going and writing code… But it’s all about flexibility and finding the people that are flexible, and looking at non-standard sources for jobs.
Many people are just gonna be like “I put up my profile on LinkedIn, I put up my profile on these two other sites and I’m expecting people to contact me.” I’m usually more proactive than that, so I might go on Twitter and stalk a few people who are looking for jobs and see who replies, or there’s a bunch of Slack groups where people are now advertising various niche jobs. That works much better.
That makes sense. You mentioned a couple different processes that you can use to evaluate skills. Is there any that you prefer, or – the ones that you think make you… I don’t know how to phrase this… As the interviewee, are there ones that you really prefer that feel like they showcase your skillset best, or…?
Yeah, so I’m in a position where I get to evaluate people sometimes, and personally, I do prefer to see some code from this person. There’s people who – and probably most developers out there have never really written any open source code, and they don’t have any big project to showcase, but personally, I would still like to see some sort of small example… I’m pretty sure if they have at least ten years of experience, surely, they wrote some small script at home that they would be okay to share, and show something that kind of works, and is decently well written.
I’m kind of particular about seeing well-structured code. Again, kind of to promote Benthos - seeing the well-structured code in there, it just makes me happy. I’m happy to contribute to that code. It has a certain level of abstraction and maintenance and so on that is not easy to see in other places. There’s quite a lot of internal code in many companies where I worked that is just impossible to maintain going forward; it’s just legacy by default, ever since you started. I’m kind of trying to avoid that.
Ideally, if I get to hire people, I would like to hire somebody who pays attention to detail. That’s something that I appreciate in a company. If I see that they care about this when I go interview with them, then that makes me happy.
That makes sense. I know that problem is usually a very tricky one, in the sense that – like, I agree with you that algorithmic interviews don’t showcase a lot, at least for everybody; there’s some people who are gonna thrive with those, and there’s others who are just never gonna do well with those, even though they might be great engineers. But I know there’s also countless people online that I’ve talked to who are working a job where they can’t basically share any of the code from that job, and sometimes the way the agreements they had to sign with the company are set up, it makes it really hard for them to do almost anything outside of that. And I’ve seen some companies that when they do the interviews, they’re like “Well, we’ll do a two-hour paid project”, but then if you’re working somewhere, there’s oftentimes a clause that says you can’t moonlight for somebody else…
Yeah.
So there’s all sorts of troubles there, and it’s not an easy problem, for sure… But I agree with you that being flexible is definitely useful, in the sense that – the way I’d put it is I don’t necessarily think they should be banished entirely; like, if somebody really prefers to go that route, then cool… But I definitely agree that forcing everybody to do it is kind of silly, because pretty much every senior developer you’ll talk to has been like “I haven’t touched that stuff in…” however long they’ve been working professionally, pretty much.
[01:00:12.03] Yeah. I mean, it’s good just to be flexible, and I do see the merit of helping people who just cannot afford to spend more than five hours interviewing, or whatever… And that’s fine. I’m pretty sure there’s plenty of companies who are gonna go with that and everybody’s gonna be happy. Just make sure you don’t make it impossible for people like myself to find a job. [laughs]
Hopefully that doesn’t ever become the case… Generally speaking, I feel like once people have worked long enough, like you said, you get kind of privileged in the sense that it’s a little bit easier to find work, and you usually have peers that you can leverage that helps… But for junior developers, that’s a tough problem. I think I just recently saw a tweet that – I forget who it was, but they basically said if you think it’s hard to hire senior developers now, wait ten years, when nobody’s been hiring junior developers for ten years. [laughter]
I’m actually surprised how many people don’t even ask – like, you have somebody who just graduated from university and they come in for an interview… They are not asked “Hey, do you have some projects from university that you’d like to share with us?” No. Just “Here’s the algorithm. Solve it. Okay, you solved the algorithm. You’re hired.” Why? It’s kind of a missed opportunity. You get to see more of that person, if it’s possible. If not, that’s also fine.
Yeah… I mean, the worst part for me was in university almost all of my side work was on a programming team which did algorithmic-type problems… So even if I had side code, it was pretty much all that, and it would have been like “Oh yeah, go look at Topcoder, and like the thousands of problems I did in my spare time.”
That’s also interesting.
But it’s also weird, because that code is like – looking back at any of that code, I’m like “This is not at all useful in writing sustainable code”, because when you know your whole thing is gonna be a hundred lines of code, sustainability is irrelevant. You’re just like “Yup, globals are fine. Throw them everywhere.” [laughter]
Yeah… Well, that’s why some companies [unintelligible 01:02:03.19] they have this program where we get people who are really new. Imagine somebody who might have a six-month bootcamp and they don’t have a formal training in CS… And they come in and they’re not really expected to deliver anything substantial, so we give them a lot of time to “You know, just go do documentation. Build something that you like… A kind of thing you can show, or something you can talk about and reason through in detail.” That’s a process of learning. I hope that, you know, we’re not showing them only bad code.
Alright. Mihai, Ashley, thank you for joining me. I guess I probably should have called you Jeff there. Just to mess with you.
Don’t…! [laughs] Thanks for having us.
Thanks for having us.