Johnny, Mat, Jaana, and special guest Stevenson Jean-Pierre discuss serverless in a Go world. What is serverless, what use cases is serverless good for, what are the trade offs, and how do you program with Go differently in the context of serverless?
Linode – Our cloud server of choice. Deploy a fast, efficient, native SSD cloud server for only $5/month. Get 4 months free using the code changelog2019. Start your server - head to linode.com/changelog
Datadog – Cloud monitoring as a service. See inside any stack, any app, at any scale, anywhere. Datadog is cloud-scale monitoring that tracks your dynamic infrastructure and applications. Plus next-generation APM. Monitor, troubleshoot, and optimize end-to-end application performance. Start your free trial, install the agent, and get a free t-shirt!
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com.
- Kubeless - The Kubernetes native serverless framework
- Knative - Building blocks that simplify how you deploy and run functions atop Kubernetes and Istio. On any cloud.
- This tweet from Kelsey Hightower - “In less than 15 minutes I was able to open a new @zeithq account, install the Now cli, create a Go function, link it to GitHub, deploy it, and hit it with curl. 🤯 If this is the direction general compute is headed, count me in.”
- This tweet from Ian Molee - “Watch me code, deploy, and exercise a “serverless” Go function in about a minute, using @zeithq zero-config. In 2-3 years remember @jessfraz told us about #configless in 2019!”
Hello, and welcome to Go Time, the show where a diverse panel and special guests discuss all things Go is known for, including cloud infrastructure, distributed systems, microservices, and especially today, serverless.
My name is Johnny Boursiquot, and joining me today is a stellar cast of characters, as Mat Ryer usually puts it, including Mat Ryer himself. Say hello, Mat.
Making her triumphant return to our panel is Jaana B. Dogan, a.k.a. JBD. How have you been, Jaana?
Yeah, good. How are you?
Doing well, doing well. I hope you’re ready for this, because this is gonna be good. Last but certainly not least is our special guest, a serverless connoisseur, Stevenson Jean-Pierre. Sak pase, Stevenson?
Map boule… How are you?
[laughs] Good, good, good. For those of you paying close attention, Stevenson and I are both originally from Haiti, so that’s a little “Sak pase/Map boule” thing right there.
Today’s show is a special one, it’s near and dear to my heart, because I am a fan of using what has become known as serverless technologies. We’ll get into what that really means, and why we call it serverless, although that’s more of a marketing term… But we’re gonna get into that.
Let’s start with some ground-setting. What is serverless technology? Where did that term come from, what is it really trying to relay? We know it’s a marketing term, but what is really the intent? When you’re using serverless technologies, why would you reach out for it? What is it? Let’s do some ground-setting here.
From my point of view, it’s just about not having to worry about the deployment too much. I use Google App Engine a lot, the standard environment, and I like that. In fact, that was the reason I first got into Go in the first place, because to use it, you have to either write Java, or you have to write Python, and I didn’t know either of those… And there was this other little language; it just said Go, with a little Experimental badge on it, and I’m like a magpie to that kind of stuff. I was very interested, and that’s when I first discovered Go.
The promise of App Engine is you write your Go code and you give that code to Google, and then they will make sure they can run it when it’s needed. And that as a developer is nice, because it was never an area for me that I was particularly interested or particularly well-skilled at… So it was nice – at least the promise is nice for me, from a developer’s point of view.
[00:04:03.07] For me serverless is a combination of two things. For me it’s a lot of event-driven work; I consider it serverless when it’s driven by an action being taken, as opposed to constantly just being up and waiting for some kind of request to come in… That’s the compute side. But then you even have serverless technologies coming out on the data storage side, things like RDS Serverless or Aurora Serverless. So you can have compute, and it goes to Mat’s definition there, where you don’t have to worry about the underlying engine, you don’t have to worry about configuration… It’s just there and ready to scale when you need it to.
For me it’s similar to what Mat is saying - it’s more of like, I don’t have to deal with infrastructure that much. It’s more of an abstract layer on top. Some things are just taken care of on my behalf.
And I think the other important aspect is that it’s more of a pay-as-you-go model. If you don’t use it, it scales down to zero; you pay as you go. That’s what the definition of cloud should be, to be honest. But this is a really tough topic, because I think serverless became kind of this umbrella term, and I think it means more abstract things… But there are so many different layers of abstraction, and each higher layer is actually more serverless than the lower layers. That’s why I think it’s good to say that the less you care about infrastructure - operating and maintaining the infrastructure - plus pay-as-you-go, and scaling down to zero - that’s serverless to me.
So the common thread here is basically not having to worry about managing the infrastructure that is running your functionality, right? Be it compute, be it storage, be it some sort of integration with the event sourcing thing, for example, like in the case of AWS, where they have different things that can trigger functions and whatnot. You as the developer don’t have to worry about the plumbing, the underlying infrastructure. You don’t have to provision instances, you don’t have to do any of that stuff yourself. Basically, you’re really sort of stitching together or linking together different Lego blocks that do certain things, that react to certain things whenever they happen with your environment.
Yeah, I think the promise is “Just care about the business logic, and we will take care of everything else. And these are the fundamental blocks you can use.”
When I first started getting into serverless - that term means different things to different people; there’s even a framework called Serverless, and that’s really not what we’re talking about here. We’re talking about the concept; no one technology or no one framework, no one product. When I first started exploring it, I kept seeing these use cases around, like “Upload an image to S3, and then something creates a thumbnail.” Almost trivial use cases… And I’m like “This stuff is way easy, way super-easy.”
Use cases for me have always been very small microservices things that aren’t even worth spinning up infrastructure to run. It could be maybe like a 50-line script that does some specific functionality. Also, like I mentioned earlier, the event-driven stuff - things that are dependent on some event happening before they fire off… Another great use case has been as just a general cron replacement. You always have the issue of making cron highly available, and of having multiple servers with the same schedule not stepping over each other, so you’d have to implement weird locking mechanisms… But by having serverless functions, you can depend on the higher-level timer from the cloud provider, and you can have a single cron source.
[00:08:04.05] And for not so good use cases - anywhere that I’ve had to maintain state, or maintain some kind of cache for speed, and things like that; serverless is not conducive to state, as it actually forces you to be very stateless, unless you wanna go to some network storage… So any of those use cases where long-term state on the app tier is important, serverless hasn’t been good for me.
Steven, you mentioned a few times events, and that really these functions run in response to events. What sorts of things are events? What kinds of things can happen? What sort of examples are we talking about?
When serverless first started, a lot of the use cases were just pure HTTP request and response style cycles. But then you had these cloud providers kind of plugging in the ability to integrate with their other services. For example, for Johnny’s use case around S3 upload and things like that, now you have events coming directly from S3 to tell you “Hey, something has happened in this bucket. An upload has happened, a delete has happened” and you can now asynchronously take action against it; you have the same thing against any other type of source, where it’s telling you it’s pushing that kind of event to you, it’s pushing that payload to you, instead of you going out and polling and finding out when things happen.
You’re getting a payload that’s telling you what’s happened, and it’s in a schema, it’s in a shape that you understand from your function, and you take action against that. It makes life a lot easier, and it becomes that glue layer, like Johnny described, where these services are actively telling you what they’re doing, and then you respond to them.
If you think about it from a cloud provider’s perspective, serverless is so fundamentally important because that’s the only protocol they have to talk back to you. You need to provide some arbitrary execution environment for events, because you can talk to your cloud provider, but otherwise they cannot talk to you. So it’s not surprising that it became so fundamentally useful, because that’s how they talk back to you.
Yeah. In the App Engine, especially now in the latest version of App Engine for standard environment, you basically write your Go program as a normal program; it’s actually package main, and you use the handlers, you use whatever you’re gonna do… And then you ship that to App Engine, and then I think it scales to zero, so there’s nothing running. And then the first HTTP request spins up the instance, it spins up your program, and in theory then you can start replying to those requests… So I’ve tended to use it in that way, of really still just a web service that I’m putting up there. And it might be serving a website, and associated services, but usually it’s all for me been HTTP-driven… So a request comes in, we spin up the instance and deal with it, and then that instance at some point will die.
And actually, if that’s how you think about it… And you have to remember that one instance - requests from one user might go to one instance; the next request from the same user might go to a different instance. So if you imagine this sort of load-balanced environment like that, that has quite a big knock-on effect to certain design decisions about what you build, as you’ve mentioned, Steven… Which we can get onto later.
For me, it’s been really useful to be able to build a website, or a web service, or something and just put it into App Engine and not worry about it… And it sort of just keeps working. If nobody uses it, it’s fine. It doesn’t cost me anything.
I have one, gopherize.me, which is the service where gophers can create gopherized versions of themselves using Ashley McNamara’s artwork… That’s an App Engine thing, and that one sometimes does actually get quite a lot of activity, and I’ll go over my free quota and have to pay for it.
[00:11:52.12] I think that’s a very good point that you’ve made, around treating serverless functions like you would web services; you have stateless computes here, where you don’t know for sure where requests are gonna get routed, and you don’t maintain state on disk, you always externalize the state, because you don’t know what you’re getting… It’s a very good mindset to keep with serverless, because that’s very much the kind of use case you get; and even when you’re not doing something that’s directly HTTP, those events kind of come in in that same style, where you get an event, you get a request, and then you have to do some kind of response, you have to take some sort of action against it, so it more closely aligns with that use case.
Right. One of my favorite uses for using the serverless model is being able to react to things coming off of a queue. I’ve had projects where because some operations didn’t need to be synchronous, it’s not like you had a user sitting there, clicking something, waiting for some sort of response to come back - that traditional HTTP model - you could basically trigger something asynchronously… Maybe a user performs some action, and then you drop some sort of data, some sort of payload onto a queue, and something somewhere is gonna respond to it.
So that allowed some teams really - like a front-end team that is responsible for the user interface, the back-end that captures these events, and then basically dropping off into a queue that another team was responsible for writing functionality that picked up and processed it. Then they had that sort of asynchronous model, and it worked very well, both in terms of decoupling the concerns between what the front-end team needed to do and what the back-end team needed to do, but also in terms of showing a very good example of one of the types of event sources that you can have. It provides a lot of different ways that you can trigger business functionality that goes beyond just the traditional HTTP model.
But one of the things that we ran into - and there’s been other folks who have come out, through blog posts and whatnot, and sort of noticed the same thing as well… We talk about how the costing model for serverless - be it Lambda, or Cloud Functions or whatnot - because you’re not paying for idle, I think there’s this misconception that because so much of the marketing is focused on “Well, you’re gonna have so much savings, because you don’t have something that’s sitting there and just waiting for things, whether it’s being used or not, you can have so much savings, that you can just go haywire, go crazy with the serverless functions and whatnot.”
But one of the things that we quickly realized was that if you are going to adopt the serverless way, which forces you to think a certain way - you’re no longer in the land of monolith, where you have just one big codebase where you can see everything happening… If you start going down that path where you’re like “I have to make my functions very small, to do one thing and one thing only”, and then now you’re firing off this one small function that does that one thing, and you’re constantly firing that off, that could end up actually costing you more, depending on what it is that you’re trying to do.
There was a very good example in a blog post I remember that came out a few months ago, where the unit of work being fired off one per execution of a Lambda function - I think they were dealing with AWS Lambda - ended up costing them more than leveraging Go’s concurrency primitives, using goroutines for example… Whereby in one execution of the Lambda you could actually have multiple goroutines doing work in batch. That way you still have one Lambda execution, but you’re doing a lot more work in there - all the work being of the same kind; it’s the same type of work that you’re doing, so you’re not violating that “should do one thing, and one thing only” principle. You’re just batching the amount of stuff you’re doing in one execution. So the one-execution-per-unit-of-work approach ended up costing a lot more.
[00:15:55.08] This is one of the things where if you just drink the Kool-Aid, if you just buy it off the shelf just like that, and every time you wanna use a piece of functionality you just execute a Lambda, you might find yourself in some hot water. So I’m wondering, what are some of the gotchas that you yourselves have experienced along those lines…
I personally was thinking that Lambda is like a CGI model [unintelligible 00:16:14.12] All the optimization is basically around cold start - the startup time is actually really fundamentally important if you are promising some cost advantages… And one of the things that I really like about Google Cloud Run is they decided not to go deployment-per-function; it’s more like you’re handing off a server, a long-running process… Which still has a limited execution environment, and they can kill the server in 15 minutes, but at least you can bundle a bunch of things… So when you’re bootstrapping the server for the first time, at a cold start, it can actually serve multiple endpoints.
But then if some of the endpoints are never going to be used, then in terms of memory and CPU usage there’s some extra cost, right? There are always these pros and cons, but I like the fact that they’re giving you the option to bundle things together… So if you believe that some endpoints, or some functions, are going to be called really frequently, you can bundle them as one server, and each time you bootstrap, it’s just going to be one bootstrap serving 3-4 endpoints.
Yeah, I think Johnny made a very good point in terms of it’s not just a catch-all silver bullet… So I think in the same way that operators would traditionally decide on the instance size they would use, and things like that, they have to consider their workload. If your workload is 24/7, by the minute you’re doing a lot of throughput, then serverless may not be the right solution, because having a constantly on server of course will help with that, and even having cache and amortization there, you get to trade that off… But if you have a very spotty workload, and you need a good amount of scale and ability to run things in parallel, then yeah, serverless is beneficial… But you need to make sure that you’re doing that math and understanding how much throughput you’re gonna need from a system… And even compare it against just a regular compute instance and see if you could determine what the best approach could be.
Very good point. So along those lines - we’ve been talking about how different cloud providers have slightly different solutions… And there’s some commonality across all of them, but you’re starting to see some deviations. With Cloud Run, for example, from Google, you’re starting to see a differentiation there in terms of what the containerization model is… And correct me if I’m wrong, Jaana, this is your world… So I’m wondering - at some point, somebody’s gonna ask themselves, “Well, okay, every time I have to write a piece of code that is going to run as this serverless function, it’s gonna run somewhere - is it possible for me to write this in a cloud-agnostic way? Is it possible for me to not have to import some sort of third-party package, whatever the cloud provider’s package is, whatever that library is - is it possible for me to just write my functions in a way that I can run them on AWS Lambda, I can run them on Cloud Functions, I can run them on Azure…? Is there any way to have that?” I know there’s OpenFaaS as well, which is a project I came across the other day; it looked very promising.
So there’s all these different options… Is it possible really to write all of your functions in a cloud-agnostic way, and have them be deployed without really having different build pipelines, and different ways, actually having to import different libraries from different cloud providers? How easy is it? And is the cost of creating abstractions worth it?
[00:19:51.25] Can I ask a question…? I was really skeptical about the portability aspect of serverless in general, but in the end what I realized is that I just import a library, whatever, but it’s really a small piece. The function body - the reusable part - is actually just there; you maybe call like two lines from a third-party library… So it was not truly a big concern, especially if you organize things in a cloud-agnostic way, plus doing [unintelligible 00:20:21.27] at the end.
Again, this is my personal opinion, but we are now trying to reinvent all these different abstraction models that make serverless run everywhere, including on your own premises… But I’m questioning, is it really worth it to have that abstraction model, or is it just easier to swap those two lines, import a new library, and be good to go?
Also, the reusability of your functions is one thing, but I think the overall orchestration aspects and the configuration is another thing, which is definitely right now proprietary-based… And that’s another conversation to have. I think it’s easy to reuse your handlers, but how can you just spin up the same environment with similar naming, similar scaling properties and configuration on another cloud provider? I think that’s more difficult.
I like your question though, of questioning the premise of this… It’s a little bit like how we get very excited with the idea that later we could swap in a different database, when we’ve built these right abstractions… But why would you do that? And I’ve especially heard people try and say like “This is a MySQL database, but because of this abstraction, we could put a Mongo database later if we wanted to just switch it…”, and it’s like - well, they do very different things. I feel like we get excited about the possibility of that without really thinking about whether we’re ever gonna actually need to do it.
And that’s the other thing - you made that point, Jaana: these serverless functions are kind of meant to be small and lightweight, and so I think if you are gonna be moving over to a different provider, it’s a good opportunity to do a rewrite of some pieces as well, because that’s something we should probably be doing anyway as good practice… But yeah, it’s interesting to think of that, I think.
I think as the limitations change, you need to reconsider some of that. Again, ORMs were a thing, but in reality nobody swaps databases that way, because each time you’re changing your database, you need to almost rearchitect, at least your data layer. So I think it’s natural to ask “Is it really feasible to achieve portability?”
I think to those points though, the handler is rarely the interesting part of the serverless function. That’s just how the information comes in. But what you’re actually doing with the function is the piece that probably ties you to the cloud even more, right? So if you have a handler for S3 events, then you’re probably tied to the S3 API. If you have a handler for some kind of Google Cloud Storage event, then you’re reaching out and doing these other things with the Google Cloud API; so the handler is probably the easiest part to swap out, but all the other technical pieces in your codebase, related to the cloud-specific APIs and things like that, that you’re using to handle the event, are the things that are gonna be harder to switch out… And I rarely find that multi-cloud argument to be worth it in the end.
I remember back in 2012-2013, when everybody was talking about multi-cloud… It’s just a race to the lowest common denominator at that point, because you have to kind of standardize for whatever the lowest common functionality is, and it’s never kind of worth it, so… Having small packages that are easily rewriteable to swap out the vendor (or what have you) sounds like a better approach, because the core logic will remain the same; it’s just the APIs and how you’re getting the data that may be different.
I think this leads us to the next key topic here, which is how do you – in a very practical way, if I’m a Go developer, or rather if I’m a developer who happens to be writing some of the functions in Go, how do I set up or structure my Go project in a way that allows my business logic, my behavior to be cloud-agnostic, yet the entry point, say I have to import some sort of package from AWS or from Azure or whatnot…
Personally, I write my serverless Go projects the exact same way I write every other Go project, and here’s what I mean by that. In a regular long-lived service, I’ll write my package main, function main, my entry point - I keep that as light as possible. I don’t have a ton of stuff going in there. Maybe I’m reading some arguments from the environment, maybe I’m reading something from configuration, or from arguments being passed in, whatever the case may be. I don’t do anything different when it comes to that, with regards to serverless technologies.
And pretty much everything in my business logic - I’m not gonna bring in those third-party dependencies… Take S3, for example - I’m not gonna bring that into my business logic. I’d rather create some sort of local interface for that behavior, for that functionality, that an S3 implementation can satisfy. I’m not gonna bring a DynamoDB package into my business logic. I’m gonna write a local interface that whatever DynamoDB implementation I’m passing in is going to satisfy.
So to me – I don’t do serverless programming any different than I do any other kind of Go programming… Which I think is the biggest point that I can seek to put forth here - the best practices you know about Go development don’t go away the moment you start doing serverless work. You should strive to abide by those same exact principles and best practices that we talk about for any other kind of Go project. Mat?
Yeah, I think that is a great lesson for anyone who hasn’t got much experience with serverless; I think that’s actually quite a key point there. What we’re saying is that yes, there might be changes in behavior, and you might do things differently in your code in the serverless environment, but those things are good things to do anyway, for their own sake. That’s quite encouraging - and actually, with App Engine, until the recent release, you did have to do things slightly differently, so you were forced to create some abstractions that you might not be happy with, or make other changes to your project. They’ve changed that now, so as I said, you just deploy it as package main; that’s what you’re deploying. And there are a few things that follow on from that.
[00:28:23.22] For example, Jaana, you mentioned the cold start thing. This is where there’s no instances running, the first request comes in, and it has to do some work to get the instance up and running… And you want that to be quick. You want that to be as quick as possible. That might mean you would defer some setup for certain handlers until later, and things… And I do these sorts of things as well, even though in some environments it might be that I deploy a server once and it’s a long-running server, so I’m not really getting the benefit, but still, I think it’s good practice. So that’s one example.
I use sync.Once, from the sync package, with handlers, and that allows me to make sure that I only do the setup the first time a request comes in, and only do it once, atomically. Since every request gets its own goroutine, if you are receiving multiple requests it’s possible that you might be trying to do multiple setups on the same thing, but sync.Once avoids that. That’s just one example.
I definitely agree. I think serverless, to a certain extent, is just a deployment detail for the code that you’re writing. Even for me, when I’m writing serverless functions, I don’t try to do anything Lambda-specific or what have you until the very end, when it’s almost time for me to deploy it, because even locally, I treat my Go file as just a binary that I build, passing in a JSON file from my local file system to mimic the event that’s coming in…
So I’ll do my full testing cycle, I’ll do everything that I need to do on my local machine, and then when it comes time for me to get ready to deploy it, I’ll swap out that handler that was reading in that file to be one that reads in an event from AWS… But the rest of the workflow is the same. There’s nothing specific to serverless that you have to really do in your codebase to get it to work… And I feel like maybe people are intimidated by serverless because they hear these terms and they don’t really understand what it actually means, but there’s nothing really different, like Johnny said, to just standard Go development, or any development. It’s just about how you’re getting that event and how you’re processing it.
And I think that’s one of the main reasons why the cloud providers wanna provide an idiomatic experience at the end of the day - because as you put up more barriers, in terms of having to learn new organizational tips and tricks in order to push to serverless, that kind of goes against the serverless model. The main idea is you should care about your business logic, you should be able to use your existing tools, and deploy things easily, and maintain things easily.
One of the things that I’ve experienced myself is usually I think organization-wide tips apply to serverless, but it also depends on - as Stevenson says - serverless is about deployment… So it really changed the way I organize my modules. I would bundle things together if I’m going to deploy them together; in terms of maintaining dependencies, I wanna make sure that they are represented by the same module file… Those are the only differences I’ve experienced myself. Otherwise, I can apply everything else to serverless programs.
Another one is global state. We talked about how these things should be stateless, and global state is worth avoiding in Go altogether, I think in almost every case. Global state, for anyone not sure, is essentially variables in the package space. There are lots of examples in Go where we see that, by the way, including plenty throughout the standard library, and there are trade-offs… It can be simpler - you just write a main function and you’ve got some variables in global space - and for tiny little programs or scripts, essentially, I can see why people would use them… But it really hurts testing, and it introduces other bugs that might be difficult to find and solve…
[00:32:23.16] So it’s not just “Don’t use the local disk, because the next instance might not have access to that same disk”, but “Don’t use the same memory. Don’t use global memory.” Don’t assume that an instance is gonna have that same memory over any length of time. Those kinds of things, again, are just good practice generally, too.
Yeah, you definitely shouldn’t assume that. This might get into maybe some advanced topics, but once you fully understand the trade-offs and things like that, there are certain use cases where - because there is the possibility that you’re reusing the same container - you could maybe check whether you are in an existing container, and optimize for that, if you do have long startup times, or startup times that are gonna increase your latency… But that’s, like I said, a more advanced topic, once you understand what you’re dealing with and how these instances may live or die, and may come and go.
That’s a really interesting point, actually… And I do wonder whether – those sorts of optimizations usually involve some kind of complexity in the project, in the code…
…and of course, they might make sense at one point in time, but then over time they may stop making sense, and things. So that’s a very interesting thing that you have to also bear in mind - keep checking the architecture that you end up with, and make sure it stays relevant, and things. And don’t optimize too early, that’s the other thing.
Something that you’ve mentioned actually bit me in the inverse - you’re saying “Don’t do global state”, and things like that; but just because it’s serverless doesn’t mean that you’ve got a fresh, clean starting environment. I had a project where I was pulling files from S3 and processing them, and I wasn’t cleaning up, because it’s serverless and it’s just gonna get rid of the container… And then I started getting failures after maybe 30 or so runs, and it’s because I filled up the disk on the execution environment, without thinking that “Hey, maybe we’re getting reuse if the code is in a hot path and it’s continuously using that same execution environment.” So cleanup is still important, and unsetting global state, if it is problematic, is still important, because you might get that same exact container back, and it might be problematic.
That’s interesting. And I suppose there’s also security implications there too, if you’re pulling data from one customer, and then you get the same instance and you’ve not thought about it… That is a very good point; I’m so pleased you agreed to do this, Steven. I think you might have just saved my life. [laughter]
And Stevenson, you mentioned something interesting about testing, and how you test. Perhaps this might be a controversial statement on my part, but all the work going on right now, the way you do testing with regards to serverless - honestly, I don’t think it’s there yet. The friction is just too much. I don’t have the confidence to be able to test the entire setup locally, which is why I depend very heavily on unit testing, on invoking or simulating the invocation with the right JSON payload… Basically, I’m trying to code the way I code any other Go application, as much as possible.
But then there’s something to be said for doing some sort of integration-level testing. At what point do you cross over into saying “You know what - let me now assume that I’m gonna have some real event coming from some source other than my local development environment”? At what point do you cross that threshold, and when do you do that sort of integration-level testing?
[00:35:51.26] For me even the case where you are getting events that are coming from some source, those events are very well-defined and adhere to a certain schema, right? You could maybe test the variability of the different data that you get back from those events, but like I said, I do straight up JSON files on my local file system, and I assert that the output is what I expect, or I assert that the event takes place the way that I expect, but I very much come from that hacky sys admin background where I’m writing Bash scripts and I’m testing right inline, and I’m making sure that the desired output is the true proof that the code works…
And for Go at least, I won’t change my testing very much; because I’m writing things locally and because I can still run it just as a straight-up binary, I’m doing the _test files and I’m testing the things that I’d normally test at the function level, but overall it’s just kind of an integration-style test where I assert that I’m getting back what I expect to get back before I go for it, try to deploy it and see what happens.
Can I ask a question? …before we talk about testing. How do we develop serverless apps? Given that cloud is a thing, and it’s just impossible to emulate – the development stack is just becoming so frustratingly complex… I find it so hard to keep the similar environments in my development environment. I think serverless is just adding yet another big burden, because it’s just far too abstracted away. The only way that you can emulate it is just basically running the thing in the cloud provider. So what is your strategy when it comes to development?
I don’t believe in emulating this full environment, because like you said, it is complex, it’s multi-variate, there’s so many things there… So I literally run it in a test account, I’ll reach out to a test S3 Bucket, I’ll reach out to a test Dynamo table and I’ll do that full exercising, because that’s the only thing that’s truly gonna test that code path that you’ve written for. You have to actually reach out to these APIs to find out certain things, and I think that’s the right level of testing given the amount of effort that you put into these functions. You’re trying to keep them small, you’re trying to keep things pretty fast-moving, and setting up a full mock environment just to do that seems like overkill, in my opinion.
And unit tests are more useful. The ones that you’re describing, Steven, sound kind of like unit tests; if the serverless is the unit, then you’re passing something in, and making sure what you get out is what you expect - that’s kind of a unit test. And for me those are the most useful tests, because if something does go wrong, they point like a laser, they point to what went wrong; ideally, only one test will fail if there’s a new bug, or something… And then you’re drawn straight to it.
But I know that Monzo - which is a bank - is written in Go; everyone should check it out, by the way. I think what they’re doing is really cool, and not just because it’s in Go… But I know that they have in-production testing; kind of like canary testing, where they will actually simulate real behavior in their production environment. And these tests are just running continuously, and they’re supposed to be capturing metrics and things, and checking to see how the system is performing, and all kinds of things at the same time.
Are they replicating some of the requests in their testing environment? Is it kind of like a canary, but before it actually becomes a canary?
No, but I know that they probably – I mean, I think any mature project probably has quite a big testing story… But in this particular case I heard Matt Heath talk about it; he’s one of the engineers at Monzo. You can find him online, actually; he speaks about this subject very well.
They actually literally simulate real users using their bank cards, and transferring money to each other, and doing all these things that people really do. They probably do have another environment that they put code in before, and run the same set of tests, but… Yeah, it’s probably part of a wider testing strategy, for sure.
[00:39:54.03] And talking about testing in production, one thing that I’ve found to be absolutely critical is getting the right logging level and getting the right amount of information out of your function, because you don’t have a nice server to SSH into, you don’t have all these kinds of debugging tools that you would have in a traditional environment… So making sure your code is observable, making sure that you’re logging, making sure you understand the execution path that your code decided to take - that’s important for debugging and quickly understanding where things might have gone wrong.
Along those lines, there are two things I wanna touch on. Mat touched on the point that you have a function - that is a unit, representing some piece of work, some piece of business logic - but there are obviously systems that are built that rely on multiple invocations of multiple different functions. There’s this orchestration that you need to introduce into your environment in order to get the right things to happen in the right sequence. Anything non-trivial goes beyond the one-function sort of thing; whenever you need to have more than two or three, you’re gonna need some level of orchestration.
The current best practice is that you shouldn’t have one function invoke the other. If you have this chain on down, especially if there are synchronous invocations, then the first one is kind of waiting for all the other ones to finish, and you run the risk of hitting a timeout, and then your request just gets dropped on the floor. So there are lots of gotchas, do’s and don’ts on that side.
We’re gonna swim back around to my other point, which is really around the debugging story - which, in a highly-distributed environment like serverless applications, requires a lot of extra infrastructure around it: logging, metrics, tracing, all that stuff. We’re gonna swim back around to that… But I wanna know - how do you handle orchestration of multiple functions when you need to get something done?
Shout-out to Step Functions in AWS. I’m not sure if Google has something equivalent in GCP… But that really opened up a whole new world for me with regards to chaining serverless things together to make one big, cohesive unit. Before Step Functions it would very much be that use case where you described, Johnny, where you’re calling other functions or passing things via a queue, or passing things via some other traditional mechanism… But with Step Functions, you’ve got the ability to have – the output of one function becomes the input of the next function, and you’re able to chain it in that way…
It’s still decoupled, it’s still passing things, but kind of in a way that you could look at the transition between states and see what the payload was, and understand what that next function got, to the point where you could have two completely different functions and they’re doing the things that you expect them to do, because you’re looking at the payload, and you could test each independently with their own respective payloads, and as long as you make sure that the previous function output what you desired, then you’re in that good place where you’re getting the best of both worlds, where you’re getting that kind of synchronous execution from the outside, but internally it’s asynchronous and it’s decoupled from one another.
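A rough sketch of that chaining idea in plain Go - the state names and payload types here are made up, but the shape is the Step Functions one: each step's output type is the next step's input type, so every transition is a visible payload you can inspect and test independently:

```go
package main

import "fmt"

// Hypothetical payload types for a two-step workflow; in Step
// Functions each of these would be the JSON document handed between
// states, visible in the console at every transition.
type RawUpload struct{ Key string }

type Validated struct {
	Key string
	OK  bool
}

type Stored struct {
	Key      string
	Location string
}

// Each step is a separate function whose output type is the next
// step's input type, so each can be tested with its own payload.
func validate(in RawUpload) Validated {
	return Validated{Key: in.Key, OK: in.Key != ""}
}

func store(in Validated) Stored {
	if !in.OK {
		return Stored{Key: in.Key} // failure branch in a real workflow
	}
	return Stored{Key: in.Key, Location: "archive/" + in.Key}
}

func main() {
	// Locally you can chain the steps directly; deployed, the state
	// machine performs this same hand-off between separate functions.
	out := store(validate(RawUpload{Key: "report.csv"}))
	fmt.Println(out.Location) // archive/report.csv
}
```

The design win is exactly what's described above: as long as each step honors its input and output payloads, the steps stay decoupled and independently testable.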
It’s interesting, Chris James in the Slack channel earlier mentioned the environmental cost, i.e. literally the green cost of serverless versus just having our own projects, and things… That’s the first time I’ve even considered that. I suppose I assume that the idea is there’s this shared resource, and that resource is there doing things, it’s ready to do things anyway… And we’re in theory all taking a piece of that and then just paying for a little section of it. So it feels like it ought to work from a green perspective, but actually I don’t know.
I’m also assuming that AWS, for example, is running a lot of their compute for serverless on their spot fleet, or the fleet of systems that are unused at that time, and it’s just an optimization on their part in order to make sure they’re getting maximum utilization from the set of servers that they already have running. I doubt they’re spinning up brand new server hardware in order to run these serverless functions. I think they’re very much making use of that additional capacity they have to run those micro VMs.
[00:43:54.23] Maybe initially that was the case, but I wonder now… It depends how big AWS is as a business to Amazon… Because maybe they are now spinning up compute to sell; I don’t know. I’ve ruined the conversation.
[laughs] Yeah, I’m not sure. I’m gonna have to look into that, but… Because I know they do Spot, and things like that, it’ll be very unlikely that they’re not making use of that spare capacity to kind of provision for these micro VMs.
Yeah. Well, that’s the promise - or one of the promises, at least - of just the point of sharing this infrastructure.
Now we can swim back around to the other topic of debugging serverless applications. Honestly, that was the other big revelation to me when I first started doing serverless work… Which is basically like, okay, in a traditional monolith - like Stevenson was saying - if you really had to, you could SSH into a box and get all your logging; everything that was part of a request, you could see it all there. Granted, in some places, when you SSH into a box, I think you should jettison that box; I think that box should not go back into your fleet, but that’s beside the point.
But that old model, of being able to go see everything in one place - with serverless, that’s gone; that’s out the window. And one could argue that even in a high-scale distributed system, when you have long-lived instances, you’re not guaranteed that the request is gonna go to one instance either. So you kind of still have the same problem there, but with the serverless model you kind of don’t have a choice. If you have a situation where you’re orchestrating multiple serverless bits and pieces - something’s writing into storage, something consumes something from a queue, something communicates with a third-party service, whatever the case may be - if something goes wrong and a user makes a request…
Maybe you have an API that fires off a Lambda as a result of this invocation, and then you’re touching 3-4 other different things before some sort of response goes back to the user. If something goes wrong somewhere in there - say the user gets a 500 error, or something - where do you start? How do you even reason about this highly distributed environment where nothing is in the same place, none of it is all in the same place? How do you go about that?
[00:48:39.05] I have so many opinions on this topic, but I can say that – I think we usually start with the networking layer and distributed tracing, to sort out which specific service the problem is coming from… And I think cloud providers are doing a good job, but not a perfect job, when it comes to instrumenting things and exposing similar data signals.
I think the biggest problem is that as a user I won’t be able to see the trace end-to-end. And the cloud provider contributes a lot to that, because some of the traces will come from storage, some of them will come from load balancers, and so on… You are kind of somewhere in between. Sometimes the cloud provider is having an outage, not you.
We had this particular problem, and our customers were calling us: “Hey, your services are down, or something. We’re having difficulty keeping this SLA…”, and we constantly had to go back and debug their services… But we realized that if we consistently output a signal - at least a distributed trace - to orient the user as the initial step, that would be the optimal thing.
And then I think once you figure out your service, you can just go dig and look at the other signals, like logs and so on. But I think as an industry we are taking baby steps at this point in terms of diagnosing, or at least navigating the user or the cloud provider to where the problem is coming from.
What I’ve also found with regards to Lambda Functions, and things like that - typically, I’m using my error blocks in Go to make sure that I’m outputting as much detail as possible, as close to the source of the error as possible… But traditional things like correlation logging and things like that, with tracing, making sure – if you have multiple serverless functions that are tied together for a specific workflow, if you have the same correlation ID across that specific workflow, at least you can paint a holistic picture as to what happened throughout that thing…
And even for request/response cycles for HTTP for Lambda Functions, for example, you’ll have situations where you failed in a way where you can’t necessarily provide a response back to the caller, and it’s important that you dump out as much detail as possible… Because they’re gonna get a 500, but somewhere under the hood something went wrong that you could have logged. So instead of just dumping out that function altogether, logging that kind of detail makes it easier to get back to what caused it.
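A minimal sketch of that correlation-ID logging, using only the standard library - the field names and the `logWith` helper are assumptions, not any particular logging framework. Every line carries the same `correlationId`, so logs from different functions in one workflow can be joined on that one field, and detail is logged right at the error source even when the caller only ever sees a 500:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// logWith emits one structured JSON log line stamped with the
// correlation ID, so lines from every function in a workflow can be
// joined on that one field in whatever log backend you use.
func logWith(corrID, msg string, fields map[string]any) {
	entry := map[string]any{"correlationId": corrID, "msg": msg}
	for k, v := range fields {
		entry[k] = v
	}
	json.NewEncoder(os.Stdout).Encode(entry)
}

// handle logs as close to the error source as possible: the caller
// may only get a 500, but the detail is already in the logs.
func handle(corrID string, amount float64) error {
	if amount < 0 {
		logWith(corrID, "rejecting charge", map[string]any{"amount": amount, "reason": "negative"})
		return fmt.Errorf("%s: invalid amount %v", corrID, amount)
	}
	logWith(corrID, "charge accepted", map[string]any{"amount": amount})
	return nil
}

func main() {
	// Both lines share correlationId "req-42", even though in a real
	// workflow they could come from two different functions.
	handle("req-42", 10)
	handle("req-42", -1)
}
```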
One interesting thing is you have to have some sort of instrumentation already in your function, and it’s just hard for people to determine what to instrument… I think we’re not doing enough work in terms of maybe investing in post-mortem debugging, and that type of stuff, so you can at a later time just go and put a breakpoint, just get a snapshot of the existing instance and take a look at some of the variables, or something like that…
But yeah, I think we’re baby-stepping… There’s absolutely no easy way to navigate to the problem, absolutely no easy way to correlate with other signals. And it’s an organizational problem also. When I was working on the instrumentation team at Google, we had to collaborate with all these 50 different products, and everybody has a different mindset about what instrumentation should be, and how it should be… And there are very few standards in this area, which is also not really nice… Because Google is doing its own thing, and then you go to another cloud provider and AWS is doing their own thing… We can’t really participate in each other’s traces, we cannot correlate, and from a user perspective this is terrible.
[00:52:29.03] Even though there are not any standard approaches, would you say doing common things like correlation logging, for example - we go to the most fundamental level, right? You end up with a log somewhere, whether that’s CloudWatch Logs, whether it’s [unintelligible 00:52:39.26] or whatever provider you use, there’s some kind of fundamental things that people can do, even if they have to kind of roll their own solution in order to trace back and understand what happened with the execution.
Yeah, when I say standards, I basically just mean the trace ID is standard… Which isn’t actually happening right now. Maybe in a couple of years we will be able to understand each other’s trace IDs. It’s so fundamental, because you correlate everything with a trace ID. At least we’re doing that.
Yeah, that will be great when that happens, when it’s all unified.
Yeah. Distributed tracing is such an organically-grown tool, I think… There was no discussion between providers for a long time, and then all of a sudden people realized that not having a standard actually works against distributed tracing, because everyone is trying to compete with each other, and we can’t really go to the infrastructure teams or cloud providers and ask them to implement this propagation format.
The lack of consensus is actually part of why distributed tracing is not becoming a mainstream tool… So everybody got together, almost two years ago, to draft a proposal, and the proposal is now becoming more mature; it’s going to be more of a standard under the W3C. There’s going to be a first-class header that everybody recognizes, and it’s going to be super-nice, because you can just go to MySQL and say “Hey, honor this header”, or something, in some way… You can basically just go to any infrastructure tool and ask them to do something about it… At least pass it along, so the trace is not broken.
That sounds really awesome. Is there anywhere we can read more about that kind of thing? Is there a proposal that’s currently circulating?
Yeah, there’s a repo… I can maybe share the repo.
Would you be, by chance, talking about the OpenTelemetry stuff?
I could talk about that as well. This is a different initiative. OpenTelemetry is more of like an instrumentation library project. This standard is a wire header format standard. It’s under github.com/w3c/trace-context. You can read the proposal, and there’s already a discussion and some implementations for some languages. That’s going to be the overall standard in a couple of years.
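For a concrete feel of the proposal: a `traceparent` header in the draft W3C trace-context format is four dash-separated lower-case hex fields - version, trace ID, parent span ID, and flags. A small sketch of parsing one (a real implementation should follow the spec's full validation rules; this only checks the field shapes):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C trace-context "traceparent" header of
// the form version-traceID-parentID-flags, where the trace ID is 16
// bytes (32 hex chars) and the parent span ID is 8 bytes (16 hex
// chars). Even infrastructure that does nothing else with tracing can
// keep a trace unbroken just by passing this header along unchanged.
func parseTraceparent(h string) (traceID, parentID string, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", fmt.Errorf("malformed traceparent %q", h)
	}
	return parts[1], parts[2], nil
}

func main() {
	h := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	traceID, parentID, err := parseTraceparent(h)
	fmt.Println(traceID, parentID, err)
}
```

This is what makes the trace ID the universal correlation key mentioned earlier: any service that receives the header can stamp its own logs with that same trace ID.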
That would be huge.
Pretty much every distributed tracing vendor, including cloud providers like AWS and Google, is actually contributing.
That’s huge, yeah. Awesome. So we’ve been talking about the technical pros and cons, some of the challenges, some of the things that you need to watch out for… I feel like we’ve been more sort of telling our cautionary tale than anything else. I think we all agree on the panel here that serverless - I’ll use that term because I can’t think of a better term; it’s a marketing term, but I can’t think of a better one to encompass all the things that make up serverless… But we know in general that it’s a good thing. It gives more options, more ways to build the right abstraction into your infrastructure, into your world, whatever business problems you’re solving… But from an opportunity standpoint, for Go developers, what is the draw? Why should I invest time in learning how to do serverless?
Perhaps you work somewhere where the only provider you’re allowed to use is GCP, or maybe it’s only AWS; why should you spend time learning any one vendor’s approach? Or even if you wanna go cross-vendor, why would you wanna invest the time and effort into learning the right way to do serverless? Because it’s not just about the syntax, it’s not just about the code; it requires a different way of thinking. It requires you to learn a bit more about building these kinds of distributed systems. Why, as a Go developer, would I want to invest this time?
[00:56:40.02] I think a lot of modern developers end up doing a lot more glue work and stitching work than just straight-up development… Because traditionally, there were systems that you would have to write yourself, but because they’re being abstracted and they’re being written for you and they become kind of provider-driven, you’re doing more stitching work nowadays, you’re doing more glue work.
And running infrastructure just to do glue work is kind of demoralizing, and you kind of have to maintain these things… But I’ve really found that’s the sweet spot for me with serverless - being able to write all these integrations, write all this glue work, but have that infrastructure also be that thing that’s abstracted away, so that these systems flow as if it’s a pure vendor solution, without having to run your own underlying hardware or your own underlying instances, and things like that. I really think that’s what makes it worth it. If you think about your own workload, you’ll find that you’re writing a lot of glue layers for things, integration layers and glue layers… So I think that’s definitely a good reason to learn it.
Also, I think it helps you practice stateless programming, and building these distributed applications without having to get purely down into the nitty-gritty of building distributed systems, and things like that. So it’s a good entry point to understanding how these things start working together to form wider systems that are achieving a common goal.
Yeah, I echo that. For me it’s about – if I have to write that glue, I don’t really know if I’m doing it the right way. It’s an extra kind of discipline, or something; I could make some silly mistakes, and that would cost me a lot of time or something else later.
I feel like serverless is kind of an empowering thing for a developer - you can focus on the bit that makes what you’re doing special, and leave the plumbing to somebody else. That’s why I like it. It feels like I’m empowered, and I don’t need to go and seek out help just to do things that are really secondary to what I’m actually trying to do, or what I’m focused on.
For me it’s more about productivity. It’s a limited environment, but if it matches what I need… And why would I even have to care about all the lower-level infrastructure? I would just push things and pay as I go.
I think that’s what the promise behind cloud was initially. So I would start there, and if I need fewer limitations, I can always float back down to the lower levels. For me a good starting point is having a more opinionated, maybe more limited environment, delegating some of the work to my cloud provider, and then going beyond that, down to the lower levels, if I need to.
Agreed. In the channel - let me see if I get… I don’t wanna mispronounce his name - [unintelligible 00:59:28.26] He mentioned in the GoTime channel on Slack that learning how to do serverless is actually a good way to learn how to build non-serverless systems better, as well. I totally agree with that, because I’ve noticed a certain level of concerns that I’ve started to have since I’ve begun doing serverless work, that I traditionally didn’t have. Things like “Oh, if I can make this stateless, then I don’t have to have sticky sessions, I don’t have to have…” There’s a lot of different things, a lot of different concerns – things that I used to take for granted back in the monolith days (the deployment model) that now I’m more concerned about. I’m sort of making deliberate decisions about whether to have this, or not have that…
[01:00:16.09] And again, with microservices - I guess you could call serverless “nanoservices”; we won’t get into buzzword soup, but… That really affects how you approach building these back-end systems. For me that’s the big takeaway here - you learn to solve these problems without worrying too much about which vendor is going to solve the problem from a business standpoint, and let the deployment model - whichever vendor you use - be perhaps not the very last concern, but don’t let a vendor provide the box in which you have to build these things. You develop your wares, and then you worry about “Okay, how do I deploy this thing that already accomplishes the business functionality that I need?”
Anything else we wanna add to that before we wrap this up? I think we’ve gotten deep in some areas, and shed some light on some others…
I wanted to just shout out one more gotcha with serverless in general… The massive parallelism that you get from serverless can also be something that catches you out. For example, Lambda can scale out to - I think - 1,000 concurrent executions, and your poor database on the back-end tries to handle those requests, and you max out your connections…
So be mindful… Traditionally you’d have connection pools and things like that, where you limit those things, but now, because you’re in this massively parallel execution environment, you may have 1,000 connections all of a sudden stampeding against your database.
So just understand those concurrency models and things like that, and make sure that you’re accounting for them when you’re reaching out to your resources and your environment, because they could come back and bite you.
Awesome, awesome. Well, this has been a very enlightening show. I’ve learned some things, and I’ve been doing serverless for a while, so I hope this was great for you, the listener. A big thank you to Jaana for coming back… We missed you, Jaana, and we’re glad to have you back on the show.
Mat, Mr. Mat Ryer - I borrowed your accent a little bit; I’m not sure if you noticed, but I’m trying to sound as cool as you…
It was actually the only bit of yours I could understand.
[laughs] Nice… Nice. And a big thank you to our special guest today, Stevenson Jean-Pierre, my fellow Haitian-born.
Thank you for having me.
Alright, it’s been great. And for those of you who are listening live, hopefully this was useful to you as well. We love the participation in the Slack channel, keep it coming.
If you have show ideas as well, you can absolutely go on the Slack channel, GoTimeFM, and recommend some shows, and we’ll take them on and do our best with that. Also, thank you to the man behind the scenes… You haven’t heard much from him, but Jon Calhoun has been doing some of the technical work to make sure this podcast gets recorded properly and everybody sounds good… So thank you, Jon.
The reason we haven’t heard from him, by the way, is because he said “I don’t know anything about serverless. All I do is use DigitalOcean and the Google Cloud platform.” It’s not that he really doesn’t know about it, because he does; he’s been doing it, and didn’t realize.
That’s the point. [laughter]
Awesome, awesome. Well, thank you so much for listening, and we’ll catch you on the next GoTime.
Our transcripts are open source on GitHub. Improvements are welcome. 💚