In this episode Matt, Bill & Jon discuss various debugging techniques for use in both production and development. Bill explains why he doesn’t like his developers to use the debugger and how he prefers to only use techniques available in production. Matt expresses a few counterpoints based on his different experiences, and then the group goes over some techniques for debugging in production.
Matthew Boyle: Actually, I spent some time looking at this today… So one of our teams is responsible for running our internal CI system, and there was an incident where our CI system stopped receiving job requests for a little period, due to a network blip. So effectively, engineers were trying to [unintelligible 00:45:33.02] to build, and it wasn’t triggering the build. So what we did was we restarted the system, and it meant that folks were all effectively trying to get their builds scheduled at once… And this is known as the “Thundering Herd Problem”: something fails, and then everyone just clicks Retry a million times.
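To make the retry storm concrete: the standard client-side mitigation for a thundering herd is exponential backoff with jitter, so retries spread out instead of arriving in lockstep. This is only a minimal Go sketch of that idea; triggerBuild, the attempt count, and the timings are invented for illustration and are not Cloudflare’s actual CI client.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// triggerBuild stands in for the real call that schedules a CI job.
// Here it always fails, just to exercise the retry path.
func triggerBuild() error {
	return errors.New("CI is unavailable")
}

// retryWithJitter retries triggerBuild with exponential backoff plus
// random jitter, so a crowd of clients doesn't retry in lockstep.
func retryWithJitter(maxAttempts int, base time.Duration) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := triggerBuild(); err == nil {
			return nil
		}
		// Exponential backoff: base * 2^attempt, capped at 30s.
		backoff := base << attempt
		if backoff > 30*time.Second {
			backoff = 30 * time.Second
		}
		// Full jitter: sleep a random duration in [0, backoff).
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		fmt.Printf("attempt %d failed, retrying in %v\n", attempt+1, sleep)
		time.Sleep(sleep)
	}
	return errors.New("giving up after max attempts")
}

func main() {
	_ = retryWithJitter(5, 500*time.Millisecond)
}
```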
[00:45:53.20] So we had our sort of steady job load coming in, we had lots of people retrying jobs, and we had other systems trying to interact with us during the 15-20 minutes of downtime we had… This system normally operated at between 60% and 80% CPU, but because of all this extra traffic coming at it, it all of a sudden went to nearly 100%, which meant that all the jobs already in the queue were starting to slow down, while new jobs were still coming in. We had so many builds going on at once that we ended up in this really difficult situation where we were struggling to process what we already had, which was slowing the system down, but more jobs were still trying to get in at the normal rate… So you end up in this really difficult situation where you have to either scale vertically or horizontally, or make a decision to rate-limit or pause the jobs coming in, so you can process what you’ve got and then proceed. So that’s just one example.
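The “rate-limit or pause the jobs coming in” option can be sketched on the server side with golang.org/x/time/rate: admit new builds at a fixed rate and ask everything else to come back later, so the queue that is already in flight can drain. This is an illustration of the technique, not the real CI system; the handler name and the numbers are assumptions.

```go
package main

import (
	"fmt"
	"net/http"

	"golang.org/x/time/rate"
)

// Admit at most 10 new jobs per second, with a burst of 20, so the work
// already queued can drain instead of being buried by retries.
var jobLimiter = rate.NewLimiter(rate.Limit(10), 20)

// enqueueJob stands in for whatever actually schedules a build.
func enqueueJob(w http.ResponseWriter, r *http.Request) {
	if !jobLimiter.Allow() {
		// Over the limit: ask the client to come back later rather
		// than letting the backlog grow without bound.
		w.Header().Set("Retry-After", "5")
		http.Error(w, "too many build requests, retry later", http.StatusTooManyRequests)
		return
	}
	fmt.Fprintln(w, "build scheduled")
}

func main() {
	http.HandleFunc("/builds", enqueueJob)
	http.ListenAndServe(":8080", nil)
}
```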
I do have another example from maybe a more customer-facing thing as well… But I worked on a project, it was one of the last projects I worked on before I became an engineering manager, actually, and it was called crawl hints. So you can type in Cloudflare crawl hints; I’ll share a blog post. And it was a project I worked on that I was really proud of. But effectively, what we did is we’d take signals from internet traffic, and then we used that to push information to various search indexes, to let them know that something might have changed in a website.
So if you think about it, before we built crawl hints, what happens is you have all these bots that go around the internet, scraping the internet, looking for changes to content to decide how to rank it on a search engine. And we worked with some of our search partners and we were like “Well, that’s a really inefficient use of resources.” You’ve got all these bots, all over the world, scraping the internet. What if instead we pushed information to them as websites change? So that’s what we did. We built that. Cloudflare is in a unique position to be able to give information about when sites change, so we started to build that.
So what we did is we took a bunch of information from our edge, and we forwarded it to a Kubernetes cluster, and we began to process this information to figure out what was fresh, and then push it to the search engines. So the service I wrote was in Go, but what was really interesting is we kind of did this on a polling loop, for various reasons that are even more confusing… These search engines have rate limits, right? We can’t push too much information to them within a given period; we’ve got to throttle ourselves a little bit. So what we were doing is we were pushing this information into Redis, we were storing it there for a bit, and then we were pushing it to the search engine.
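Roughly, that buffer-then-throttle shape could look like this in Go: changed URLs land in a Redis list, and a loop drains it no faster than the partner’s rate limit allows. This is a sketch under assumptions, not the actual crawl hints service; the queue name, the rate, and pushToSearchEngine are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
	"golang.org/x/time/rate"
)

// pushToSearchEngine stands in for the real partner API call.
func pushToSearchEngine(url string) error {
	fmt.Println("pushing hint for", url)
	return nil
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Throttle ourselves to the partner's rate limit, e.g. 5 hints/second.
	limiter := rate.NewLimiter(rate.Limit(5), 1)

	for {
		// Block waiting for the next changed URL pushed into the
		// "crawl-hints" list (the name is made up for this sketch).
		res, err := rdb.BLPop(ctx, 5*time.Second, "crawl-hints").Result()
		if err == redis.Nil {
			continue // nothing queued right now
		}
		if err != nil {
			log.Printf("redis error: %v", err)
			time.Sleep(time.Second)
			continue
		}

		// BLPop returns [key, value]; the value is the URL.
		url := res[1]

		// Wait until the limiter allows the next push.
		if err := limiter.Wait(ctx); err != nil {
			log.Printf("limiter: %v", err)
			continue
		}
		if err := pushToSearchEngine(url); err != nil {
			log.Printf("push failed, requeueing %s: %v", url, err)
			rdb.RPush(ctx, "crawl-hints", url)
		}
	}
}
```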
So we had these situations where the system I was running was really spiky. You’d basically have no CPU usage, and then all the CPU usage. And we traditionally have things like the Horizontal Pod Autoscaler set up on our Kubernetes pods, which is a very fancy way of saying that if various things start to go up, the pod gets scaled horizontally. So we basically had this thing where the pod was bored, it was bored, it was bored, and then all of a sudden it burst to life and had all this traffic and all this work to do… And it was producing all these confusing signals and metrics and graphs. So instead of just letting the Horizontal Pod Autoscaler scale, or keeping a whole bunch of CPU available, it was about trying to tune the workload so that we had the right amount of CPU to do the job we needed to do, without having to scale it up and down continuously. And it was that trade-off between “What’s the easy thing to do, and what’s the right thing to do here to use a limited amount of CPU?” Because as we said, we don’t have infinite CPU.
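One way to express “use a fixed amount of CPU rather than scaling up and down” is to drain the burst with a capped number of workers, turning a short spike into a longer, flatter run. This is a minimal Go sketch of that trade-off; the worker count and the process function are placeholders, not the real service.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// process stands in for the per-item work (checking freshness, etc.).
func process(item int) {
	time.Sleep(10 * time.Millisecond)
	fmt.Println("processed", item)
}

func main() {
	items := make(chan int, 1000)

	// Simulate a burst of work arriving all at once.
	for i := 0; i < 200; i++ {
		items <- i
	}
	close(items)

	// Cap concurrency at a fixed number of workers sized to the CPU
	// reserved for the pod, instead of letting the burst fan out and
	// trip the Horizontal Pod Autoscaler.
	const workers = 4
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range items {
				process(item)
			}
		}()
	}
	wg.Wait()
}
```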