In this episode Matt, Bill & Jon discuss various debugging techniques for use in both production and development. Bill explains why he doesn’t like his developers to use the debugger and how he prefers to only use techniques available in production. Matt expresses a few counterpoints based on his different experiences, and then the group goes over some techniques for debugging in production.
Sponsors
FireHydrant – The alerting and on-call tool designed for humans, not systems. Signals puts teams at the center, giving you ultimate control over rules, policies, and schedules. No need to configure your services or do wonky work-arounds. Signals filters out the noise, alerting you only on what matters. Manage coverage requests and on-call notifications effortlessly within Slack. But here’s the game-changer…Signals natively integrates with FireHydrant’s full incident management suite, so as soon as you’re alerted you can seamlessly kickoff and manage your entire incident inside a single platform. Learn more or switch today at firehydrant.com/signals
Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration
1 | 00:00 | It's Go Time! | 00:44 |
2 | 00:44 | Meet the guests | 01:34 |
3 | 02:18 | Debugging locally | 10:04 |
4 | 12:22 | Shared computer debugging | 06:00 |
5 | 18:23 | Don't use an else clause | 05:08 |
6 | 23:30 | Team size | 03:10 |
7 | 26:40 | Sponsor: FireHydrant | 02:38 |
8 | 29:18 | Debugging in prod | 03:51 |
9 | 33:09 | How long to keep logs | 04:26 |
10 | 37:35 | The golden boot | 03:14 |
11 | 40:49 | Metrics | 09:16 |
12 | 50:05 | Tracing | 02:16 |
13 | 52:21 | Unpopular opinions! | 00:18 |
14 | 52:40 | Jon's unpop | 00:21 |
15 | 53:01 | Bill's unpop | 10:07 |
16 | 1:03:08 | Matt's unpop | 06:29 |
17 | 1:09:37 | Outro | 01:00 |
Transcript
Hello, everybody, and welcome to Go Time. Today I am joined by two common guests. First we have – not common in the sense of they’re common people, but common as in they’ve both been on the podcast several times… We have Bill Kennedy. Bill, how are you?
Yaaay! That’s the new thing these kids say now. Except they say “Yipee!” I can’t do Yipee, man. I’m too old, so…
You weren’t here for the episode with – Johnny Boursiquot was telling us about rizz, and how he had just learned what it means…
Oh, no, no… Yeah, rizz is okay, but the new thing now is to stand in front of the mirror and say [unintelligible 00:01:15.16]
I think my kids are too young for me to be [unintelligible 00:01:22.01] all this stuff… I don’t have that culture coming into my house just yet.
But you will.
Someday. Alright, our other guest is Matt Boyle. Matt, how are you?
I’m good. And even though you disclosed that I wasn’t, I am a common person.
Okay, if you say so… But I don’t think so exactly. But alright. So today, we’re going to be talking about debugging. And I think this stemmed from - Matt was working on a new course on debugging, and was asking people about their preferences with using the debugger… And I think Bill had reached out and said that he doesn’t prefer to use the debugger. And it was kind of – I think Matt was shocked to see how many people don’t like the debugger, and oftentimes view it as maybe a code smell, whereas there’s others who insist that, even with a great codebase, a debugger is incredibly useful. So we all just wanted to kind of have a conversation around debugging in Go, the tooling, and all the different stuff that might help people out there who are trying to debug their programs.
So I guess to start, let’s start off with the fact that debugging locally and debugging in production are kind of different things. So I think it’s better to start with maybe local debugging, and talk about tools and approaches there, and what you’ve found works well for you, and what things you maybe don’t prefer… So do either of you want to sort of kick it off and explain what you prefer to do?
Maybe I should start with what my philosophy is, which really started the fun…
Okay.
…so the conversation on Twitter. And kind of what I teach, and how I manage developers. So for any backend development that I’m doing - no, let’s actually step back. I have a general rule, okay? If you can’t use the tool to identify a bug in production, you cannot use the tool at your desk when you’re devving. That’s the general rule. Okay? So again, layer one rule - you can’t use it for production, don’t use it at your desk, because you’re not helping yourself. So for backend devving, I’ve never been able to connect a debugger to anything that’s running on a production environment, and use it to help find a bug. So I don’t like to, nor do I want my developers on my team using a debugger at their desk when there’s a problem happening, even when they’re in development.
Now I’ve got a 20-minute rule that sits on top of that. So if you’re working on this for 20 minutes, and you’re still not able to apply enough logs, and your mental model of the codebase is still really weak, to the point where you can’t find this bug, then I will allow you to turn the debugger on for backend dev. However, once you find it, we have to sit down. Because I have to understand why the debugger was needed to do this in the first place. Is the code a mess, or is this a mess? Or you don’t know enough? What is it? I need to understand that.
Now, when it comes to frontend dev, which unfortunately I’ve been doing too much of. My hair has really turned gray over the last few weeks thanks to JavaScript… Chrome has a debugger in the browser, right? So we’ll occasionally use it instead of console logs, because it’s there, and I can use it for any site that’s in production. It’s a tool that’s available to me in production, so it’s okay for me to use it even when I’m devving.
Now, let me say one last thing before I throw this baton over to Matt. I think the debugger is an amazing tool when you’ve been given a codebase that you have zero mental models of, talking even backend, and you need to gain those mental models. Looking at logs and just code may not be enough when you first join a team, or a company, and go crazy with the debugger for your first 2, 3, 4 weeks. But I think your goal needs to be to get off of it as quickly as possible, and start using it less and less and less, to the point where you don’t need it anymore. I can’t tell you how long that should be, because I’ve seen codebases that are ridiculously complex, and it takes months. And I’ve seen codebases that aren’t, and you only need it for a week or two.
Anyway, that was what I was sharing on Twitter, and we had people going back and forth between agreeing with me, and Matt, who didn’t agree with me. So maybe now Matt can kind of give us his sort of philosophies on this.
Yeah, I actually think we’re more aligned than I thought as well, actually. I think that final sentence you said there is kind of my entire point of view, honestly. And I think that’s the difficult thing about Twitter, and why I’m glad we get to have mediums like this to discuss it, is Twitter is not great for nuanced conversation.
[00:05:58.00] But that final thing that you mentioned there is pretty much my whole philosophy. I think the debugger shouldn’t be called the debugger. I understand why it’s called that, and I think it came from a place where it does have tons of value in finding specific bugs… But the biggest use case I’ve found it valuable for is for building mental models of codebases that you’re not familiar with, or maybe you were familiar with, but it’s been a while since you’ve worked on it. And to Bill’s point about complexity, at Cloudflare we quite often move services between teams. I have services in my teams that have been in production since 2017.
Sometimes businesses make conscious choices to take on tech debt that isn’t paid back immediately, and that can mean sometimes code isn’t perfect, it hasn’t got perfect test coverage. Some of it has a lot going on that potentially could be carved out into another microservice… And using a debugger to be able to figure out all of these things, and maintain state that’s really hard to maintain in your head just makes it a really powerful tool to kind of get up to speed with something that maybe would take you much longer if you didn’t have other tools available to you.
There are occasions where I do think it’s incredibly useful for solving specific issues, and I think that’s where I’d maybe disagree with Bill a bit, on using things that aren’t available in production. I think we should take advantage of things that are available to us locally, that maybe aren’t available in production, too. Debugging in production is possible; you can connect to Docker containers and Kubernetes to do debugging… It’s not really recommended; it has performance overhead. But one example I will give you is sometimes we’ll get tickets raised by customers that we need to look into, and they’ll be very specific, niche, nuanced use cases that lead to errors. When you’re working at large scale, these 1% edge cases aren’t really edge cases anymore. They happen all the time, and you need to be able to recreate them. And being able to recreate that locally is really, really powerful. And I wouldn’t want to have to spin up a tracing stack and a Prometheus stack every time to debug it, and being able to kind of use conditional breakpoints and things like that, and make the fake environment that I want, to be able to test a specific scenario, is really, really useful, and I do do that sometimes, too.
You’ve got to remember, I started developing professionally in the early ‘90s. There’s no cloud, and the systems that we were building were deployed on-prem, all over the planet. So I had rules that I – harder today now, because we work remote. But when we were working in offices, I had a rule. I tell this to people, and some people just think I’m an evil dictator, but the rule was that if I caught you in a debugger without my permission, I sent you home for the day.
Now, I paid you. It wasn’t like I sent you home without pay. But I literally told you to just pack your stuff up and we’ll try again tomorrow. And I made that point because anytime there were problems out in customer sites, all you had available to you were your logs. That was it. That’s all you had. And so I get the idea that “Hey, I can reproduce this bug now, so I’m going to use my debugger.” But I think what happens in these cases is you don’t go back and add the extra context and the logs that you needed to identify and find that. And so maybe you fix this one edge case bug because you were lucky enough to be able to reproduce it… Which is hard in these situations. But now you threw the code back into production without maybe a little extra content, or a little extra refactoring, or a little extra something. So now when it happens again in a slightly different way, you’re back to where you were.
So I sent just about everybody on my team home once, in a team that I manage. And I promise you this - they all came back into my office at some point during their time with me, thanking me for doing this to them, because a bug happened somewhere, they got their logs, they were able to fix it very quickly, and the stress levels go down.
So I’ve seen it work with a lot of success, and it takes some discipline… But I’m afraid to use the debugger as well on these backends, because I think you could end up not fixing the real problem, fixing the code, or fixing the logs to be better.
[00:10:05.19] So if I understand this correctly, I think you’re both kind of getting at the same point of whenever debugging is used, it’s still a very important thing to go in and be like “How can I fix this, so that in the future we can figure out what’s going on without the debugger?” Whether that’s adding better logging, or something else. Something needs to be there so that that’s a remedy in the future. And it sounds like, Bill, your stance is very hard against the debugger, mostly to force people to focus on that long-term goal of “We need to be able to fix these things that happen in production.” So it’s not that you hate the debugger, it’s more that you’re thinking “Okay, you need to focus on how we’re going to fix these problems in the future, because using the debugger every time and hoping that we can connect to the production system isn’t really a realistic hope for the future.”
Well, it’s a crutch that’s not helping you gain the mental models that I think you could gain quicker. And then identifying if the codebase needs some refactoring, whether that’s restructuring something, renaming something, doing something. I think it ends up just being “I’ve found it, I’ve fixed it. We’re done.” Because that’s what I’ve seen. And there isn’t that next step of engineering that it needs. One of my favorite quotes is “Debuggers don’t find bugs, they just run them in slow motion.” And I really kind of believe that.
And I think that’s a great thing. I think being able to run a bug in slow motion is really powerful and really helpful. I definitely understand your point of view now that you’ve kind of shared that story… But I think that’s a process issue. If you have a process in place where, if you find an issue, you fix it now, and then you add the missing metrics, monitoring, alerts, and tests so the madness shouldn’t happen again, then I think we’re in alignment. It’s just how to do that last piece about the process to make sure that people do go back and fix it. And that’s a cultural thing. That’s building a culture around making sure those things are really important.
What I teach in my classes is if your first instinct is to use the debugger, stop. Give yourself 20 minutes without the debugger. Just try. Try for 20 minutes to find that bug, reproduce it, and fix it. And if you can’t in 20 minutes, then use the debugger. But what I want you to do is see if you’re able to, over time, eventually start fixing these things without needing to get to that debugger. And if over time you can’t, start questioning what’s happening. But at least try; at least start to try.
So can I give a story just to sort of – I think this is very similar to what Bill’s saying. So I’m just curious if that helps. So when I was in college, I did these programming competitions, where it was like heavy algorithm type problems… And you were on a team that was three people, but you had one shared computer. So the way it worked was like each person would sort of pseudocode solutions to different problems, and then you’d sort of swap who was on the computer to actually code up their solution. And as a result, you’re working on fairly complex algorithms like Min cost, Max flow, things that are a little bit more complicated than just a simple sort, or something like that.
And because you have three people and one computer, you can’t realistically sit there and debug for 15 minutes, because somebody else on your team needs that computer, and this competition is timed. You might have two hours for the whole competition. So what I learned to do in those situations was as soon as you run into an issue, you don’t run the debugger or anything; you print out your code, and you let somebody else take the computer, and then you sit down with your code, and you walk through it and think “Okay, I need to think deeply about my code and figure out where the logical flaws are, and understand what’s going on with this, and see where the issue is.” And I think that’s part of the reason that I don’t tend to use the debugger, at least initially, very often, is because I had all of this training of not doing it, of sitting down and looking through the code… And granted, these were much smaller programs, but it was still kind of the basis that I built off of.
And the other thing that I noticed was that when people did use the debugger, at least if they used it too fast, that instead of thinking about the problem, and the code, and understanding how everything works, they were too quick to kind of fiddle with random things and flounder a bit, just trying to address one small thing without actually addressing the underlying problem…
[00:13:59.23] And I’ve heard several people say that they don’t use the debugger, because they use tests instead. And I guess the counter example I would give to that is doing algorithm type stuff, I’ve noticed that people would do that, they would have test cases and be like “Oh, here’s a test case that fails. I just need to fix my code so that one passes now.” And what they wouldn’t realize is that there was actually – the algorithm they were using, or the solution that they thought was right wasn’t actually correct, and they hadn’t figured that out. But they could adapt their algorithm to sort of handle that one edge case, and then it would just fail on the next one.
So essentially, by forcing yourself to take a step back and not use the debugger right away, it kind of forced you to think about the code, and like how everything is working as a whole, and to start thinking logically, “What are some big things here that might cause this code to break?” And then you could think “Okay, where do I need logging or whatever else to help me clarify if that’s actually what’s going on?”
And it wasn’t that I didn’t like the debugger, it was more just – I mean, in a competition, in the sense that you just didn’t have the computer for it. But on top of that, I’d found that when I jumped into the debugger too quickly myself - and a lot of people I watched code would do this; they wouldn’t take the time to think about the program as a whole, and to really understand it all, and they would just kind of flounder with like a very focused viewpoint of something where they think the issue could be. So Bill, does that sound similar to what you were trying to get people to focus on, I guess?
It reminds me of a story from 1988 when I’m in university… And you’ve got an 80 by - I don’t know, 80-column by 20-row monitor, and you’re trying to write a fairly large Pascal program… And there’s no debuggers there, but it’s not working. And I remember printing out the entire program, and laying out I don’t know how many sheets of paper; 20 sheets of paper. It was the dot matrix, so every page was connected to each other… And laying it out in the hallway in the dorm, and just starting from the top, because I couldn’t get more than 20 lines of code in view at a time… Printing it all out, I think within like 10 minutes of printing – it took longer to print than for me to find the problem, after I laid it all out on the floor, and I could see a larger amount of code… Which - I guess we’re spoiled today with all the monitors, and…
I still work off of one laptop monitor. I do not connect, even when I’m home, to a second monitor. I don’t give myself that luxury. Because it’s so rare, I just had to learn how to even work on a single screen. Some people’s setups are really conducive for doing more, but I’ve always been a minimalist as well with everything. But I think I’m in line with the experience, at least, that you’re sharing.
So Matt, I guess my question for you is “Does that kind of approach–” It sounds like you’re agreeing that actually taking a step back and just not diving in and floundering with the debugger without like a general sense of where you’re supposed to be going is sometimes a good thing to do.
I think it’s always a good thing to do. I think the first module I added to my course was actually debugging by [unintelligible 00:16:55.22] and it’s actually techniques of how you can figure out how a codebase works by looking at it, and how you can pair with people, and how you should pair with people to get better at understanding what’s going on. You should always try and figure it out yourself.
I think the example you gave, Jon, was pretty extreme, and it makes sense in the context… And as you mentioned, it was a fairly simple codebase. As I say, some of the codebases I’ve inherited - they’re pretty complicated. And being able to figure out what’s going on and build those mental models is not something that I – you know, maybe people much smarter than I can work through it very, very slowly, and get there… But there’s also an element of productivity gain I think you get from the debugger, which shouldn’t be ignored… Especially if there is an instance where you’re seeing a customer-facing issue, and it’s in production right now, there’s kind of two steps there… I think kind of related, but they don’t necessarily have to be. One is fixing the issue, and the other is understanding. And I don’t necessarily think those two things do need to happen at the same time, especially if you are trying to kind of get a fix out.
It’s just about having process and discipline to go back in and make sure you do understand it, and you do add whatever test coverage was missing or whatnot. And I think that’s what Bill’s getting at, is if you don’t have good discipline within your company or yourself, and you don’t do that last bit to go back, then actually I kind of understand the point of saying that you shouldn’t be able to use a debugger, because I’ve seen historically that that last step doesn’t happen. So yeah, I do understand, I think. I do see where you’re coming from.
[00:18:21.24] Well, let me add to something Matt said, because one of the things that I do – people come to me and ask me to sit with them for a couple hours a day over a period of a few weeks to just do code review, and help them improve their code, or help them with whatever. And I see this a lot; I see a lot of code that is being written that isn’t as readable as it otherwise could be. And I’m spending time trying to teach people why not to do that.
Here’s a simple example. For me, you should never – okay, there’s exceptions to every rule. So again - okay, ground floor rule, right? Don’t use an else clause. There’s almost never a reason to use it. I saw an exception today. There was an exception I’ve found today, where – actually, no; we got rid of it too, didn’t we? We did. Especially in Go, because the switch statement is so much more readable than an if, especially if else. And Go has the naked switch, where you can do boolean logic on the case.
And so just a conversion like that, where you find your if/elses and you convert them to switches can improve readability almost – I don’t know what the number is. I don’t want to go crazy and say an order of magnitude. But it improves – and every time I show somebody that, they’re like “Oh yeah, this is so much simpler to read.” So there’s a lot of these little things that I try to teach when I’m in these sort of coding sessions, and explaining why I feel it’s more readable, and nobody ever pushes back when they see it.
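The conversion Bill describes can be sketched roughly like this. The function names and thresholds are made up for illustration; the point is only that the same branching logic is written once as an if/else chain and once as a tagless ("naked") switch, where each case is a boolean expression.

```go
package main

import "fmt"

// gradeIfElse classifies a score with a chained if/else.
func gradeIfElse(score int) string {
	if score >= 90 {
		return "excellent"
	} else if score >= 70 {
		return "good"
	} else {
		return "needs work"
	}
}

// gradeSwitch expresses the same logic with a tagless switch.
// Cases are boolean expressions, evaluated top to bottom, and the
// first one that is true wins.
func gradeSwitch(score int) string {
	switch {
	case score >= 90:
		return "excellent"
	case score >= 70:
		return "good"
	default:
		return "needs work"
	}
}

func main() {
	// Both versions should agree for every input.
	for _, s := range []int{95, 75, 40} {
		fmt.Println(s, gradeIfElse(s), gradeSwitch(s))
	}
}
```

In the switch version every condition sits at the same indentation level, which is the readability gain being argued for; behavior is unchanged.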
So I might also be lucky in the fact that I have code that, most of the time, is more readable than maybe the things that Matt is looking at, where he just can’t decipher what’s going on because of the way the code was written. And at that point, you have two choices. Do you touch that function and clean it up into a coding style that’s more readable? Not necessarily. I’ll give you a story behind that. I had a guy that was working for me, who reached out; he was very upset, because his manager berated him, because he changed a function. He “cleaned it up”, because he saw that it was kind of a mess. And the manager just ripped into him. And he says, “I don’t understand.” I go “You know what, dude? That guy was nice. I would have ripped into you harder.” And he said “What are you talking about?” I go “Your job wasn’t to touch that function. Your job was to fix something over here. But you arbitrarily came over here, and you made some changes to something that wasn’t broken. And now we don’t know if you broke it or not. Don’t touch it if it ain’t broke. Just leave it alone. I don’t care if you think you improved it. At this point I can’t sleep at night not knowing whether you broke it or not.” “Well, Bill, we have tests.” “I don’t want to hear the tests thing, because I don’t know if there’s a test that covers the change you just made.”
And so Matt’s bringing up to me a good point, is that you might be - and again, I have the luxury of that rarely happening to me… But you may be in a situation where the codebase is written in such a way that it’s just hard to read, and you almost have no choice but maybe extend the debugger there just to get you through some logic.
[00:21:36.01] And it might not even necessarily be hard to read. It could just be different programming styles. I think one thing that I’ve – I read a book a couple years ago that I’m rereading at the moment; it’s called “Software engineering at Google.” And one thing they do that I really like is - and Jon, I know you’re at Google, so maybe you can talk to this a bit… But they’re pretty strict with coding style, and they have this concept of readers, and they have people who are like certified readers in certain languages. They have one way of writing Go, and if you want to get things merged, you have to get it approved by someone who’s like a reader of the language. And to make changes to the way you write code at Google, you have to find exceptions where the current standards basically don’t work, or they can be improved by changing something. And I think as you get to bigger and bigger enterprises, that actually makes a lot of sense, because there may be hundreds of engineers working on various things at Cloudflare, maybe thousands at Google, maybe tens of thousands at Amazon, maybe even more… And so as these folks need to contribute across all these services, having sort of a standardized way of writing can be really helpful for avoiding some of the things you talked about there, Bill.
Effectively, what the engineer did was apply opinion, right? He was like “I think there’s a better way to do this.” But it maybe wasn’t evidence-based, it was just based on opinion, and a point in time opinion… Whereas if you have a style guide, or a way of writing Go that’s published for your company, it makes it very easy for folks to know where they should and shouldn’t make changes. And you can also lint that as part of your like CI pipeline. You know, “This change actually doesn’t meet our style guidelines.” So I think stuff like that can help here, too.
And I tend to be that person on my teams, where I’m doing the code reviews. I had a guy on my team who just almost didn’t care, and I would just go in and clean up the code. It was easier than me sending a comment at that point just to move a project forward. But I tried to write a style guide one time, and I almost wanted to shoot myself after about two weeks. That’s tedious stuff, and… Linters are really hard, dude… I don’t know.
I think some of that stuff depends on the size of the team… Because Google’s at a large enough scale that the idea is that – they have like a Java readability guide, or a guideline for writing Java… And the idea is that you could jump from team to team within Google, but they all follow the same style guidelines for writing code that looks roughly the same. And Bill, when you were talking about not having an else statement, I don’t remember if it was at Google, but I worked somewhere that part of their style guide for Java was that every if block actually had to have an else block. So you had to like fully enclose “These are the two options, and we want them very clearly there.” So it was the polar opposite of what you were saying.
And while I do prefer the Go approach of generally not using else in 99% of the cases, I definitely benefited from the fact that when you got used to reading it, and you knew what to expect, it was helpful to know that all the code was written in the same way. Even if it wasn’t necessarily the most optimal way, the fact that you got used to reading all the code that was written in a similar way really went a long way to making up for any of its shortcomings. So I can definitely see a benefit to that. But I definitely agree that if you’re a small team, making a style guide and trying to enforce it would be a lot harder than like if you get to a big company, it’s like “This is just a cost of doing business that we have to do.”
At least for the teams that I’ve always run, I have a rule that I should be able to open up any source code file and have no clue who wrote it. Like, that’s a goal for me. If I can identify who wrote this, we’ve got a problem. And why? And I try to identify “Is the rule too much for you, you just can’t handle it? Do we need to simplify something?”
Did that rule apply to comments?
Dude, comments is code.
I feel like writing styles would be a lot easier to pick out at that point.
Well, not the style, but for me a comment is code. So it has to be a proper sentence, proper sentence structure, proper grammar… That’s all I want. I’m not critiquing that part of it you were mentioning. But at least it should be something that’s a complete sentence, that somebody can read.
I was kind of joking around, just making the point that I’ve worked with some people that I could definitely point out their English writing for comments… So I would be able to say this was their function, not because of the code in the function, but because of the way the comment was written… Which is kind of cheating a bit. If I just looked at the code, then it wouldn’t be as obvious.
Yeah, but if I saw a comment that was, say, broken English, I would go fix it at that point. I’d go “Let’s clean this up.” And why not pair with the person at some point when you’re fixing those comments? Because I’m sure they don’t want to write sentences like that… So you’re helping everybody.
Yeah, in some cases it’s not necessarily broken English, it would be – it could even be the opposite. Some people who are very well spoken or very well written would use words that I know other people in the team are absolutely not using. So it goes both ways, I guess. And you don’t want to tell them “Don’t use this word, because most people can probably read it and understand what it means”, it’s just, it’s not words they’re typically using when they’re writing.
Dude, I misspell words… I probably get more PRs on misspelling. But I had a teacher in university who said “Bill, you can misspell the word all you want as long as you consistently misspell it.”
Break: [00:26:28.17]
Alright, so we’ve talked about debugging locally, and kind of gotten the general mindset of it’s important to have the resources to be able to debug things in prod. So when it comes to debugging stuff in prod, what are your favorite tools or approaches to actually figuring out what’s going on and finding a bug?
My whole life has revolved around logs. Not even really metrics, not tracing… Logs. That’s what I get. So I don’t even like logging levels, outside of maybe a logging level producing a certain type of event where maybe I get a push notification or an email or something if I write it as an error level as opposed to info… But I don’t even do logging levels as in – I want to see more logs in dev or production, because I get the logs back, and if the information is not there, I tend to have two choices: add some more logging and throw it back in, and pray that before it happens again, maybe I can find it. Like, that’s been my life, since 1991 at least.
We have a slightly different philosophy than that. I always start with logging. Logging is, I think, the easiest thing to understand where you’re trying to kind of get towards that journey of productionizing your service. Most folks are familiar with logs; it’s a great place to start. And you can build some pretty cool stuff on top of logs. You can actually build entire dashboards using things like Kibana, and they’re actually very useful.
The problem where logs tend to fall down is the volume of them, right? As you start to move towards having more and more services and more customers, and given even just like the rate of potentially error logs that are happening around different teams - that can start to be like a huge amount of data. And logs are not really something you want to sample, in my opinion. When you start to sample logs, you start to lose context, and you might miss things. So compared to some of the other methods that we’ll probably talk about in a second, logs are a great place to start, and you can get some really useful information out of them.
I actually agree with Bill… We tend to have two log levels in my teams. One is debug, the other is error. So we have the ability to like just turn things up a little bit for a while, if we want to… But for the most part, we just log everything to like the error channel. If it’s not something you need to debug, or it’s not an error, you probably shouldn’t be logging it. It’s probably not gonna be that useful to you… And we have to be quite thoughtful about the amount of logs we produce, because we actually do rate-limit teams if they log too much, because it can impact other people’s logs, and we’ve got to be able to share the crazy amount of data that we do need to produce to be able to keep some of these systems running.
Yeah, there’s a signal to noise ratio that you have to find, and you can only do that if you’re using your logs. Even when I run tests, I write the logs to a buffer, and then I throw them out into the screen [unintelligible 00:32:02.21] And I’ll look at the logs and I’ll say “Is this log giving me signal to what just happened successfully, even if it’s successful? Or is it helping me find the bug?” So that’s another reason why I don’t want you in the debugger sometimes, is I want you to work with the logs, and make sure there’s a signal to noise ratio there that’s reasonable.
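The buffer habit described above can be sketched in Go roughly as follows. This is a minimal sketch using only the standard library `log` and `bytes` packages; `processOrder`, `runWithCapturedLogs` and the log wording are hypothetical names invented for illustration, not anything from the episode.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log"
)

// processOrder is a hypothetical unit of work that logs what it does.
func processOrder(l *log.Logger, id int) error {
	l.Printf("order[%d]: started", id)
	if id <= 0 {
		err := errors.New("invalid order id")
		l.Printf("order[%d]: ERROR: %v", id, err)
		return err
	}
	l.Printf("order[%d]: completed", id)
	return nil
}

// runWithCapturedLogs runs processOrder with a logger that writes to an
// in-memory buffer, then returns everything that was logged.
func runWithCapturedLogs(id int) (string, error) {
	var buf bytes.Buffer
	l := log.New(&buf, "", 0) // no prefix or timestamp, so output is stable
	err := processOrder(l, id)
	return buf.String(), err
}

func main() {
	logs, err := runWithCapturedLogs(0)
	// Dump the captured logs and read them: do they explain the failure?
	fmt.Print(logs)
	fmt.Println("err:", err)
}
```

In a real test the same buffer would be printed when the test finishes, so the logs get read on every run rather than only when something breaks.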
You can’t log as an insurance policy, and that’s what Matt is saying. We used to do that when we had 10,000 people on a system, just log everything. Can’t do that anymore, because you get a million people overnight. So logging as an insurance policy and praying that it’s there - it’s not viable anymore.
I feel like it’s one of those things that if somebody’s new and getting started with a new application, that some – like, I’ve found myself doing logging as an insurance policy; not as a long-term plan, it’s more of like I know mentally this is a short-term fix, knowing that as this thing scales, it’s not going to be sustainable. But I think you kind of have to – it’s like any other code, you have to kind of go in knowing that that’s the case… Because when you get to like Cloudflare scale, all of a sudden things that worked at 1,000 users or 10,000 users doesn’t work at billions of users. Like, it’s just a completely different world.
So here’s a good question for Matt… Because I haven’t worked at the scale that Matt has in terms of where he is. I’ve never needed to keep more than four days’ worth of logs, ever. I don’t store them in databases; if I don’t need them in four days, I’ll never look at them. I just know that. And the four day is just because you might have a four-day weekend. If it wasn’t for that, it’d be three days. And so I’ve worked in one company where it was healthcare, where there was regulatory stuff, and I still didn’t have to keep more than four days’ worth of logs.
So I just wonder, if you’re in an environment where you don’t have to keep the logs for more than four days, you might have - and I want your opinion on this, Matt… Maybe you have more flexibility to throw more noise in there. As opposed to “We can’t throw anything away”, which - I’ve never been in an environment. I’ve never had that constraint. So things change. So Matt, maybe talk about how much of this log data and metrics and all the other things that you’re collecting you need to retain.
[00:34:06.22] Yeah, it’s a great question. We definitely don’t retain everything forever, to be clear. I think we definitely retain for longer than four days. I think we retain for sort of between two weeks and 30 days for certain indexes… But we vary it massively depending on what the index is for. Like, we do have regulatory things that we do need to keep around, so they’re all logs that we need to keep around for auditing purposes… And they may stay around for as long as two years, four years. I’m not sure the exact details, but we definitely keep things around for a really long time. Especially as a security company, we want to be sure that any changes that are made are audited, and we can keep those around.
In terms of for like code-specific, let’s say error logs from an application, without getting into sort of more nuanced things… We tend to keep those around for sort of seven days or so. I think we could keep them around for less, honestly. I think four days seems like a pretty decent metric to me. I think the biggest concern for me would be over the weekend periods. If there was a long weekend, and people are out Friday and Monday, and you’re gonna catch them out for four days, you might miss entire chunks. So I think that’s why 7-14 days is probably a good number.
We’re less concerned with the volume of predictable log data. The biggest concern is unpredictable log data, and that’s why we rate-limit. Because even with the best [unintelligible 00:35:12.24] in the world, storage is limited. Say we have a bucket, let’s call it 100 terabytes of storage space… It’s very, very easy for a system to go wrong, and if it starts spitting out millions or billions of log lines every day because of – say someone did deploy and say “Every time a user takes this action, print out success.” And they were doing that for every single person who’d like pass through to Cloudflare edge; like, you would end up with quite literally billions of logs every day.
So being able to rate-limit each service, and make sure that things like that can’t happen is more important than the amount we retain it for. Like, predictable retention is fine; it’s unpredictable retention that’s scary. And we call it the noisy neighbor problem. Like, you’ve got to be a good citizen to those around you. So if we’ve got 100 terabytes of storage data again, and I use 99 of it, and only leave one left for the rest of all the engineering teams at Cloudflare, I’ve been a noisy neighbor; I’ve not been a good citizen of the environment. And that’s the stuff we’re trying to help protect against, is just make it easy for teams not to make those sort of mistakes.
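One way to picture per-service log rate limiting is a fixed budget of lines per time window for each service. The sketch below is an illustration under stated assumptions, not a description of how Cloudflare does it: the `limiter` type, the fixed-window scheme, the budget of 3 lines per second and the service names are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter keeps at most `limit` log lines per service in each time window.
// Lines beyond the budget are dropped, so one noisy service cannot use up
// storage that is shared with everyone else.
type limiter struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	start  map[string]time.Time // when each service's current window began
	count  map[string]int       // lines kept in that window
}

func newLimiter(limit, windowMs int) *limiter {
	return &limiter{
		limit:  limit,
		window: time.Duration(windowMs) * time.Millisecond,
		start:  make(map[string]time.Time),
		count:  make(map[string]int),
	}
}

// allow reports whether a line logged by service at time now is within budget.
func (l *limiter) allow(service string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if s, ok := l.start[service]; !ok || now.Sub(s) >= l.window {
		l.start[service] = now // open a fresh window for this service
		l.count[service] = 0
	}
	if l.count[service] >= l.limit {
		return false
	}
	l.count[service]++
	return true
}

// event is one log line from a service, offsetMs milliseconds after start.
type event struct {
	service  string
	offsetMs int
}

// replay feeds events through the limiter and counts lines kept per service.
func replay(l *limiter, events []event) map[string]int {
	base := time.Unix(0, 0)
	kept := make(map[string]int)
	for _, e := range events {
		at := base.Add(time.Duration(e.offsetMs) * time.Millisecond)
		if l.allow(e.service, at) {
			kept[e.service]++
		}
	}
	return kept
}

func main() {
	l := newLimiter(3, 1000) // hypothetical budget: 3 lines per second
	events := []event{
		{"noisy", 0}, {"noisy", 1}, {"noisy", 2}, {"noisy", 3}, {"noisy", 4},
		{"quiet", 5},
		{"noisy", 1200}, // a new window has opened for "noisy" by now
	}
	fmt.Println(replay(l, events))
}
```

Because each service has its own budget, the quiet service keeps its single line no matter how much the noisy one tries to write.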
So I guess you have other services that are consuming those application logs, and doing that work.
Yeah.
I could definitely see that being challenging, especially if a bug or something gets introduced that all of a sudden spikes the amount of load… It essentially sounds like that writing rate you want to be constant, and constant’s easy. It’s just anything dynamic is where challenges arise. But I think that’s true for most software things, is if you can plan for things, it’s a lot easier, but if you can’t plan for them, it’s a lot harder to do.
Yeah, and this just gets into distributed systems, doesn’t it? Distributed systems are hard, because they’re unpredictable. If you can predict traffic, and the amount of scale you need, the amount of logs you need… We’ll link to it in the show notes, but Bill wrote a really great blog post on limits in Kubernetes. And the only reason you need those things is because of like unpredictable things happening. Like, you need to be able to limit the amount of things that happen to scale up in your infrastructure, and down, and whatnot, because traffic is not predictable, and customers are not predictable, and users aren’t predictable, and you’ve got bad actors… And even if you don’t mean to be, sometimes people in your company can seem like bad actors, because they make a change that - as you said, a bug that logs everywhere. So just being able to have control mechanisms to make it really easy to do the right thing, and for things to operate even when things aren’t going perfectly is really, really powerful, and it’s something you need to lean into as your system and your company gets bigger and bigger.
I worked at a company in the ‘90s, a healthcare company; we were building a patients system and all that good stuff. We had this thing called “the Golden Boot”. It was a size 15 sneaker, maybe even larger, painted gold. And if anybody ever made a mistake that brought down production or the development team from working, you would get the boot. You had to write down what you did on a piece of paper, put it in the boot. The boot would sit on your desk, but you’d get it once you fixed the problem. It wasn’t like you get it right away. You identify the problem, you caused it, you fixed it, you had the boot. You could have it for two seconds, or you could have it for two weeks. Somebody was getting it in another week.
[00:38:15.07] And I always loved this idea of the boot was never the shameful thing. It was really about this idea that we’re all working hard to move something forward, and we’re gonna make mistakes. And as long as you own up to them and fix them, it’s okay. It’s expected. In fact, if I ever had people on my team that didn’t have the boot in the whole year, I would start to question what they’ve been doing all year. Because to me, it was almost impossible not to get it if you were really working hard with the team to push initiatives sort of forward.
So I don’t know what the industry is like today with these sorts of mistakes, but I’d love to make sure that any developer… You know, if you’re making a mistake - you don’t make the same mistake twice, however - it’s not necessarily a bad thing all the time. At least I know you’re working; at least I know you’re trying.
There’s this saying I really like, which is “You win some and you learn some.” And I think it’s really true. Like, if you’re not making mistakes, you’re probably not learning enough. You are going to make mistakes throughout your career. Everybody has a production-breaking story, and it’s just that you make sure you learn from it. There’s that saying, as well as another: “There’s two types of engineers. Those that have broken production and those that are about to.” And I don’t think that should be something you fear. And what I’m talking about is these things we put in place to make sure that logs are rate-limited, and you can’t produce too many of them; it’s exactly for that reason. If a human can make a mistake that takes down production or breaks something critical, it’s probably a technology failure, because they shouldn’t have been able to do that. And if they were able to do that, what can we put in place to make sure that it doesn’t happen again, and the same mistake isn’t repeatable? …which just goes to what Bill was saying - if you allow the same incident or issue to happen 2, 3, 4 times, that’s a process failure, and potentially a failure within your team of how you think about and how seriously you’re taking those things, and you should really try and figure out “Okay, well–” There’s the famous story that always goes around about the intern dropping a production database, right? Like, that intern never should have had access to drop a production database. So that’s not the intern’s fault. That’s a really big company issue, actually, and you need to fix that.
I always felt things got crazy when teams had to start running their own production systems, because - dude, I’ve deleted data in production systems that brought them down, and… I mean, that just taught me that developers should never have access to production, ever. Because we see a bug, we want to fix it right away; we don’t have the patience to wait. And so I’ve always kind of run my teams that way, and when I’m in environments where the developers have access to production, I just wait, because it’s going to happen.
Alright, so we’re getting a little bit far along on the time… So let’s move on to talking about other approaches for production stuff. We talked about logging some. Matt, what are some other approaches that you like to use at Cloudflare?
Yeah, so the next sort of rung on the ladder up is metrics. We’ve touched on this a little bit throughout our conversation already, but I think these for me are probably the most valuable thing you can add to your application. And these are where you can be a little bit more liberal than logs, I think. I think you probably can and should expect your metrics to be sampled, but we tend to use metrics more for looking for patterns, rather than for investigating individual use cases and issues. So I’m a big fan of Prometheus. Hand on heart, being truthful, it’s probably because I haven’t explored so many other options. Prometheus has always been present most of the way through my career, so I’ve tended to use it, and it works really, really well, so I’ve never really explored too many other options. But being able to have metrics like “Oh, user signed up. Transaction failure.” And then you can add cardinality to those things. So you can say like “Transaction failure. Card type: Amex.” And then you can build beautiful dashboards using Grafana, and make alerts based on these things.
[00:41:59.16] So you can say things like “Oh, we’ve just seen a spike in transaction failures for American Express.” That’s a really, really useful signal from your system. And then you can go and jump into other systems; maybe in this instance it is your logs. Or maybe you jump onto American Express’s page, and you see that they’re having an outage too, and it all makes sense, and you can send out customer communications, or put a something in place on your website to make people aware that those issues are ongoing, and you’re working on it.
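The idea of a metric with a label such as card type can be sketched with a small labelled counter. To keep the example self-contained it uses only the standard library rather than the Prometheus client, so `counterVec`, its methods, the metric name and the threshold are all hypothetical stand-ins and not the Prometheus API.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// counterVec is a stdlib-only stand-in for a labelled counter: one metric
// name, with a separate count per label value (for example, per card type).
type counterVec struct {
	mu     sync.Mutex
	name   string
	counts map[string]uint64
}

func newCounterVec(name string) *counterVec {
	return &counterVec{name: name, counts: make(map[string]uint64)}
}

// inc adds one to the count for the given label value.
func (c *counterVec) inc(label string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[label]++
}

// value returns the current count for a label value (zero if never seen).
func (c *counterVec) value(label string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[label]
}

// over returns, in sorted order, the label values whose count exceeds threshold.
func (c *counterVec) over(threshold uint64) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	var labels []string
	for label, n := range c.counts {
		if n > threshold {
			labels = append(labels, label)
		}
	}
	sort.Strings(labels)
	return labels
}

func main() {
	failures := newCounterVec("transaction_failures_total")
	for _, card := range []string{"amex", "visa", "amex", "amex"} {
		failures.inc(card)
	}
	// A crude "alert": which card types failed more than twice?
	fmt.Println(failures.name, "over threshold:", failures.over(2))
}
```

Every distinct label value adds another series to store, which is why the number of label values is worth keeping small.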
But having metrics of patterns that you can see going on in your system, success, failures, errors, with various levels of cardinality is really, really powerful. And there’s a massive asterisk here that I’ll add quickly though, which is this only kind of works if you have a reasonable amount of traffic. So you do need to have a certain amount of scale before metrics are useful. Before that, logs are probably fine. But once you start to see a few hundred, a few thousand requests a day going through your system, these things can be incredibly useful to you.
I’m going to go back to signal to noise ratios here, because I worked at a client one time, and they had the prettiest dashboards all over the floor. And they would walk people in suits on the floor, and show them the dashboards… And I’m standing in front of a dashboard one day, and I’m looking at it, and one of the teammates comes over to me, and I look at him and I go “So which number do you look at on that graph?” And he said “I don’t look at any of that stuff.” I go “I don’t either. But it looks pretty, doesn’t it?” And then we kind of both laugh, right? And I just sit here, looking at this stuff going “Oh, look at the amount of CPU we’re wasting… Network bandwidth, disk and CPU we’re wasting just so we can walk a suit through the office and look like we’re this high-end tech shop.”
So I’m always really cautious on the metrics side, because I just don’t want to produce something just so we have something that looks pretty, and it looks like we’re technically astute, or something. There’s got to be signal there. And the things that Matt said I think are awesome things to be looking at. Like, the patterns and potential problems that can come up. But almost everybody has some sort of graph of CPU and memory, and those types of things, and I’m just wondering, Matt, do you guys look at that stuff too, and create warnings around that stuff?
Yeah, I’d say CPU especially is probably our – it’s probably the metric I spend the most time looking at, I think. CPU is one of those things that’s like very dynamic with what’s going on in your environment, like how much usage is going on, and one of those things that sometimes is a little bit unpredictable.
I think one maybe unique thing for me, or about Cloudflare that maybe other folks don’t have the same concern with, but Cloudflare is a cloud, but as someone who works at Cloudflare, it’s an on-premises deployment, right? So everything we run is in our own data centers. And so we don’t have infinite ability to scale horizontally, or even vertically. We’ve got to really think about how we use our resources. And we have plenty of them, to be clear, but being able to use them smartly, and making sure we’re not sort of wasting cycles, and we’re giving ourselves enough flexibility to deal with spikes in traffic is an interesting problem that I never really had to think about too much until I was here. But we do spend time thinking about making sure we’re optimizing for those things, but also, we have plenty of room to grow when we need to.
Is it possible to give an example of where you’ve looked at the CPU graph and realized “Oh, I need to take action on that somehow.”
Actually, I spent some time looking at this today… So one of our teams is responsible for running our internal CI system, and there was an incident where our CI system stopped receiving job requests for a little period, due to like a network blip. So effectively, engineers were trying to [unintelligible 00:45:33.02] to build, and it wasn’t triggering the build. So what we did was we restarted the system, and it meant that folks were basically effectively trying to get their builds scheduled… And this is kind of known as the “Thundering Herd Problem”, is like something fails, and then everyone just clicks Retry a million times.
[00:45:53.20] So we had our sort of steady job loads coming in, we had lots of people retrying jobs, we had other systems that were trying to interact with us in the sort of 15-20 minutes (we had some downtime), and this system operated sort of between 60% to 80% CPU at normal periods, but because of all this sort of extra traffic that was coming at it, it all of a sudden went to nearly 100, which meant that all the jobs that were already in the queue were starting to slow down, but then new jobs were still coming in, because we had so many builds going on at once that we kind of ended up in this really difficult situation where we are struggling to process the amount we have, it’s slowing down the system, but then more jobs are still trying to get in at the normal rate… So you end up in this really difficult situation where you kind of have to either scale vertically, horizontally, or make a decision to potentially rate-limit or pause the amount of jobs coming in, so you can process what you’ve got, and then you can proceed. So that’s just one example.
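The last option mentioned, limiting or pausing the jobs coming in so the system can work through what it already has, can be sketched with a bounded queue that refuses new work when full. This is only an illustration: the episode does not say how the CI system actually handled it, and `jobQueue`, the capacity of 2 and the job names are hypothetical.

```go
package main

import "fmt"

// jobQueue holds a fixed number of waiting jobs. When it is full, new jobs
// are refused immediately instead of piling up and slowing everything down.
type jobQueue struct {
	jobs chan string
}

func newJobQueue(capacity int) *jobQueue {
	return &jobQueue{jobs: make(chan string, capacity)}
}

// submit tries to enqueue a job without blocking. It returns false when the
// queue is full, which tells the caller to back off and retry later.
func (q *jobQueue) submit(job string) bool {
	select {
	case q.jobs <- job:
		return true
	default:
		return false
	}
}

// drain processes up to n queued jobs and returns how many it handled.
func (q *jobQueue) drain(n int) int {
	done := 0
	for done < n {
		select {
		case <-q.jobs:
			done++ // a real worker would run the build here
		default:
			return done // queue is empty
		}
	}
	return done
}

func main() {
	q := newJobQueue(2) // hypothetical capacity
	for _, job := range []string{"build-1", "build-2", "build-3"} {
		fmt.Println(job, "accepted:", q.submit(job))
	}
	fmt.Println("processed:", q.drain(1))
	fmt.Println("build-3 retry accepted:", q.submit("build-3"))
}
```

Refusing work at the door keeps the jobs already in the queue moving, at the cost of asking callers to retry, ideally with some delay so the retries do not arrive all at once.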
I do have another example from maybe a more customer-facing thing as well… But I worked on a project, it was one of the last projects I worked on before I became an engineering manager, actually, and it was called crawl hints. So you can type in Cloudflare crawl hints; I’ll share a blog post. And it was a project I worked on that I was really proud of. But effectively, what we did is we’d take signals from internet traffic, and then we used that to push information to various search indexes, to let them know that something might have changed in a website.
So if you think about it, before we built crawl hints, what happens is you have all these bots that go around the internet, scraping the internet, looking for changes to content to decide what to rate it on a search engine. And we worked with some of our search partners and we were like “Well, that’s a really inefficient use of resources.” You’ve got all these bots, all over the world, scraping the internet. What if instead we pushed information to them as websites change? So that’s what we did. We built that. Cloudflare is in a unique position to be able to give information about when sites change, so we started to build that.
So what we did is we took a bunch of information from our edge, and we forwarded it to a Kubernetes cluster, and we began to process this information to figure out what was fresh, and then push it to the search engines. So the service I wrote was in Go, but what was really interesting is we kind of did this on a polling loop, for various reasons that are even more confusing; these search engines had – they have rate limits, right? We can’t push too much information to them within a period; we’ve got to kind of throttle ourselves a little bit. So what we were doing is we were pushing this information into Redis, we were storing it there for a bit, and then we were pushing it to the search engine.
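The store-then-push loop described here can be sketched as an outbox that releases at most a fixed number of items per polling tick. This is a rough sketch under assumptions: an in-memory slice stands in for Redis, the limit of 2 per tick is made up, and `outbox` and `nextBatch` are hypothetical names.

```go
package main

import "fmt"

// outbox holds URLs waiting to be pushed downstream. In the setup described
// in the conversation this role was played by Redis; a slice stands in here.
type outbox struct {
	pending []string
}

// add queues URLs that have changed and need to be announced.
func (o *outbox) add(urls ...string) {
	o.pending = append(o.pending, urls...)
}

// nextBatch removes and returns up to limit URLs, oldest first, so that a
// single polling tick never pushes more than the downstream rate limit allows.
func (o *outbox) nextBatch(limit int) []string {
	n := limit
	if n > len(o.pending) {
		n = len(o.pending)
	}
	if n < 0 {
		n = 0
	}
	batch := append([]string(nil), o.pending[:n]...) // copy, so callers own it
	o.pending = o.pending[n:]
	return batch
}

func main() {
	var o outbox
	o.add("/a", "/b", "/c", "/d", "/e")
	// A real service would wait on a time.Ticker between iterations; the
	// loop counter here just stands in for successive polling ticks.
	for tick := 1; len(o.pending) > 0; tick++ {
		fmt.Println("tick", tick, "push", o.nextBatch(2))
	}
}
```

Work done in bursts like this is what produces the spiky CPU profile: nothing happens between ticks, then a whole batch is processed at once.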
So we had these situations where the system I was running was like really spiky. You’d basically have like no CPU usage, and then all the CPU usage. And we traditionally have things like Horizontal Pod Autoscaler set up on our Kubernetes pods, which is a very fancy way of saying if various things start to go up, then you scale the pod horizontally. So we basically have this thing where the pod was like bored, it was bored, it was bored, and then all of a sudden it burst to life, and it has all this traffic, and it has all this work to do… And it was kind of making all these confusing signals and metrics and graphs and things. So trying to figure out a way to – instead of kind of letting the Horizontal Pod Autoscaler scale, or to have a whole bunch of CPU available, it was about trying to tune the workload so that we had the right amount of CPU we needed to do the job we needed to do, without having to kind of like scale it and unscale it, if you will, continuously. And it was kind of like that trade-off between “What’s the easy thing to do, and what’s the right thing to do here to use a limited amount of CPU?” Because as we said, we don’t have infinite.
That makes sense. It basically sounds like that was a case where it wasn’t specifically fixing a bug, it was more just seeing that your current setup wasn’t optimized, and you could optimize it in a way that was more efficient for the company and everybody involved.
Totally. And I think that’s a key thing to call out. There’s a lot of tools we talked about here, like debugging, logs, metrics… Like, they can be used for debugging, but they can also just be used for like system optimization, system health, learning… In some cases you may even find people use some of the metrics to do business metrics. So I mentioned about like transaction successful and AmEx. Like, if you’re a FinTech company and you’re trying to figure out very loosely or quickly what’s the volume of Visa versus Amex versus MasterCard for today, you could use metrics from a service to do that, too. So there’s a wide range of uses here, which is really cool. It’s not just debugging.
Alright, is there anything else you guys want to discuss before we move on to unpopular opinions?
[00:50:05.02] I think the only other thing is after metrics, like tracing is pretty cool, too; distributed tracing. We don’t need to dig into that too much, but distributed tracing is definitely worth looking into once you’ve kind of got those things in place and you’re looking to take things to the next level. Being able to have like a trace of a request, including times, with logs attached to it, and stuff, throughout your system, and potentially building up a dependency map of your systems that talk to each other is also really, really cool, and really, really powerful, but probably unnecessary, unless you are working in a fairly large distributed system.
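The core of a trace, one request ID carried through every step along with how long each step took, can be sketched with Go's `context` package. This is a toy, single-process sketch and not a real tracing library: `trace`, `span`, `startSpan` and the two request steps are hypothetical, and in a distributed system the ID would also have to travel between services with each call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// traceKey is the private context key under which the trace is stored.
type traceKey struct{}

// span records one named step and how long it took.
type span struct {
	name string
	took time.Duration
}

// trace gathers every span recorded while handling one request.
// It is not safe for concurrent use; a real tracer would need to be.
type trace struct {
	id    string
	spans []span
}

// startSpan begins timing a named step and returns a function that ends it.
// If ctx carries no trace, the returned function does nothing.
func startSpan(ctx context.Context, name string) func() {
	tr, ok := ctx.Value(traceKey{}).(*trace)
	if !ok {
		return func() {}
	}
	begin := time.Now()
	return func() {
		tr.spans = append(tr.spans, span{name: name, took: time.Since(begin)})
	}
}

// authenticate and chargeCard are hypothetical steps in handling a request.
func authenticate(ctx context.Context) { defer startSpan(ctx, "authenticate")() }
func chargeCard(ctx context.Context)   { defer startSpan(ctx, "chargeCard")() }

// handleRequest attaches a trace to the context and passes it to each step.
func handleRequest(id string) *trace {
	tr := &trace{id: id}
	ctx := context.WithValue(context.Background(), traceKey{}, tr)
	authenticate(ctx)
	chargeCard(ctx)
	return tr
}

func main() {
	tr := handleRequest("req-1")
	for _, s := range tr.spans {
		fmt.Printf("trace=%s span=%s took=%s\n", tr.id, s.name, s.took)
	}
}
```

Because every span carries the same trace ID, the steps of one request can be lined up afterwards even when many requests are in flight at once.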
My only opinion there is that unless you’re going to dedicate time to look at that information on some regular basis to look for improvements, you’re wasting CPU cycles again. So that’s my only thing there.
I do agree with you there, Bill, that people have a tendency to want pretty graphs… I think I’ve told this on Go Time before, but I used to have Google Analytics on my personal website, that I have blog posts and stuff on… And I eventually took it off, because I realized I wasn’t doing anything useful or actionable with that data. I would look at it to be like “Oh, cool, I’m getting this many users”, but it didn’t affect what I did at all. I wasn’t trying to figure out which article should I write more about, or anything; it was sort of just a random thing that I’d toot my horn about, of like “Oh, cool, I have these high numbers.” And I basically realized, “Why am I making use of these, or collecting all these analytics and making people agree to sending me analytics if I really don’t need it?” and I just decided to remove it. And I’m like “If I’d get to a point where I can act on that data, then great. I’ll stick it back in there.” But in the meantime, I don’t need to collect data I’m not acting on.
That’s reasonable. As I say, I don’t think that’s for everyone, which is why it’s probably not worth spending too much time on… But I think as you start to get to bigger companies, and lots of – I’m talking sort of in the order of like hundreds of systems talking to each other, maybe tens… Like, it probably makes sense to explore the dependency graphs that they can build for you, because they can be powerful to help you figure out like blast radius of incidents, right? And if one service is down, what impact does that have?
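The blast radius question, "if one service is down, what impact does that have?", comes down to walking a dependency graph backwards from the failed service. The sketch below assumes the graph is already known as a map from each service to the services it calls; the `blastRadius` function and the service names are hypothetical, and a tracing system would normally build this map from observed requests.

```go
package main

import (
	"fmt"
	"sort"
)

// blastRadius returns every service that directly or indirectly depends on
// the failed service. deps maps each service to the services it calls.
func blastRadius(deps map[string][]string, failed string) []string {
	// Invert the graph: for each service, who calls it?
	callers := make(map[string][]string)
	for svc, called := range deps {
		for _, d := range called {
			callers[d] = append(callers[d], svc)
		}
	}

	// Breadth-first walk from the failed service up through its callers.
	seen := map[string]bool{failed: true}
	queue := []string{failed}
	var impacted []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, c := range callers[cur] {
			if !seen[c] {
				seen[c] = true
				impacted = append(impacted, c)
				queue = append(queue, c)
			}
		}
	}
	sort.Strings(impacted) // stable output regardless of map iteration order
	return impacted
}

func main() {
	// Hypothetical services and the services each one calls.
	deps := map[string][]string{
		"checkout": {"payments", "cart"},
		"payments": {"ledger"},
		"cart":     {"catalog"},
		"search":   {"catalog"},
	}
	fmt.Println("ledger down affects:", blastRadius(deps, "ledger"))
	fmt.Println("catalog down affects:", blastRadius(deps, "catalog"))
}
```

With only a handful of services this is easy to hold in your head, which fits the point that the tooling starts to pay off once there are tens or hundreds of services.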
I think all things, it definitely makes sense to realize that solutions that work at one scale don’t always work at the next scale. And as engineers, we kind of need to appreciate what scale we’re at, and adjust our mindset based on where we’re going to work, and what scale we’re working at.
It’s reasonable, yeah.
Okay. I’m gonna jump us into unpopular opinions, then.
Jingle: [00:52:22.23]
Alright, I feel like I should add the first unpopular opinion that this theme song – it has eight seconds of basically silence at the end. So… I feel like it’s too long here. Most people probably don’t notice that. So I don’t know which one of you wants to go first. I know Matt was cheating and outsourcing his unpopular opinion, so I don’t know if that should be outlawed or not… But Bill, do you have an unpopular opinion you’d like to start with?
I don’t know, I share so many things on Twitter sometimes that’s unpopular… But trying to upgrade a bunch of code I have, and some of it is JavaScript, and it’s driving me absolutely insane… And yesterday - my wife knows JavaScript; she’s a frontend dev. And we were trying to figure something out, and she was using ChatGPT, and I was using the way I always do everything. And I ended up getting the answer first. Not only that; even what ChatGPT was giving me was close. But I don’t know JavaScript enough… So this is interesting… I’m ranting right now. When it comes to Go, I don’t need any tooling; really any of that tooling. But when it comes to all these other languages that I don’t know, I find myself heavily dependent on them. So I think the unpopular opinion for me is that sometimes I speak about Go-related stuff from a really unique space, where I’m always “Why do you need that tool? I don’t need that tool in Go.” And then I find myself on the JavaScript side, and I’m like “S**t man, I’ve got to have these tools”, because I’d be lost, right? So I think some of my opinions lately about AI and AI tooling that I’m not using on the Go side is maybe wrong.
[00:54:11.19] I don’t think it’s wrong. I think it might be contrarian, but I’m always glad people like you exist, Bill. I’m always glad that someone is giving a counter argument or a counter view to the mainstream view, right? Like, everyone’s like “Oh, we should use AI for generating this, that and the other”, and so having someone go “Well, hang on a second… I’ve gone through my whole career without it.” And what about this use case? What about that thing? Have you thought about this? I think it’s really great to have people like you challenging those things, and challenging my use of the debugger as well. So this whole episode has been one unpopular opinion, in a way… So as I say, I’m glad to have these conversations, and I’m glad these people exist to give a different point of view.
I can definitely relate too, Bill, because I’ve noticed that – I’ve seen great developers, but they jump into a new language, and they’ll have a bug, and like somebody who’s been writing Go for a while can be like “Oh, here are like the two most likely things to look at.” And I think it’s just that experience thing, of like, if you’re in a language that you’re used to, your mind quickly just knows “These are the things I would check first.” And that means that “Oh, I don’t need to use a debugger, because I can just jump to those things.” But when you’re new, it’s like, “I don’t know where to start. I just need to somehow get to that point”, and it’s a lot harder.
But I’ll tell you, the one thing about looking up documentation versus going to ChatGPT is there’s nuance in the documentation when you find the right thing to read. And the tooling isn’t giving you that nuance; it might give you a piece of code that works, but the little nuances that you can find inside of documentation lets the light bulbs turn on, where the tooling, the AI tooling is not turning light bulbs on, because it’s kind of like feeding you.
And I was doing another pair session today with somebody, and I wanted to scream, and I didn’t turn it all off, because I wanted them to start typing some stuff, and instead of typing, they’re just tabbing. And I’m wondering if this person isn’t learning Go as fast as they otherwise could, because Copilot keeps popping fairly good code up, but do they understand that? And I’m starting to wonder if they’re – I think their development is going to be slower than it otherwise could, because they’re not typing stuff. There’s my unpopular opinion. I think developers today are going to be less than what they could be, because they’re not typing this stuff in, and getting a mental model of what they’re typing and why, and they’re just taking what’s there and it compiles and it works.
So do you think it’s going to be like – I always say, developers in the past have had to learn how to google stuff. So essentially, learning what to Google to figure out your bug, and figure out like what your problem is, or to figure out how to solve a solution used to be a skill that you had to develop. And [unintelligible 00:56:46.19] it’s seeming like people are now leaning on ChatGPT and things like that to solve those problems. So do you think it’s going to be something where the great developers are the ones who learn quickly, are the ones who learn to utilize those tools when it’s a good idea, and to learn things on their own and code things on their own at other times, when it actually benefits their learning process?
This is what I think is going to happen. I think we’re gonna have a generation of developers that don’t understand why, but can get it to work. They can get it to work, but they don’t understand why. We’re gonna have this period of time, I don’t know what it is, where we’re going to be in a really worse place than we are now at legacy code. And then, the Googles of the world are going to take products like Service Weaver, or they’re going to take things like Encore, and they’re going to end up building them, putting them behind a prompt, to where you now can engineer an entire CRUD-based service with a prompt… Because it’s not just about Go code anymore; they finally have a project structure and a framework that they can work in. That’s what’s missing right now.
[00:57:54.09] The Go team said that they’re going to focus a lot on training models on what good Go code is. That way, when somebody is using one of these tools, there’s a higher percentage that they’re going to get better code. And I think that’s great. I think they should be doing that if they can, because it’s gonna help the Go ecosystem. If all these tools produce better Go than any other language, and everybody’s going to be dependent on these tools, then you should use Go.
But the next step is going to be how do you take something like Service Weaver, which is a framework, how do you take something like Encore, which is a framework, how do you take maybe another framework, and then plug that in in such a way where I can give a prompt a data model and say “Build me a full-functioning service, with that full structure and deployment capabilities”, and then eventually, I can say “Give me a report that does this, and give me a report that does that.” That’s going to come, but it’s going to come with the fact that we’ve got these frameworks right now that nobody wants to use, because everybody still wants to write their own code, and they’re looking for a home. And that’s where the home is going to be. So when that finally happens, then all that development goes away, because now basically you’re able to get a majority of what you want to build kind of done through a prompt. I don’t know how far away – that could be five years, it could be less. I don’t know. So there’s more unpopular opinions right there.
I feel like – I don’t necessarily have the counter-argument, but I feel like some of this sounds familiar with like people saying there’s going to be Stack Overflow developers… Which I think probably do exist, to some extent; people who can search Stack Overflow enough to get the right code to sort of solve a problem and paste it in there and get it to work, but they don’t really understand what that code is doing. And I think ChatGPT is kind of like a more powerful version of that.
There’s a really famous indie hacker. I’ll share his Twitter handle; Pieter Levels, I think, is his name… And he makes millions of dollars a year by – he’s built AI apps, he’s built remote communities and stuff like that… And he has done all of this without really truly learning how software works. He kind of just like smashes together PHP. And he’s proud of it, by the way. I don’t think he would be mad at me saying this. He’s very much like a great product thinker, and he really leans into what’s popular at the moment, and discovers trends… But he quite often shares opinions on Twitter that are controversial by software engineering standards, and that really rile up the software community. And it kind of makes me laugh, because you’ve got all these software purists telling him “That’s not how you should write software.”
For example, he shared the other day that he’s never done a join in the database. He does all of it in PHP. He does all of his join logic inside his code, rather than in a database, because he just doesn’t see the point of doing it in a database. And as you can imagine, that made tech Twitter very mad.
But he just uses code as a means to an end, and it works for him. And I think there’s probably a lot of people out there like him, who - especially as indie hackers, or someone just working for themselves, you can just get away with not really ever understanding what’s going on, as long as you can kind of smash everything together so it works. And I wonder if we’ll see more of those people or less of them.
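For readers unfamiliar with what "doing the join logic in code" looks like, here is a minimal sketch in Go (the episode's language, not the PHP mentioned above). It is an illustration only: the `User`, `Order` and `UserOrder` types and the `joinOrdersToUsers` function are hypothetical names invented for this example, and nothing here is taken from any real codebase discussed in the episode. It indexes one slice by ID and looks up matches from the other, which is roughly what a database does for an inner join such as `SELECT u.name, o.id, o.total FROM orders o JOIN users u ON u.id = o.user_id`.

```go
package main

import "fmt"

// User and Order stand in for two hypothetical tables.
type User struct {
	ID   int
	Name string
}

type Order struct {
	ID     int
	UserID int
	Total  float64
}

// UserOrder is one row of the joined result.
type UserOrder struct {
	UserName string
	OrderID  int
	Total    float64
}

// joinOrdersToUsers performs an inner join in application code:
// it builds a map of users keyed by ID, then looks up each order's user.
func joinOrdersToUsers(users []User, orders []Order) []UserOrder {
	byID := make(map[int]User, len(users))
	for _, u := range users {
		byID[u.ID] = u
	}

	joined := make([]UserOrder, 0, len(orders))
	for _, o := range orders {
		u, ok := byID[o.UserID]
		if !ok {
			continue // inner join: skip orders with no matching user
		}
		joined = append(joined, UserOrder{UserName: u.Name, OrderID: o.ID, Total: o.Total})
	}
	return joined
}

func main() {
	users := []User{{ID: 1, Name: "Ana"}, {ID: 2, Name: "Ben"}}
	orders := []Order{
		{ID: 10, UserID: 1, Total: 25},
		{ID: 11, UserID: 2, Total: 40},
		{ID: 12, UserID: 3, Total: 5}, // no such user; dropped by the join
	}
	for _, row := range joinOrdersToUsers(users, orders) {
		fmt.Printf("%s order=%d total=%.2f\n", row.UserName, row.OrderID, row.Total)
	}
}
```

The trade-off is that both tables have to be loaded into the application's memory first, whereas a database join can use indexes and return only the matching rows. For small data sets the difference may not matter, which is likely why this approach can work for a solo project.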
I think you get away with more of that for frontend stuff than backend stuff… Because you have that immediate, “Hey, there’s a problem”, and I think you get that feedback faster. But on backend systems, I’ve always taught people to act like you’re building an air conditioner. None of you today have probably been thinking about the air conditioners in wherever you are, because they’re just working. It doesn’t even enter your mind. But if that air conditioner breaks, it becomes all-consuming, because you just can’t function. Your brain just won’t allow it. So backend systems have to be air conditioners at the end of the day, and so they have to get to that level.
I mean, the way I would probably put it is that I’ve seen some of Pieter Levels’ stuff, and I feel like he would be a bad employee at a software company, is the way I would put it. Like, he’s great for the type of things he does, where he comes up with unique ideas to build a product around, and he’s good at marketing that product and getting it to make sales, and doing things like that up to a certain point skill-wise. And he does great with all that, so he’s successful because of it. But the reality is, if he came to Cloudflare and had to work on software there at that scale, I don’t think he would be a good fit at all.
[01:02:16.20] So it’s kind of like just a different - what he specializes in just happens to be very different. And as software engineers, we can kind of view this as “Oh, it’s all software engineering.” But the reality is software engineering at Cloudflare is very different from software engineering something that is maybe just an app that’s meant to make $50,000 a year, versus Cloudflare is likely making – or sorry, maybe it’s 50,000 a month. But I’m sure Cloudflare is making much larger numbers than that, and dealing with much larger traffic loads… So it’s just a completely different problem as a result.
And like Bill said about the backend services, I would also assume that Pieter is using a lot of services that if they were designed the way that he writes code, his whole business would fall apart. But because he’s using things like Stripe, and other things that are built very reliably, all of a sudden he can do the stuff he’s doing, because the things he’s using are very reliable.
Yeah, it’s a really good point.
Okay, Matt, are you going to bring your crowdsourced unpopular opinion?
Yeah, so I did have one which was - it was kind of about AI, so it was a little bit close to the one we’ve already discussed, so I’m gonna pick a different one, which has just come in from someone in my team, actually, which I think is a perfect unpopular opinion. So Andrea Medda - I’ll share his Twitter handle; he works with me at Cloudflare, and last week he showed me his phone, and he sorts all the apps on his phone by color. So they’re all in folders called like orange, and yellow, and green, and blue. And the apps are organized like that. So for example, Duolingo was in the green folder. And I think this is absolute madness, and he thinks this is really reasonable and efficient… And he just can’t come around to the point of view that this is like a crazy thing to do. So his unpopular opinion is sorting apps by color on your phone is perfectly fine, efficient and a reasonable thing to do.
So counter-argument, I guess… What happens when you don’t know the color? Like, the example I’ll give is I have Substack on my phone. And for whatever reason, the notification icon on Substack is green, but the app icon is orange. So for the longest time, when I was looking for it, I was looking for a green icon, until I realized that it wasn’t green. I’m pretty sure it’s an orange app or icon on my phone.
It is, yeah. I tried to challenge him on this. So I did try and test him, but he did stand up to the test pretty well. Like, maybe I just picked the wrong apps, but he was bouncing around between them and finding them very quickly.
It’s hard to test him, because you can take any single person’s phone, no matter how poorly organized, and if you hand them a new phone and let them use it long enough, they could probably start to tell you where an app is supposed to be… Not because the organization system is good, but because they’ve got this muscle memory of “Oh, I’m trying to click here, and that app is not here anymore.” It’s the same with – there’s a lot of people that probably could not sit down and tell you the layout of a keyboard, but if you let them sit there and try to type, they could eventually fill out the whole keyboard, because muscle memory is there and they know where the stuff’s supposed to be. Now, I don’t think that means that our keyboard layout is the best one there is… It just means that we’re all used to it and it works.
Do you both organize apps on your phone into any sort of like pattern or categories? Or what’s your approach to finding an app?
I have so little of them… I only have like two pages of them. But if I have to find something on my iPhone, I’ll scroll down, and just do Find.
That’s exactly what I do. I just pull from the top and start searching.
I think I have an entire page or folder of just like biking or health-related apps… So that’s bad enough on its own. I have some folders that organize by work, or like another folder for like password-related stuff. So like Authy, 1Password, or like authenticator apps for various things that require their own… And like finance, just a couple things like that. But I basically have one page that’s all my most common things that I know, and anything other than that I’m generally swiping to search and just typing in the name of the app… Because at that point, if it’s not on the main page that I normally use, it’s easier just to search.
[01:06:06.20] Yeah, that makes sense. And I guess - random question, that’s tangentially related… But have you all discovered any apps in like the past year or two, that you’re like “Wow, I can’t believe I haven’t had this before now”?
Oh, man… I can’t think of any off the top of my head.
Let me think… The latest app that I’ve installed on my machine –
So I can tell you two that I’ve installed, if you’re thinking, Bill…
I mean, I’m looking. Probably the local news app, because I was just tired of the whole authentication when an article came up in the Miami Herald… And probably the only other most recent - and it’s not that recent - is CBS Sports, because I’ve gotta follow college football, and stuff. I just don’t install apps.
I’d say the two that I installed, that - I guess I was more shocked that it took me that long - was WhatsApp and Telegram. And it was more just because for the longest time I never had a reason to do it, and then between talking with people internationally, and then I’m also traveling to Thailand… And I want to be able to message people at that point. It was like “Oh, I can’t just text at this point. I’m going to need some other solution at that point.”
Yeah. I was surprised when you said you didn’t have WhatsApp or Telegram. They sort of prevail in the UK. I assume they’re everywhere, but I think it’s more common in the UK right now.
In the US they’re not common at all. In the US almost everybody just uses text messages, because they’re used to just talking to people in the US… And it’s one of those things that I saw – I don’t know what it was; it was some little reel on Instagram or something, where essentially they were making the argument that people will say people in the US aren’t well-traveled, but a lot of people don’t realize… Like, Matt, you and I had this conversation, that I took an eight-hour drive in a day and thought that was completely normal, and Matt thought that was a ridiculous amount of time to drive in a day. And I think part of that is the fact that it takes three days to drive across the US, and where you live, it takes two hours to drive across your entire country. So that really shapes your mindset of what is a long drive at that point… Whereas like eight hours doesn’t even get me a quarter the way across the country… So it’s like “Yeah, that’s not very far.”
Yeah… Europe is really small, generally. Not even just the UK. And especially in London, it really is a cultural mixing pot. So we have people from all over Europe and the world here. So I can get to Spain in like a couple of hours. I get to France in like 45 minutes from the right place in London. It’s kind of wild. So I think we take that for granted here, that the geographic borders - if you took the whole of Europe and put it over the US, I’m not sure what it would look like, but I imagine the US might still be bigger than continental Europe, excluding like Russia and stuff… Which I think is mad. It’s such a big country, and I think it’s taken for granted if you don’t live there.
Yeah.
And especially if you’re not in tech, and you don’t have the opportunity to travel a whole bunch… The US is just so big and so diverse.
Basically, what I was going for though, that I kind of fell short on saying, was - the argument was that people say Americans aren’t well traveled, and the counter-argument is that they’re well traveled, they just travel within the US, because every state is like the size of a country in Europe… So it’s just a different way of traveling. And it is kind of wild how different some states can be… Which - I don’t know, it takes some traveling to realize it, but there’s definitely some differences between them.
Yeah, we were looking once, and the UK actually fits into Texas, and there’s still quite a lot of space left. Like, the UK is smaller than one of your states, which is mad when you kind of put things into perspective, for me at least.
Alright, Bill, Matt, thank you both for joining. Everybody who was listening, thank you for listening to Go Time.
Our transcripts are open source on GitHub. Improvements are welcome. 💚