Go Time – Episode #117
Telemetry and the art of measuring what matters
with Dave Blakey
Telemetry is tricky to get started with. What metrics should you be tracking? Which metrics are important? Will they help you predict and avoid potential issues? When is a good time to start? Should you put it off until later? In this episode we discuss some common metrics to collect, how to get started with telemetry, and more with guest Dave Blakey of Snapt.
Notes & Links
- OpenTelemetry - Telemetry software that is the merger of OpenCensus and OpenTracing
- Nova - ADC software created by Dave’s company
- statsd - Open source stats aggregator used often in telemetry collection
- Prometheus - Monitoring system for metrics
Hello, and welcome to this episode of Go Time. I am Johnny Boursiquot, and joining me today are Jon Calhoun, Jaana Dogan, and special guest, Dave Blakey. Welcome to the show, Dave.
Thank you. Thanks for having me.
Jaana, it’s good to have you back. You seem to have been traveling the world, and trying this whole–
That’s not true…!
I only took a week of a break. I’m serious, yeah…
Okay, okay… The pictures - I’m like, “You’re right. Mm-hm.”
But next week I think I will be in London, so if you see me joining with Mat from the same room or something, don’t get surprised…
Oh, that would be kind of cool. But you see, you are traveling the world. You’ve been in more places in the last month than I have this year. Actually, it’s still a new year, so… [laughter] Jon, how are you, man?
I am doing well.
You’ve been busy lately. You’ve been releasing courses, and trainings, and everything.
Yeah, I’ve been busy… I mean, the truth is I’ve been busy for a while; it’s just everything gets finished at the same time, so then it looks like I’ve been especially busy now.
It’s good to ship, so… Yeah. Good stuff. Dave, we know little about you, but we’re about to fix that. We know a little bit about what you do and who you work, or should I say whom you work for… But in today’s episode, we’re actually gonna cover a topic that’s special, and near and dear to my heart, as a cloud engineer or operator… As I sometimes call it, telemetry, and you’re gonna have to forgive my accent here. The French is trying to come out… Telemetry… Yeah, you can correct me [unintelligible 00:03:13.11]
I think that was perfect.
Okay, good. Thank you. Telemetry is something that we rely on quite a bit if we’re doing cloud operations work, but it’s not just for that. The use case for telemetry is much broader, and you actually are working on something that actually involves quite a lot of that… But before we dive into that, I’d like us to level-set a little bit; let’s talk about what telemetry is, what it’s used for, and who is best positioned to leverage it.
Absolutely. Telemetry is an extremely broad term, as you said. Obviously, here we can narrow it into at least computing and modern computing, I suppose, but at its core, it really means collecting and storing, and I like to think using, but not necessarily… But collecting and storing data from remote sensors, or remote machines, remote computers. So from the left side being how much electricity or gas is your water heater using, to the right side being what’s the response time of an API back-end that you have. It’s around getting that information, and all of that kind of stuff.
[04:26] I think more recently it’s around a lot of the techniques and the areas in which to add telemetry to applications while you’re building them. As you all know, when you start to scale things out, it’s often too late, if you haven’t done it already.
So it’s one of those things where it may not feel important in the start of a project, but when you need it, you’re wishing you had put it in from the start, kind of thing… Yeah?
Exactly. And sometimes you don’t know you need it, until you’re way too far gone. When you start having bottlenecks, you start having problems… I mean, that’s the traditional problem that telemetry tries to solve, but now that is with security concerns, and with scaling, scaling out, scaling in… Things are not static anymore; telemetry has started to play a much bigger role than just saying “Why is my webpage slow?” It’s much more than that now.
Is it because there’s not enough established practices around this? I think a lot of companies I’ve seen - especially when they’re first bootstrapping - they don’t necessarily care about anything around production excellence, or SRE practices… And telemetry plays a big part in this, but they’re maybe thinking that it can be such an afterthought, and then eventually feel very overwhelmed by the amount of work they need to do.
Exactly. And the worse that becomes, it’s like it’s this kind of snowball effect… Because you just start randomly adding telemetry, and it’s not really – you’re ultimately trying to solve a problem, and I think the best way to look at telemetry is to try and store all of the important components, and even some of the ones you might not think are important yet, for an application. Not to solve a crisis now, but to shine a light on this process as a whole.
A lot of the time, if you add – you say “Well, why is my website slow?” A simple solution would be to say “I need more web servers.” But it might be the database server that’s slow. That’s a common example, but you get the idea. It could be something that’s not immediately obvious to you, and the more data you have, the easier it is to track that kind of thing down… And that’s just around scalability.
What is the best approach in terms of planning? Should we start thinking about this at the design process, or what is the best time to start thinking about telemetry?
Absolutely. I think when someone is launching a large-scale project, it’s probably something that they’re all considering already… But I think maybe it’s more appropriate now to say “What about a small one or a medium one that’s got some growth?” Obviously, if you’re doing an internal wiki for your own five-man office it’s probably not a big problem… But if you’re building a project, I think it’s important to start from the first line of code, basically. And it can be very simple - you can just have a class in your system that you can send random gauge information to or metrics to, or whatever… And once that exists, it’s very easy to just parse that information there.
That’s what I would advise - make it super-easy to send just a slug, and the value, and that it’s a gauge or a metric or a point in time measurement to a response time. And even if you don’t actually send those anywhere on day one, at least you’re starting to put it in the code.
At our business we have what we call our code contract, and it’s this set of nine rules for everything that we write… And one of the rules is that everything has to use this telemetry helper that we put in, because we knew at some stage it would become a problem. And it uses very little development work, you know?
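A minimal sketch of the kind of helper described here: one place that accepts a slug, a value, and a kind, and that can discard everything until a real destination exists. All names (`Recorder`, `Sink`, `DiscardSink`, `MemorySink`, `Gauge`, `Timing`) are hypothetical and are not taken from Snapt's code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Kind says how a measurement should be interpreted.
type Kind string

const (
	KindGauge  Kind = "gauge"  // a value sampled at a point in time
	KindTiming Kind = "timing" // a duration, stored in milliseconds
)

// Point is one measurement: a slug, a value, and its kind.
type Point struct {
	Slug  string
	Value float64
	Kind  Kind
}

// Sink is wherever points end up.
type Sink interface {
	Write(p Point)
}

// DiscardSink drops every point, which is enough for day one.
type DiscardSink struct{}

func (DiscardSink) Write(Point) {}

// MemorySink keeps points in memory so they can be inspected.
type MemorySink struct {
	mu     sync.Mutex
	points []Point
}

func (m *MemorySink) Write(p Point) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.points = append(m.points, p)
}

// Points returns a copy of everything recorded so far.
func (m *MemorySink) Points() []Point {
	m.mu.Lock()
	defer m.mu.Unlock()
	return append([]Point(nil), m.points...)
}

// Recorder is the one helper that application code calls.
type Recorder struct{ sink Sink }

// NewRecorder falls back to DiscardSink when no sink is given.
func NewRecorder(s Sink) *Recorder {
	if s == nil {
		s = DiscardSink{}
	}
	return &Recorder{sink: s}
}

// Gauge records a point-in-time value under slug.
func (r *Recorder) Gauge(slug string, v float64) {
	r.sink.Write(Point{Slug: slug, Value: v, Kind: KindGauge})
}

// Timing records a duration under slug, converted to milliseconds.
func (r *Recorder) Timing(slug string, d time.Duration) {
	ms := float64(d) / float64(time.Millisecond)
	r.sink.Write(Point{Slug: slug, Value: ms, Kind: KindTiming})
}

func main() {
	mem := &MemorySink{}
	stats := NewRecorder(mem)
	stats.Gauge("users.total", 10)
	stats.Timing("api.response", 200*time.Millisecond)
	for _, p := range mem.Points() {
		fmt.Printf("%s %s %.1f\n", p.Kind, p.Slug, p.Value)
	}
}
```

Because callers only see `Recorder`, swapping `DiscardSink` for a real destination later should not require touching the call sites.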
[07:58] Yeah, I’ve seen a lot of cases where people are debating what to collect, and how to collect, and so on… I think there’s also some sort of confusion around what matters for the success of the project, and so on. So you have to be more holistically maybe thinking about all the specs - availability, debuggability - in order to at least have a better understanding of what you wanna collect and how you are going to be utilizing it.
A lot of times, small companies end up failing because they start too late, and so on… But it’s very important to start thinking about this at the very early stages.
Exactly. I’d rather be in a situation where you are collecting some information that was useless, or you were collecting something in not the most efficient format or something, than you either were collecting nothing or had this telemetry paralysis, where you feel like “Well, we’ve gotta put so much time and effort into this.” Ultimately, I think just do what suits the project and the business, and just make sure you’re doing something, and it’ll evolve.
I’m interested in pulling that thread a little bit… Jaana, you kind of touched on it when you mentioned basically trying to keep track of what’s important. When I think of the things that matter to me as somebody who’s looking after infrastructure, versus something that’s important to perhaps a back-end developer or a front-end developer, and ultimately the end user, who has to use whatever it is we’re putting in front of them - there are different things that are important to us at those varying levels.
I’m assuming telemetry is useful in all these areas, but ultimately, the business cares about the end user experience… So how do you approach – when you gather a team and you’re about to start doing the work, at what point do you start carving out the things that are important for the different teams and the different stakeholders?
I think it’s iterative. Again, I would say that a very large project would function quite differently. That would be part of the design decision, and it would be built-in from the foundation, and it would be quite a complicated approach to telemetry, because it would need that… But in medium to smaller projects - and by medium, I mean it still could be a large project. I consider our product to be medium in size, and it’s six million lines of code. In that type of project, I think you can iterate. So we start by saying “Okay, let’s make sure from the code point of view developers are storing the metrics that they think are important, and we can always add more. Let’s make sure from a metric performance point of view and from a systems and scalability point of view we’re storing the things that we think are important.” And then when we have problems, it becomes immediately obvious where telemetry is missing or where telemetry is useful. Because that’s the funny thing, you don’t always know.
Let’s say you’re looking at a cloud engineer point of view, and you say “Okay, my telemetry is showing me that my CPU usage is 80% on my ten cloud instances at Amazon, and I probably need another 20 instances.” But you might not notice that there’s 200% or 2,000% more failed logins per second than there normally are… And actually, what you’ve got is a brute force attack. Now, if you’ve got all of these metrics - this is jumping forward a bit, but I think the best thing with telemetry is to store as much as you can, and have somebody look for anomalies.
If you have that type of setup, where you’re saying something is a statistical anomaly, then when you go to say “Okay, what’s going on here?”, if those things pop up - far more requests per second from a certain country, way more failed logins than usual - and then all of a sudden you realize that the problem is not what you thought it was… Or you don’t find it, and you have to start digging and digging, and then you implement a way of tracking that in the future.
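One simple way to flag a statistical anomaly such as a jump in failed logins per second is a z-score against a recent baseline. This is only a sketch: the function names, the sample numbers, and the threshold of 3 standard deviations are assumptions, and real anomaly detection is usually more involved.

```go
package main

import (
	"fmt"
	"math"
)

// meanStd returns the mean and population standard deviation of xs.
func meanStd(xs []float64) (mean, std float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		std += (x - mean) * (x - mean)
	}
	std = math.Sqrt(std / float64(len(xs)))
	return mean, std
}

// IsAnomaly reports whether current sits more than threshold standard
// deviations above the baseline mean. With a perfectly flat baseline,
// any value above the mean counts as anomalous.
func IsAnomaly(baseline []float64, current, threshold float64) bool {
	mean, std := meanStd(baseline)
	if std == 0 {
		return current > mean
	}
	return (current-mean)/std > threshold
}

func main() {
	// Hypothetical failed logins per second over the last few samples.
	baseline := []float64{4, 5, 6, 5, 4, 6, 5, 5}

	fmt.Println(IsAnomaly(baseline, 6, 3))   // an ordinary second
	fmt.Println(IsAnomaly(baseline, 105, 3)) // roughly 20x the usual rate
}
```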
So if you’re starting off with your telemetry, and say you don’t have a clue what to start with, like you’re somebody who just hasn’t gone about doing it, what are the first few metrics you would suggest they try out?
[11:48] I would say it’s probably broken up into three areas. The first area you’ve got is your actual server. Whether it’s a cloud instance, it’s a VM, it’s a container, whatever it is, the actual system that’s hosting it - most people don’t realize how far down the journey of telemetry they are… Because they can tell the CPU usage on there, they can tell the memory, or they can tell if it’s online or offline. That’s a data point, right? Like, is the server working or is the server not working? So you start to monitor things like that, and you start to have some basic understanding of your server, obviously, and servers.
The second thing is your network. That’s where most scaling and telemetry information and data becomes very useful. The time of up and down sites is long gone, but what if your website’s or your API’s response on average is 200 milliseconds, and there was a deployment last night and now it’s 400 milliseconds? This is very important information to have. Simple things like HTTP reply times, and your HTTP reply statuses, for example. How many 200 codes are there, 400 codes, 500 codes… Just picking up that there’s 5% of responded pages are errors, versus 0.1%, can really help you to shortcut an issue.
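In Go, reply times and reply statuses can be captured with a small piece of `net/http` middleware. The sketch below is illustrative only: `HTTPStats`, `Record`, `Count`, `ErrorRate`, and `MeanLatency` are hypothetical names, and it uses `net/http/httptest` to simulate requests instead of starting a server.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// HTTPStats counts responses by status class and sums their latency.
type HTTPStats struct {
	mu      sync.Mutex
	byClass map[int]int // 2 -> number of 2xx responses, 5 -> 5xx, ...
	total   time.Duration
	count   int
}

func NewHTTPStats() *HTTPStats {
	return &HTTPStats{byClass: make(map[int]int)}
}

// Record stores one response's status code and duration.
func (s *HTTPStats) Record(status int, elapsed time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.byClass[status/100]++
	s.total += elapsed
	s.count++
}

// Count returns how many responses fell in a status class (2, 4, 5, ...).
func (s *HTTPStats) Count(class int) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.byClass[class]
}

// ErrorRate returns the fraction of responses that were 5xx.
func (s *HTTPStats) ErrorRate() float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.count == 0 {
		return 0
	}
	return float64(s.byClass[5]) / float64(s.count)
}

// MeanLatency returns the average recorded duration.
func (s *HTTPStats) MeanLatency() time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.count == 0 {
		return 0
	}
	return s.total / time.Duration(s.count)
}

// statusWriter remembers the status code written by the handler.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// Middleware times each request and records its final status.
func (s *HTTPStats) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(sw, r)
		s.Record(sw.status, time.Since(start))
	})
}

func main() {
	stats := NewHTTPStats()
	h := stats.Middleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/boom" {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, "ok")
	}))
	for _, path := range []string{"/", "/", "/", "/boom"} {
		h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", path, nil))
	}
	fmt.Printf("2xx=%d 5xx=%d error rate=%.2f mean=%s\n",
		stats.Count(2), stats.Count(5), stats.ErrorRate(), stats.MeanLatency())
}
```

Comparing `MeanLatency` and `ErrorRate` before and after a deployment is the kind of check the 200 versus 400 millisecond example points at.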
And then the final one is the real key, the fundamental area of telemetry, which is in your app. That would be starting to track the stuff that’s important to you. If you’ve got a key-value store that you use for caching, track what’s your cache hit rate. If the cache hit rate hits the floor, then you know something might go wrong. What’s your database’s response time like? How much cache are you storing? How many logged in users are there? All the components that make your site work, you just start tracking and tracking and tracking. And it’s so easy… You literally have a function called stats.Gauge, and you call stats.Gauge("users.total", 10), stats.Gauge("users.total", 12). You just start to track that stuff, and you find out that actually it’s not hard to implement. The much harder part is taking responsibility for that data and using it. But that can come. The first part is storing it, having it available and understanding how it impacts your service or application.
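As one illustration of in-app tracking, a cache hit rate can be derived from two counters kept next to the cache itself. `CountingCache` is a hypothetical type for this sketch; it is not safe for concurrent use, and a real cache would need locking or atomic counters.

```go
package main

import "fmt"

// CountingCache is a map-backed cache that counts hits and misses.
// It is a sketch only and is not safe for concurrent use.
type CountingCache struct {
	data   map[string]string
	hits   int
	misses int
}

func NewCountingCache() *CountingCache {
	return &CountingCache{data: make(map[string]string)}
}

// Set stores a value under key.
func (c *CountingCache) Set(key, value string) {
	c.data[key] = value
}

// Get looks up key and records whether the lookup hit or missed.
func (c *CountingCache) Get(key string) (string, bool) {
	v, ok := c.data[key]
	if ok {
		c.hits++
	} else {
		c.misses++
	}
	return v, ok
}

// HitRate returns hits divided by total lookups, or 0 before any lookup.
func (c *CountingCache) HitRate() float64 {
	total := c.hits + c.misses
	if total == 0 {
		return 0
	}
	return float64(c.hits) / float64(total)
}

func main() {
	cache := NewCountingCache()
	cache.Set("user:1", "alice")

	cache.Get("user:1") // hit
	cache.Get("user:1") // hit
	cache.Get("user:1") // hit
	cache.Get("user:2") // miss

	// This is the number that would be sent as a gauge.
	fmt.Printf("cache.hit_rate %.2f\n", cache.HitRate())
}
```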
Whose responsibility is it to care for that data, as you suggest? Is it the operations team, is it the engineering team, is it the product manager, is it everybody?
In maybe traditional structures, in a traditional way you’ve got IT operations, you’ve got security and you’ve got development, kind of separate houses. The more siloed it is, the more likely it is that they will have pieces that they care more about than others do. But that’s kind of like a crux of telemetry, because like I was saying, if you don’t see the whole picture, you might not see – if you were just in IT ops, you’d launch ten more servers, instead of realizing that you’ve got a security problem.
We work primarily with what you would call more modern types of deployments, I guess; a lot of Kubernetes type stuff, cloud-native people, things like that… And interestingly, the use case there is quite different. It’s a word I hate using, because everybody has made it just mean everything, but it’s like a DevOps type of role. What that means to me is someone that cares about the application as a whole. So they don’t care about the code, or the server it runs on, or the cloud they use, or the firewall, or the load balancer, they care about the whole application… And that team will normally be in charge of the telemetry and monitoring of it, and everything… At least in our experience.
I think one of the other questions is - you know, you mentioned a bit about anomalies, or some teams, some organizations prefer to set some SLOs, and they produce some alerts as soon as some of the metrics are out of the boundaries… And I think each organization has a different strategy. Some organizations prefer a monitoring team or an SRE team to be reactive to the alerts, and then they escalate it or delegate it to other teams - to the first-responders, versus other folks, and so on. It has a lot to do about the organization and the way the company/organization works, right?
[16:02] You make a good point, because I was talking almost from the angle of saying “There’s something wrong. Let’s look at the telemetry”, but the next kind of natural step from that is exactly like you’re saying… It’s to rather have the data be presented to people when things are picked up, like anomalies and that. And yeah, the bigger the business, the more likely there is a team that is responsible for that… But that doesn’t mean that smaller businesses can’t use open source free tools to achieve very similar types of results.
We talked a lot about metrics, but you specifically mentioned that our systems are getting larger, and there are a lot of different components… Recently, in the last decade or five years, distributed tracing and logging especially, correlated with a trace ID or a request ID has also become very popular in terms of collecting signals. Some organizations at least use them as another source of telemetry… What do you think about that?
Yeah, I think it’s critical the larger the organization is. The reason why I am kind of choosing my words carefully is because it can be quite difficult to achieve in an early project, or to add to an existing project. You’ll often find that level of scrutiny is quite challenging for a smaller business (or a medium size business even) to achieve. We’re jumping forward a bit, but if you take a look – we ourselves could have 50 devices at a client, and each device could be generating 100,000 lines of logging a second… And for a company to actually store that information is often beyond their ability. That’s the nice thing about – if you have all the hooks in to get this information, then when you need it, you can grow into it.
I’m interested in understanding the telemetry landscape a little bit right now… You mentioned obviously at your company, at Snapt, that’s the business you’re in, so you likely have an understanding of the landscape right now… We hear about these projects, but we don’t really quite know where they fit in. I’m thinking of things like OpenTelemetry, OpenTracing, OpenCensus… There’s a lot of these open source projects that all seem to have overlap in terms of the problem they’re trying to solve… But to me, it seems like some teams decide “Okay, we’re gonna adopt OpenCensus”, whatever that means; then they go find the clients, they find the servers, and they do their thing… And now you wonder “Okay, when there’s a standard, if there is a standard - do we retro-fit everything?” It seems like right now there’s a lot of churn in that space… Can you lay out the landscape for us here?
Yeah, there is a lot. It suffers from that same DevOps state where people have wound up building their own in a lot of situations. I don’t mean building the entire stack, but a lot of tooling and custom work to get things to work the way they want. By far, what we see the most are people using things like Prometheus and Grafana and stuff like that to dashboard and visualize stuff… Because most of the companies we work with, it will be mostly internal, their collection of the information and their ability to send it somewhere… Because it will be from different apps, different stacks. It could be some data coming from Microsoft servers, some coming from containers, some from Amazon… But they’ll often have a single source of dashboarding and reporting and analysis for that, so that will usually be something like Prometheus, or something like that, where then they can automate a lot of the anomaly detection, and visualization of that data, and stuff like that.
So it’s a pretty developed space in terms of how you see that information once you start to store it and keep it in a time series database, and all these kinds of things… But it’s really up in the air with how you track, how you communicate. Probably the biggest thing we see are people that are just using StatsD to stream telemetry data to something, and then collect it and ultimately output it into some sort of dashboarding solution.
[20:05] Yeah, as a person who has some experience in this field - I used to work on OpenCensus, and I think we were trying too hard to maybe unify the approaches; unify the export types, the exported data, or unifying the library space, or trying to establish standards… But it seems like the field is very crowded, and it’s just hard to – maybe it doesn’t make much sense, because at the end of the day, all you care is getting the data to a dashboard and be able to utilize the data… And I think that’s primarily what the organizations care about; they don’t necessarily care about the export format or the library they’re using to instrument…
In a lot of cases they don’t even care about the reliability of it, and that’s one of the challenges with that space as well. If your telemetry data is something that you’re collecting every second or multiple times a second, losing some of it doesn’t matter, in most cases. If we for example are writing the response times of an API the whole time, we stream that information through UDP, and we don’t even check if the destination got it… Because we’ll pick up that no data has been plotted for five minutes, but if one packet drops, a lot of the time with telemetry that’s not a big problem. That’s often internally developed, how people get that data out… And much more, the standards seem to be on the display of it and the detection of it. But like you guys mentioned, there are a lot of projects starting out there, so maybe it will clear up on that.
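A fire-and-forget UDP send takes only a few lines of Go. The sketch below formats metrics in the StatsD line format (`name:value|g` for a gauge, `name:value|ms` for a timing) and deliberately ignores write errors. The `UDPEmitter` type and the address `127.0.0.1:8125` are assumptions for illustration; nothing needs to be listening for the program to run.

```go
package main

import (
	"fmt"
	"net"
)

// FormatGauge renders a gauge in the StatsD line format "name:value|g".
func FormatGauge(name string, value float64) string {
	return fmt.Sprintf("%s:%g|g", name, value)
}

// FormatTiming renders a timing in milliseconds as "name:value|ms".
func FormatTiming(name string, ms int64) string {
	return fmt.Sprintf("%s:%d|ms", name, ms)
}

// UDPEmitter sends metric lines without waiting for any acknowledgement.
type UDPEmitter struct {
	conn net.Conn
}

// NewUDPEmitter opens a UDP socket towards addr. No handshake happens,
// so this succeeds even when nothing is listening.
func NewUDPEmitter(addr string) (*UDPEmitter, error) {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return nil, err
	}
	return &UDPEmitter{conn: conn}, nil
}

// Send writes one datagram and ignores the result on purpose:
// a dropped sample is acceptable for high-frequency telemetry.
func (e *UDPEmitter) Send(line string) {
	_, _ = e.conn.Write([]byte(line))
}

// Close releases the socket.
func (e *UDPEmitter) Close() error {
	return e.conn.Close()
}

func main() {
	// Assumed address: 8125 is the port StatsD conventionally listens on.
	emitter, err := NewUDPEmitter("127.0.0.1:8125")
	if err != nil {
		fmt.Println("telemetry disabled:", err)
		return
	}
	defer emitter.Close()

	emitter.Send(FormatGauge("users.total", 10))
	emitter.Send(FormatTiming("api.response", 200))
	fmt.Println(FormatGauge("users.total", 10))
}
```

The trade-off is the one described above: a single lost packet goes unnoticed, so detecting a dead pipeline relies on the receiving side noticing that no data has arrived for a while.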
Often, when people build their own things it’s because there is a need, but you also have to deal with the fact that there are so many people that build so many things now that it’s – it is a bit of a web…
True. Also, there’s a lot of pre-packaged software and cloud platforms that can export a lot of telemetry, and there’s no standard around where they would export, or what data format it would be… It would be nice to have some sort of standard at least, so we can go and talk to all these open source projects or the cloud providers to export some telemetry out of the box… Because everything is a black box when you have a prepackaged something, or like a vendor solution, right?
Exactly. We have that problem with our product… We do our own dashboarding for our servers and systems and things, but when we ultimately wanna let people integrate that into their DevOps tooling or their environment, it’s like how do we get that information out? So you provide a REST API, then you provide a webhook URL… Because you’re trying to find some way to fit into what they do, and there’s no standard… That’s 100% correct.
If you’ve been in this space for any length of time, you’re gonna hear the term “observability” quite a bit, right? And we know that telemetry plays a part of that, but oftentimes it feels like it occupies a very large slice of the pie. I’ve heard people talk about the pillars of observability, and metrics and tracing and logging, and all that… What are the concerns that one has in terms of observability? When I say I want observability, what am I really asking for here, and how does telemetry help answer these questions for me?
I think the term observability – like you say, there’s pillars of it, there’s all these things… But to me, it has seen a rise in popularity lately because of exactly what we were just saying, this black box effect that things have. So really what it is – let me give you an example in our world… You’ve got one web server that runs your API, and then you have to scale that out and you’ve now got two. Then imagine you scale that out a lot, and instead you’ve got 30, and they’re in multiple data centers… And it’s all going through some load balancer, and someone says to you “Oh, every time I use my Android phone, if I’m in South America, when I try to log in I get a 500 error.” That’s to me observability. It’s like a needle in the haystack. The problem just becomes so compounded when everything is being funneled through one point and then split off into all different directions.
[24:23] The rise of observability I think actually comes out of trying to problem-solve, trying to debug issues, and not being able to see them… Where telemetry came into play and you said “Okay, you know you have an issue. Let me look at the general health and well-being of my system at large in order to be able to see where I should focus down.”
If you were looking for that problem, perhaps you will notice that Azure data center has 5% errors on requests, whereas your Amazon one has 0.2%. So you know “Hm, it seems like something is going on in my Azure data center”, and I can start to drill down there. And that’s where then the rest of observability comes in, like how accurate can your logging be; can you actually look for all 500 errors that went to all the web servers in this data center, and then find the web server that it went to and dig into that…?
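The comparison between data centers comes down to grouping requests and dividing errors by totals. A small sketch with hypothetical names (`Request`, `ErrorRates`) and made-up sample data that mirrors the 5% versus 0.2% example:

```go
package main

import "fmt"

// Request is a minimal record of one served request.
type Request struct {
	DataCenter string
	Status     int
}

// ErrorRates returns, per data center, the fraction of requests
// that ended in a 5xx status.
func ErrorRates(reqs []Request) map[string]float64 {
	totals := make(map[string]int)
	failed := make(map[string]int)
	for _, r := range reqs {
		totals[r.DataCenter]++
		if r.Status >= 500 {
			failed[r.DataCenter]++
		}
	}
	rates := make(map[string]float64, len(totals))
	for dc, n := range totals {
		rates[dc] = float64(failed[dc]) / float64(n)
	}
	return rates
}

func main() {
	var reqs []Request
	// Made-up traffic: 5 errors in 100 requests versus 1 error in 500.
	for i := 0; i < 100; i++ {
		status := 200
		if i < 5 {
			status = 500
		}
		reqs = append(reqs, Request{DataCenter: "azure", Status: status})
	}
	for i := 0; i < 500; i++ {
		status := 200
		if i < 1 {
			status = 500
		}
		reqs = append(reqs, Request{DataCenter: "amazon", Status: status})
	}

	rates := ErrorRates(reqs)
	fmt.Printf("azure  %.1f%%\n", rates["azure"]*100)
	fmt.Printf("amazon %.1f%%\n", rates["amazon"]*100)
}
```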
But at its core, observability to me is just really being able to see through that veil, to actually see what’s really happening, what’s the traffic look like, what are the valid requests, what are the invalid requests, where are things breaking, and not have it obscured by a cloud service, or a firewall, or a load balancer, whatever it might be. It’s almost like a simplification of the complex system that things run on now.
It’s even worse when you get into things like Kubernetes, and the pod you’re trying to see (that the error was on) has been destroyed and it’s just gone now, and where’s that data… It really starts to get hard. But that’s really what I think it is - it’s just about being able to see in a simple fashion, and as simple a fashion as you can what’s going on… Because either you want to prevent something going wrong, or you’re trying to discover what is wrong.
Yeah, one of the definitions that I heard and I liked is “Observability is more about asking questions that you are not prepared to ask.” With typical sorts of metrics and so on we basically know what we are looking at. We plan so we collect metrics around it, or eventually we learn over time that “Oh, these are some of the failure modes, so we should maybe better collect more metrics around that.”
Observability is a broader approach to be able to utilize whatever you collect in order to be able to answer some of the questions that you’re not prepared to answer.
Exactly. My simple example in the beginning that you don’t need more servers, you need to stop the brute force login attack - it’s that kind of full visibility of the system… Because what you think is wrong may not be what’s wrong. And if you can see all of the moving pieces and components, then you can hopefully see what’s actually happening on your system, and ideally prevent an issue, but also debug an issue.
Let’s take it down one level a bit… So if I’m a Go developer - obviously, we have a lot of Go listeners on the podcast; I’m not sure if you realize that, but… They are going to want to understand not only basically “Hey, I’m a Go developer. Where do I get started with telemetry? What do I measure? How does Go make it easier or harder, or simpler?” Basically, they have these concerns… But in all of our collective experience, does Go make the job of collecting or emitting or whatever we do around telemetry in our projects - does Go make that harder compared to other projects? I’m curious…
I don’t think so. About 50% of our stack is Go. We’re using it exactly in the way that I described to you, and developing products for clients that do it in the same way… I think it’s actually quite easy. It’s very easy to get that data out in an efficient way. Obviously, that’s one of the easiest, the nicest things to do - you can just dump that data out and you don’t have to worry about it affecting the performance of your program; that’s also really nice when you look at things like telemetry… Because you don’t want the telemetry to ultimately become a bottleneck in your platform. That’s why I said UDP, for example, is very popular, because you can just fire and forget. And it’s very easy to do that with Go.
But Go itself in everything has telemetry. When you look at telemetry, we often think “Okay, it’s very advanced measurements around very specific application-focused things”, but your garbage collection is telemetry; how much memory have you freed, how much memory have you allocated, what’s your current usage… All these kinds of things are telemetry, and once you start to monitor that stuff, you start to think of things that you might also want.
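The runtime numbers mentioned here are available from the standard library through `runtime.ReadMemStats`. The sketch below picks a few fields; the `RuntimeSnapshot` type and the choice of fields are assumptions about what might be worth tracking, not a recommendation from the episode.

```go
package main

import (
	"fmt"
	"runtime"
)

// RuntimeSnapshot holds a few runtime numbers worth plotting over time.
type RuntimeSnapshot struct {
	HeapAllocBytes  uint64 // bytes of heap objects currently allocated
	TotalAllocBytes uint64 // cumulative bytes allocated for heap objects
	Frees           uint64 // cumulative count of heap objects freed
	NumGC           uint32 // completed garbage collection cycles
	Goroutines      int    // goroutines that currently exist
}

// Snapshot reads the current values from the Go runtime.
func Snapshot() RuntimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return RuntimeSnapshot{
		HeapAllocBytes:  m.HeapAlloc,
		TotalAllocBytes: m.TotalAlloc,
		Frees:           m.Frees,
		NumGC:           m.NumGC,
		Goroutines:      runtime.NumGoroutine(),
	}
}

func main() {
	s := Snapshot()
	fmt.Printf("heap=%dB total=%dB frees=%d gc=%d goroutines=%d\n",
		s.HeapAllocBytes, s.TotalAllocBytes, s.Frees, s.NumGC, s.Goroutines)
}
```

`ReadMemStats` briefly stops the world, so it is usually sampled on a timer, for example every few seconds, and not on every request.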
We have a client server app, so we output from our Go server system; well, how many people are connected right now? Is that changing? How many requests per second are those people creating? And that’s all just simple telemetry; we don’t even use a third-party library, or anything… We just (like I said) fire and forget a UDP send out of it.
So in my opinion, I would say it’s very easy, but then I think it’s easy to do it in any language. I certainly don’t think Go hurts, and it’s very easy to do it in a performance-sensitive way.
I personally wish that there was an easier way to export… You know, if the runtime was writing to a UDP port by default or something, that would be much easier. A lot of times people learn to care about telemetry at a later time, as you said, and it’s really significant if they were able to just turn on something and collect that data in production, or sometime when they need it.
There’s been a lot of discussions around the standards, I think primarily for this reason, because we wanna be able to address “Oh, how can we make people turn on maybe collection at a later time, and collect as much as possible and utilize it when the user needs it?”
So I think there’s one particular thing that we may take care in the long-term, and that’s this - being able to collect at a later time.
It’s so difficult… Because I agree with everything you said, and then at the same time, it’s a hard problem to solve, because the important metrics in one app are totally different from another. But I do agree that if there was a very easy, accessible, well-documented – you know, the lines of code for the project would probably be very small… But a well-documented source that people could use just as the book on what you should store from a Go app, and what foundation you should start with… I think that would encourage people to not have to go back in time, like you said, and add to it.
[32:23] Yeah, I authored a page at golang.org/doc/diagnostics but it’s never a document that people read through before they push something to production… So maybe we should do a better job explaining the whole production-related issues.
Often a popular package does a better job of getting a readme across than a page…
A package that has a lot of stars, that a lot of people use, you see “Oh, everybody’s using this…” And it can be 50 lines of code, but if it just sets the standard for what you think, then that’s quite a good way of getting that.
That’s such a really good point. The number of times that I just published some packages, very small packages or tools - it’s because it was hard to give the user an entry point… So you just make it a small project, and then people start to like it, and share it, and it becomes more of like a de facto thing. It’s a really good point, that presenting it as a project or some utility tool is a really good way to spread the word.
Speaking of packages, Go has - curiously enough - an expvar package that’s built into the standard library. If one’s curious, or if one’s kind of scratching their heads wondering “Well, that looks awfully like some sort of mechanism where I could be collecting metrics and instrument my Go code, and expose that to something that’s gonna come scrape it”, or something like that… Should folks be looking at that as a starting point for instrumenting their code in Go? What are your thoughts?
Can I explain something about this?
Basically, that package has been modeled after varz. Varz is a convention at Google where you have some keys and values, and you can basically in the binary register any key, and then set a value. The expvar package was very identical to the varz libraries at Google; I think they needed it because some SRE folks demanded it when they were first going to production with Go… But over time, varz turned out to be like “We think that it’s not very scalable”, because people just dump a lot of random things, and then the name space is becoming very complicated, and so on… So they sort of like deprecated varz and switched to a different model.
I think in 2.0 there’s a topic around this, that they’re thinking about deprecating expvar and maybe replacing it with something better, especially if there’s an established standard; or they’re going to reconsider it for Go 2.0. That’s the background story… But you know, we can still discuss if it’s useful for end users.
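For readers who have not used the package being discussed, here is a minimal sketch of `expvar` in the pull style described above. The key name `requests_served`, the helper `recordRequest`, and the listen address are assumptions made for illustration; they do not come from the conversation.

```go
package main

import (
	"expvar"
	"fmt"
	"log"
	"net/http"
)

// requestsServed is published under a hypothetical key, "requests_served".
var requestsServed = expvar.NewInt("requests_served")

// recordRequest bumps the counter; call it wherever a request completes.
func recordRequest() {
	requestsServed.Add(1)
}

func main() {
	// Importing expvar registers a JSON handler at /debug/vars on
	// http.DefaultServeMux as a side effect.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		recordRequest()
		fmt.Fprintln(w, "ok")
	})
	// The address is an assumption for local experimentation.
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
```

Fetching `/debug/vars` from that server returns a JSON document that includes the counter alongside the default `cmdline` and `memstats` entries. That flat key and value namespace is the property the varz comparison is about.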
Yeah, we are by no means the authority on collecting telemetry information. We focus on a very specific sector of application telemetry, and then we process it and report on it all ourselves. But in my personal development experience - not from a large-scale project or anything like that - I’ve found that it’s better to fire and forget telemetry than to expose a telemetry collection point. I don’t know if that’s really where the standard will go… Maybe people will point to this podcast as where I was wrong about what the future of telemetry in Go would be… But you know, exposing a bunch of almost what I would call debug stuff as the solution to telemetry is a bit of a slippery slope… As opposed to saying “This is a metric that we care about for reason X, and we’re gonna send it to location Y, and in the future we’ll use it for various things.” Because one of the biggest parts where you start to learn what telemetry you need and how to use your telemetry is when it actually either helps you solve a problem or doesn’t.
[36:14] If you’ve got an issue and you’re able to see where that issue is through your telemetry, then you learn something… And especially if you cannot see where it is through your telemetry, you learn something. We’ve had that, where we’ve said – you know, we’ve had this performance problem that we’ve ultimately found, and our telemetry didn’t find that, and so we’ve added more tracking in that piece of the code.
I think it’s almost just like a dump out on some HTTP GET that people need to then collect data and pretend to process it in place… It probably doesn’t actually solve the developer problem of making sure that the things get used… But that could just be my personal opinion.
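The fire and forget approach can be sketched roughly as follows, using the statsd counter line format (`name:value|c`) over UDP. The `Emitter` type, the metric name, and the collector address are hypothetical; this is one possible shape, not a description of Snapt's implementation.

```go
package main

import (
	"fmt"
	"net"
)

// formatCounter renders a counter increment in the statsd line format.
func formatCounter(name string, delta int64) string {
	return fmt.Sprintf("%s:%d|c", name, delta)
}

// Emitter (hypothetical) sends metrics over UDP and never reports
// delivery problems back to the caller.
type Emitter struct {
	conn net.Conn
}

// NewEmitter prepares a UDP socket. No handshake happens, so this succeeds
// even when nothing is listening at addr.
func NewEmitter(addr string) (*Emitter, error) {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return nil, err
	}
	return &Emitter{conn: conn}, nil
}

// Count fires one datagram and forgets about it. Write errors are dropped
// on purpose: losing a metric is preferred over slowing the request path.
func (e *Emitter) Count(name string, delta int64) {
	_, _ = e.conn.Write([]byte(formatCounter(name, delta)))
}

func main() {
	// 8125 is the conventional statsd port; the host is an assumption.
	e, err := NewEmitter("127.0.0.1:8125")
	if err != nil {
		fmt.Println("telemetry disabled:", err)
		return
	}
	e.Count("checkout.requests", 1)
}
```

The trade-off is explicit in `Count`: the application states which metric it cares about and where it goes, and accepts that an individual datagram may be lost.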
This is actually a very good topic, and it’s still a very relevant thing - what is the best way to pull metric data, or to make the process push. We currently think that scheduling the pushing is better, because at least the process knows “I can schedule the push.” Even if it’s not just like a UDP fire and forget type of a push, the process has a better chance to run this in the background and just do the push whenever it’s better.
In the pull model, imagine that a server is receiving a lot of traffic, and there’s already a huge workload on the server, and then your monitoring system comes in and tries to pull, and doing a bunch of work in order to just be able to generate all the values of the metrics and present it as an HTTP endpoint, in the Prometheus endpoint fashion - it’s just kind of overloading the process. So instead of that model, it’s much better to push… But you know, this is still a controversial topic, because it also depends on how you deploy your monitoring stack.
I think the pull model came from – Prometheus’ pull model is coming from Borgmon, because at Google initially everybody was deploying their own Borgmon instances… So they’d kind of have more of an overall control. They shifted to more of like a central, globally-scalable type of monitoring stack. The requirement - it’s almost like you don’t have to care about the availability of your monitoring stack at all, and you don’t have to strictly position your monitoring stack or collector with the processes you have… So they had more flexibility in terms of pushing. But they didn’t hit this as a bottleneck initially, because there were other problems such as maintaining your Borgmon instance, and so on.
But if you have a globally available collector, pushing is much easier, because at least the process can tell “Oh, I don’t have much traffic right now. Maybe this is the better time…” Because you know, exporting metrics is important, but it’s not as important as serving the user traffic, right? So giving that flexibility to the process is really important.
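One way to picture a process choosing a better time to push is the sketch below. It assumes a hypothetical in-flight request gauge and made-up limits (100 requests, 3 skipped ticks); the real policy in any given system will differ.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// inFlight is a hypothetical gauge that request handlers would increment
// and decrement around their work.
var inFlight int64

// shouldPush decides whether to export now. It defers while the process is
// busy, but never for more than maxSkips consecutive ticks, so the data
// does not go stale indefinitely.
func shouldPush(load, busyAbove int64, skipped, maxSkips int) bool {
	if load <= busyAbove {
		return true
	}
	return skipped >= maxSkips
}

// runPusher checks on every tick whether this is a good moment to push.
func runPusher(interval time.Duration, ticks int, push func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	skipped := 0
	for i := 0; i < ticks; i++ {
		<-ticker.C
		if shouldPush(atomic.LoadInt64(&inFlight), 100, skipped, 3) {
			push()
			skipped = 0
		} else {
			skipped++
		}
	}
}

func main() {
	runPusher(10*time.Millisecond, 5, func() {
		fmt.Println("pushed metrics batch")
	})
}
```

With a pull model the scrape arrives whenever the collector decides; here user traffic gets priority and the export waits, within a bound.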
Yeah, I couldn’t agree more. That’s the nice thing about pushing - you can go all the way from fire and forget, like I say, which is really nice, because then there’s no headaches around that… But if you go further up, you give the process the ability to decide what’s important and what’s not. If it’s about to fail, it might block to send that message, to say “Listen, we’ve got a serious issue here.” But on the other hand, if it wants to decide that it doesn’t need to store telemetry information right now because the system is overloaded, then it can do that as well… Whereas with collection it’s just a static – it’s almost like you’ve got a cron job, which is [unintelligible 00:39:52.22] and gets a whole bunch of pages, regardless of what’s happening, and you just dump stuff onto those pages.
Yeah, if that answers your question – I think our approach has always been to push the stuff out where possible, and to let the app decide what’s important and what’s not, and how it wants to deliver those messages.
[40:12] Since you’re talking about the UDP, do you have an agent that collects…? What is the collection model like?
For us – so we’ve got two sides of telemetry, really… We’ve got our product, which collects specific telemetry for our ADCs and load balancers and things like that… But then more so I’m talking about for our own internal use, like for our code, and our hosting systems, and all that kind of stuff… And for that, we just have our own – again, that DevOps, hacked everything together… But we have our own collector thing in the middle, that does a whole bunch of various things with that data. And the reason that that happened was because we use it for some of our actual applications, our clients’ telemetry as well, specifically for anomaly detection in it… So it does all of that stuff for us.
But then some of our data - we stream directly out of that UDP fire and forget, and we send straight to Datadog, for example. So we even explored off-platform, some of our shared, SaaS-based hosting things, and then other stuff we keep in product. So we’re exactly that bad example where we kind of built it ourselves.
It’s interesting that a company whose product is collecting and exposing some of that data is actually using another company who’s able to display that… So is this a case of – I’m wondering if this is a symptom of sort of “Basically no one tool or platform that does it all, or that answers all the questions you might have”, so you end up having to pull in a bunch of different things in order to get an overall observability answer?
Exactly. Because we don’t answer all the questions. Our product is sitting at the entrypoint to the network. It’s an ADC, so you’ve got load balancer, security, firewall etc. for the traffic that’s coming in. So we’re reporting very specifically on that information. And that means that we also then need to offer that information out to our clients to integrate with other things… Because if they’ve got a problem in that space - yes, they’ll come directly to our platform and look at their reports on their data, and things like that… But if they’ve got a problem with the app at large, we need to just contribute our small piece of information to their overall telemetry. So it’s quite common for us to ship information off our platform to theirs, or expose it in some way.
Generally, we’ve tried to be as open as possible, especially when we deal with larger enterprises. They have almost all got their own use case, and as wide open as you can make your platform, I think ultimately it’s the best. To the points earlier though, there’s not a lot of standards, so we wind up adding seven different ways of getting data out of our platform, because that’s what’s needed… [laughter]
One of the interesting things that we realized when initiating OpenCensus was a lot of our large customers were dependent on multiple products… And sometimes this is about really trying to get some additional stuff, additional feature from a vendor, and sometimes it’s about the team preferences. In a very large organization, a team is like – they like Datadog, they wanna use Datadog, some other team wants something else… So we thought that having something vendor-agnostic is really the key. You can’t really lock that type of data to a provider; that’s not going to be useful for anyone… So being able to export to multiple vendors was also very important in our case.
I think that’s 100% true. When you look at the more traditional model, you’ve also got multiple stakeholders, who only want certain pieces of the data on their certain platforms. You’ve got IT ops and you’ve got security, and they can run totally separately… So I think that it’s critical. The way we have wound up having to do that is by building it ourselves.
So you talked about on the ADC side of things you’re collecting certain telemetry… Can you share some of the more important ones you feel like you guys are collecting, and where customers have found them to be useful?
Yeah, absolutely. Our newest product is called Nova, and it’s our kind of cloud-native-focused scalable ADC. An important component of that is that we run many ADCs centrally, so it’s like a control plane/data plane model; we are collecting a lot of data from the data plane to display on the control plane… But we had a lot of learnings in our traditional product sample, which is like a standalone ADC.
But what’s interesting is that we’ve tried to tackle it in a very different way. We collect mostly the same data - how many of every type of HTTP reply code are you getting? How many requests are you getting? How many TCP connections? How many TCP connection failures? How many timeouts are there? What’s the reply time? And when you look at the response times, there’s a lot of information there. Like, what was the TCP connect time to the server - is there a network issue? What was the HTTP reply time from the server - is there a back-end issue? What was the response like to the client? How long until we closed that session with the client - is there a front-side network issue?
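The phase timings listed here (TCP connect time versus HTTP reply time) can be approximated from a Go client with `net/http/httptrace`. The sketch below is an assumption about how one might measure them, not how an ADC does it; the `Timings` type and the local test server are invented for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"time"
)

// Timings holds per-phase durations for a single request.
type Timings struct {
	Connect   time.Duration // TCP connect to the server (zero if reused)
	FirstByte time.Duration // request start until first response byte
	Total     time.Duration // request start until body fully read
}

// measure issues a GET and records how long each phase took.
func measure(url string) (Timings, int, error) {
	var t Timings
	var connectStart time.Time
	start := time.Now()

	trace := &httptrace.ClientTrace{
		ConnectStart: func(network, addr string) { connectStart = time.Now() },
		ConnectDone: func(network, addr string, err error) {
			if err == nil {
				t.Connect = time.Since(connectStart)
			}
		},
		GotFirstResponseByte: func() { t.FirstByte = time.Since(start) },
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return t, 0, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return t, 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	t.Total = time.Since(start)
	return t, resp.StatusCode, nil
}

// demo runs measure against a throwaway local server.
func demo() (Timings, int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Millisecond) // simulated back-end work
		fmt.Fprintln(w, "hello")
	}))
	defer srv.Close()
	return measure(srv.URL)
}

func main() {
	t, code, err := demo()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("status=%d connect=%v first_byte=%v total=%v\n",
		code, t.Connect, t.FirstByte, t.Total)
}
```

A slow `Connect` with a normal gap between connect and first byte points at the network; the reverse points at the back end. Hosts that resolve to several addresses may trigger more than one connect attempt, which this sketch does not handle.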
There’s all of these metrics, but what we’ve tried to do - and time will tell if our approach is interesting enough or right enough now… What we’ve tried to do is not put any hardcoded values in for any of those, but rather to do just like anomaly detection and predictive profiling of what we expect the data to look like. Because one of the things is our system autoscales, so it will pre-scale, so it needs to do a lot of prediction off of those numbers. So we’ve wound up in this system where we collect a huge amount of telemetry and we set no hard lines for what should be alerted, but rather just if it changes too much… And so far that’s going well, but I think it’s a little bit odd for some people, because they wanna say “Well, I expect my website to respond in 200 milliseconds, so if it’s ever more than 250, please tell me.” And instead, we’re saying “Well, if it always responds in 200, then we will tell you if it’s 250. But if it doesn’t, then we won’t.”
So all of that type of stuff is your traditional things that you expect, like what’s throughput of the collect, or the request rates, the response codes… Because you can pick up a problem long before by saying “Oh, I normally generate 0.1% errors, and now I’m generating 0.5% errors.” You might not notice that, but it means something’s changed, and it could mean that something’s about to get a lot worse; it could mean that there’s a security issue, and it could mean all of those things.
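The idea of alerting on change rather than on a hardcoded line can be illustrated with a running baseline. This sketch uses a mean and standard deviation (Welford's method) and a three-sigma rule; those are stand-ins chosen for brevity, and the conversation does not say which statistics Nova actually uses.

```go
package main

import (
	"fmt"
	"math"
)

// Baseline tracks a running mean and variance of a metric.
type Baseline struct {
	n    int
	mean float64
	m2   float64 // sum of squared deviations from the mean
}

// Observe folds one sample into the baseline using Welford's update.
func (b *Baseline) Observe(x float64) {
	b.n++
	d := x - b.mean
	b.mean += d / float64(b.n)
	b.m2 += d * (x - b.mean)
}

// StdDev returns the sample standard deviation seen so far.
func (b *Baseline) StdDev() float64 {
	if b.n < 2 {
		return 0
	}
	return math.Sqrt(b.m2 / float64(b.n-1))
}

// IsAnomaly reports whether x lies more than k standard deviations from
// the learned mean. It stays quiet until minSamples have been observed.
func (b *Baseline) IsAnomaly(x, k float64, minSamples int) bool {
	if b.n < minSamples {
		return false
	}
	sd := b.StdDev()
	if sd == 0 {
		return x != b.mean
	}
	return math.Abs(x-b.mean) > k*sd
}

func main() {
	// Error rates in percent, hovering around the 0.1% from the example.
	history := []float64{0.10, 0.11, 0.09, 0.10, 0.12, 0.08, 0.10, 0.11, 0.09, 0.10}
	b := &Baseline{}
	for _, r := range history {
		b.Observe(r)
	}
	fmt.Println("0.11% anomalous:", b.IsAnomaly(0.11, 3, 5))
	fmt.Println("0.50% anomalous:", b.IsAnomaly(0.50, 3, 5))
}
```

No absolute limit appears anywhere: a service that normally sits at 0.5% errors would learn a different baseline and would not alert at that value.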
But by the same token, we will also check for variances between two things. For example, if the average user sends far more GET requests than POST requests, but one user is sending far more POST requests than GET requests - is this a security issue? Are they trying to brute force a password, is this something weird? Is a specific user getting way more 404 errors than everyone else? Why is that? It’s probably some script, or something. So telemetry is often a combination of two values, like “What is this value versus that value?” as opposed to just a single value. So that’s a lot of the stuff we focus on.
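The point about comparing two values can be made concrete with the GET versus POST example. In this sketch the user names, the margin of 0.3, and the minimum request count are all invented; it compares each user's POST share against the share across everyone.

```go
package main

import (
	"fmt"
	"sort"
)

// Counts holds request totals for one user over some window.
type Counts struct {
	Get, Post int
}

// postShare returns the fraction of requests that were POSTs.
func postShare(c Counts) float64 {
	total := c.Get + c.Post
	if total == 0 {
		return 0
	}
	return float64(c.Post) / float64(total)
}

// suspicious returns users whose POST share exceeds the overall POST share
// by more than margin, ignoring users with fewer than minRequests.
func suspicious(perUser map[string]Counts, margin float64, minRequests int) []string {
	var all Counts
	for _, c := range perUser {
		all.Get += c.Get
		all.Post += c.Post
	}
	base := postShare(all)

	var out []string
	for user, c := range perUser {
		if c.Get+c.Post >= minRequests && postShare(c)-base > margin {
			out = append(out, user)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	perUser := map[string]Counts{
		"alice":   {Get: 90, Post: 10},
		"bob":     {Get: 80, Post: 20},
		"mallory": {Get: 5, Post: 95},
	}
	fmt.Println(suspicious(perUser, 0.3, 50))
}
```

Neither the POST count nor the GET count alone is unusual for `mallory`; the signal is in the relationship between them. In a small population the outlier drags the overall share toward itself, so a median across users would be a more robust baseline than the pooled share used here.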
[48:09] The client connects to us, we connect to the web servers and we send their data back. That’s our model… So everything in that communication chain is the telemetry that we care a lot about, because it could mean that there’s a problem with the client servers, it could mean that there’s latency or issues that are affecting the user, or it could mean a security issue… So that’s the type of stuff we need to obviously track for scaling up and scaling down, as well as for alerting the user to problems with their service.
Yeah, it sounds like a very difficult field, especially given the trends of traffic can change, the usage can change… You have to incorporate all of that in order to actually be confident about the detection, right?
Exactly. You know, people tend to use the word “ML” here, right? Machine learning. That’s what they tend to say. But really, it’s just a statistics problem at its core. You’re really just evaluating numbers against other numbers. We do work with some ML type of stuff because of exactly what you’ve said - traffic patterns can change very rapidly. In one minute you could have ten times the traffic that you do in the next minute, but they change in a way that makes sense if you look at all of the data instead of one data point. For example, your throughput will go down in a predictable fashion with your request per second, as will your HTTP 200 replies, as will your POST requests, as will your CPU usage on the server, as will your network latency… And if something there doesn’t decrease at the same pace that something else decreases, then anomalies become very obvious to a system that’s looking at the data as a whole.
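A rough sketch of looking at the data as a whole: compute how much each metric changed between two windows and flag the ones that did not move in step with the rest. The metric names, the numbers, and the 50% tolerance are assumptions, and a median of ratios is only one simple way to define moving together.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// outOfStep compares each metric's change ratio (curr/prev) with the median
// ratio across all metrics and returns those that differ from it by more
// than tolerance, expressed as a fraction of the median. Metrics are
// assumed to be non-negative; for an even count the upper median is used.
func outOfStep(prev, curr map[string]float64, tolerance float64) []string {
	ratios := make(map[string]float64)
	var vals []float64
	for name, p := range prev {
		c, ok := curr[name]
		if !ok || p == 0 {
			continue
		}
		r := c / p
		ratios[name] = r
		vals = append(vals, r)
	}
	if len(vals) == 0 {
		return nil
	}
	sort.Float64s(vals)
	median := vals[len(vals)/2]

	var out []string
	for name, r := range ratios {
		if math.Abs(r-median) > tolerance*median {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Traffic drops roughly tenfold between two windows...
	prev := map[string]float64{
		"requests_per_sec": 1000, "throughput_mbps": 400,
		"http_200": 980, "post_requests": 100, "cpu_percent": 60,
	}
	// ...but POST requests barely move.
	curr := map[string]float64{
		"requests_per_sec": 100, "throughput_mbps": 41,
		"http_200": 97, "post_requests": 95, "cpu_percent": 7,
	}
	fmt.Println(outOfStep(prev, curr, 0.5))
}
```

A tenfold drop in traffic is not flagged by itself, because everything fell together. Only `post_requests`, which stayed nearly flat while the rest fell, stands out.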
Where we’ve had a lot of difficulty is weeding out all of the trash that it picks up… But that’s kind of our value-add to that, I guess… But really trying to find the balance of saying – you know, because the worst thing about a telemetry and analysis and visibility/observability platform is if you generate so many alerts that people start to ignore them; then that’s a total loss. Rather have too few, so that you get the really important ones through… So it’s quite hard; it is difficult, and also especially as we are scaling. We’re trying to pre-scale based off of that information. So it’s quite a balancing act.
One of the biggest learnings for me personally with telemetry, something that I’ve learned from our team is that things start to make a lot more sense when you’re looking for anomalies in sets of measurements, instead of individual measurements. I think that’s been a big, core design factor for our platform.
I think that all makes sense though, because anybody who’s ever been on pager/had a pager or anything for a product knows that when you get paged for too many anomalies, you’re basically to this point where you just assume “I’ll wait till it does it again, to see if it’s actually a real thing.” And when that’s happening, it’s like “Okay, that defeats the whole purpose of the system we have in place”, because people are ignoring things… But then you also mentioned, you will get some – if you’re working on anomalies, you can get some spikes in traffic…
I remember one of the ones that stuck out to me was I was helping with Google Code Jam, and it’s one of those competitions where everybody logs in at the exact same time, because that’s when the competition starts… And I believe at the time the way that they were doing some of their monitoring stuff was basically the same thing - look for anomalies. So the guys who were setting it up basically knew that you sort of had to warm up the servers ahead of time. So it was this weird thing where you’re like “What are you doing?” and he’s running a script to sort of get the server used to this request load coming in… And it was just because that was the simplest way to ignore that anomaly, because you knew it was coming… But it really tended to happen during very specific things.
If you have a timed event, and that sort of traffic spikes, then it becomes very challenging… And I think that’s probably also – you see video games and stuff like that that have a launch date, and I think they have to deal with that type of problem pretty heavily, where it’s hard to detect an anomaly when everything just skyrockets all of a sudden.
[51:56] Yeah, exactly. But a lot of the time an anomaly can be informative. I think that’s also up to the team that gets them, to make sure that they do the right thing. If our website gets ten times the views after this podcast, I’m happy to be told. It’s not gonna go offline… But you know, sometimes informative telemetry is not necessarily a problem, but yeah, it can reach the point of spam, which then people start to ignore, which is a big problem… But you know, it’s a balancing act. With alerts and with anomaly detection it’s all about balancing it; you wanna make sure you pick stuff up.
The problem at our scale becomes so vast, because – let’s use a use case. Let’s say a banking client of ours - they might have systems in 20 different countries. Now, how many failed logins per second do you think they get? It could be 500, it could be 1,000 that they get per second… So if they’ve got 10 more, or 20 more, or 50 more, it might not detect an anomaly. But what if they get 50 more in all of their locations around the world, all from the same country? Is that a problem? Probably it is. So sometimes – that’s the funny thing about telemetry, people tend to zoom all the way in.
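The banking example can be sketched as an aggregation across locations. The `Sample` type, the location and country labels, and the thresholds below are hypothetical; the point is only that a small excess that is invisible at each site becomes visible once grouped by origin.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a hypothetical per-location observation: how many failed
// logins above the usual rate arrived from one source country.
type Sample struct {
	Location      string
	SourceCountry string
	ExtraFailures int
}

// widespread returns the source countries whose excess failures reach at
// least floor in minLocations or more distinct locations.
func widespread(samples []Sample, floor, minLocations int) []string {
	seen := make(map[string]map[string]bool)
	for _, s := range samples {
		if s.ExtraFailures < floor {
			continue
		}
		if seen[s.SourceCountry] == nil {
			seen[s.SourceCountry] = make(map[string]bool)
		}
		seen[s.SourceCountry][s.Location] = true
	}

	var out []string
	for country, locs := range seen {
		if len(locs) >= minLocations {
			out = append(out, country)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	samples := []Sample{
		{"eu-1", "country-a", 50},
		{"us-1", "country-a", 50},
		{"ap-1", "country-a", 50},
		{"sa-1", "country-a", 50},
		{"eu-1", "country-b", 50}, // shows up in one place only
	}
	fmt.Println(widespread(samples, 40, 3))
}
```

Fifty extra failures against a background of 500 to 1,000 per second would not trip a per-location alarm, which is why the grouping happens before any threshold is applied.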
We were talking about trace IDs, and “What is the individual request?” A lot of the time that’s very important, but sometimes it’s actually really important to have that 10,000-foot view, where you’re just like “What is the lay of the land? What does it all look like as a picture?” And that’s also something that’s not that easy to do now. There’s not a lot of standard stuff for that, or just best practices, like “How do you set up your dashboarding, if you’re using Prometheus, or if you’re using Datadog, or whatever it is…?” Using that is like a big failing, I think, in DevOps teams and traditional teams today. It’s making sure that you always go back to your telemetry and say “Why didn’t this tell us about this problem before it happened?” There should be almost like a root cause analysis of the issues… And it doesn’t have to be this fancy process, but just going “Why were we not aware of this, now that we understand what it was?”
Yeah, that’s such a really good point… Especially large teams, large companies - if they haven’t thought about telemetry in the beginning, they wanna introduce it at a later time, but they don’t know where to begin. Anomaly detection really helps them to explore the area as well. It’s not super-obvious to you, but you can maybe run it just to see and explore all these edge cases, and some of the critical things, the correlations… It may actually help you to explore what you need to take a look at, even if you end up having an SLO type of approach in the end.
Yeah, you said the most important word, which is correlations… And a lot of the time it’s not obvious to the human eye, but it can really help when you’re trying to scale systems. The nature of scale has changed so much now. You can scale up easily nowadays in the cloud, or in containers, or whatever, but the challenge is that the languages that we write in are so high up the stack that a lot of the time diagnosing the bottlenecks or the performance issues of things can be very hard… And being able to put two data points together and understand that that’s why something is slow, or that’s why it’s never gonna scale, even if you put 3,000 servers behind it - that can be helped a lot by anomaly detection.
Yeah, this is more art than science, it sounds like…
It really is. I read this funny thing the other day about the difference with developers - some are artists and some are engineers, or whatever it was… But it is like an art, because you really need to say “What are the what-ifs, and what can I just store and see what I get out of it?” It is experimental, I think.
Okay. One last question for you before we go to our next segment… Based on what you’re seeing so far and how your customers are, or what you’re seeing in terms of the data that you can glean from how your customers use your product, are most of the things that trigger something to look at, something that’s important to look at, of the anomalies - are most of those triggered from internal sources, meaning that the developer is pushing new code, making changes that’s causing issues, or are those coming from the outside? Maybe there’s somebody who’s trying to brute force their way in, or maybe the company just got listed on some popular website or something, and maybe there’s a surge in traffic. Generally speaking, where are the biggest sources of problems?
[56:25] Generally speaking – the answer I wanna give you as well is “Well, it’s both.” But let’s pin me down for an actual answer…
If my back was against a wall, I’d say usually it’s the servers and the apps that fail, in a condition which the team has struggled to test. Testing things nowadays can be very hard. Take our platform - we need to test ten million active connections, ten million active devices connecting to our platform. How do we do that? We’ve got six Kubernetes servers that are running on 2,000 machines, and it’s still a nightmare. So you get these systems where people scale things up, and where people put things in auto scale groups and everything, and ultimately there’s still some bottleneck and things fall over that they just couldn’t test. It’s like Black Friday; if massive e-commerce sites can fail, I assure you yours can too, in this unpredictable load.
So the reality is that most of the time it’s that that’s failing… But what’s interesting is that it’s often easy to predict that it’s gonna fail, and allow them hopefully time to correct for it. To predict that page load times are slower than normal, that traffic is higher than normal at this time, on this day, we’ll go all the way down to DNS queries. If there’s way more DNS queries coming in than we normally have coming in for the site, versus the requests per second - all these kinds of things…
So usually an issue is downtime on the upstreams; the actual origin servers for the API, for the website, for the e-commerce store, whatever it is. But the cause, the reason that it happens - it will often be a burst in traffic, or something unexpected, or some new feature that gets rolled out, or a change in the database system. Someone upgrades from one SQL to the next SQL version, and the query cache is now no longer one gigabyte default, instead it’s now zero by default, and the whole system falls apart… You know. But you can start to see that, because like at two in the morning actually the page load times got worse. And if someone could see that and say “Hang on, at six the page load times are worse. What happened last night? It won’t fail over at 9 when the traffic starts.”
I think that’s the beauty of telemetry, is understanding those unknown changes. And you upgrade your SQL server, you go to the website, everything works, and you think “Phew… It’s working.” You don’t know that there’s been a 25% page load time decrease, because you can’t feel that… But when you get hit by 100,000 requests per second, you feel it big time.
Jon, would you like to introduce our guest to our next segment?
Sure. Mat started this segment called Unpopular Opinion… And I think right about here they put in some little riff…
Basically, the idea is we want you to share an unpopular opinion you have, preferably in tech, but it doesn’t necessarily have to be… With the goal being to just share with listeners that not everybody agrees with the really popular opinions, everybody has different things that they disagree with, and wanna share.
Yeah, absolutely. Well, I would be remiss if I didn’t say this, but I’m vegan, and that’s pretty unpopular… [laughter] But if we’re talking about tech, the biggest thing that I have made the mistake of myself, and that I see a lot of small companies doing - you know, I work with a few startups, or helping people, a lot of our earlier stage companies that join… We have a community edition, which is free, so we get to communicate with a lot of these guys pushing the boundaries of things that they are doing, and get in touch with them… One of the things that is probably an unpopular opinion is that I think that startups and a lot of people are writing code in the wrong languages, almost all the time.
[01:00:27.06] So they should be writing in Go, is what you’re saying. They should be using Go.
Well, it depends. How many Go devs do they have on their team? So that’s the point, right? That’s the point. I think people at young companies choose the language based on how trendy and how cool and how high-performance it can be. But no one really wants to maintain a wiki that’s been written in Erlang. And a lot of the time, people are not worrying about how easy is it to hire talent for this, how easy is it to scale this? How well-known is this in the developer scene and in the market?
I think much more likely your app is slow because your code is bad than because the language you wrote it in is bad. When you get to that point, then you’re past that struggle… But this tendency to always chase the latest language I think gives people business scaling problems, and it’s very difficult to get talent for it, and it’s very difficult to build an engineering team around it… So yeah, I would say that I think people are often choosing the language that they use incorrectly.
I don’t actually think that Go is an example of that, because it’s one of the ones that I think is very easy to pick up, and to learn, and to get resources on, and to find people that are playing with it, for whatever reason… Like, it’s done well to get a community. But a lot of people will just write in whatever the last podcast or webinar they watched was using… And I think that’s a mistake.
I can definitely relate to that. I’ve seen it go the opposite direction too though, where the general advice is “If it’s gonna be three of you, you’re probably gonna work on this thing with just the three of you for six months to a year before you can really afford to hire.” Maybe not always, but a lot of times that’s the case… So it’s like, in that six months to a year, how are the three of you gonna be the most productive? So you kind of pick a language based on that. And I say three - it could be one person, two people, or however many people.
I’ve seen companies that start with really old languages as a result, and – what was the one…? I think it was Perl; they used Perl, and the only real issue with that… They were productive and they got a lot of stuff done, but I think they struggled later, like you said, with hiring, because when it comes to hire later, you’re like a trendy startup, but everybody looks at the language you’re using and they’re like “Yeah, that’s not exactly my first choice…” Nobody really wants to spend time learning a language that is probably not gonna benefit their career in the future.
If you’re learning Go, you’re like “Okay, this is gonna at least benefit me in the future if I find other companies that are picking it up.” But if you’re learning a language that’s not dead necessarily, but it’s not growing, then it’s a little bit different, too. So I guess I see what you’re saying, but I guess I’d also take caution on the other side of it and say “Don’t use something that’s also gonna cause problems because it’s so old, or you just know it so well. Even though you know it that well, it might still present issues.”
Yeah, you’re exactly right. The crux of my point is make sure that it’s easy to hire people with that language… Which I think is exactly what you’re saying - there is a balance between something that’s growing steadily, that’s got a lot of acceptance and people talking about it, and it’s also actually being used at companies, that people have production experience with it… Because just because someone understands a language doesn’t mean that they know how you build a massive app and keep this thing online and actually deliver it in that language, and maintain it.
Frameworks are a good example. “Stay away from frameworks, because we must write this thing as bare as possible.” Write it in Assembly then. If a framework is gonna make your team of three get five times more work done in the first six months that you’ve got to get your MVP out, then use that. I would say find that middle path, but I think people are far too far forward and you need to caution them to go backwards.
[01:04:15.26] Yeah, unfortunately I think that’s a problem that comes with all – I’m trying to think of how to word this… Basically, how you deploy things can also present issues, where everybody wants to use Docker and Kubernetes, and stuff… But I remember when Google’s App Engine was fairly new; I knew a couple of people who wrote a lot of stuff in Python, and they were like “Well, let’s go to App Engine. It’s going to auto-scale for us.” And their first project they really struggled, because there’s a lot of specific things you kind of have to learn about App Engine. Once they figured that all out - their second project, they would have flown on App Engine, but their first project was really a pain in the butt to deal with all these blockers that really shouldn’t have been there.
Yeah, you bring up a second unpopular opinion of mine. I mean, maybe it’s not unpopular, but… I don’t think that containers and Kubernetes and cloud-native is a destination. When we were all on tin, everyone said “Okay, everybody is gonna be on VMs.” And then we were all on VMs, and now everybody is gonna be on the cloud. And that’s where it started to get shaky, because not everything did move to the cloud… And now this idea that the next step in that evolution is containers and cloud-native I think is wrong.
I think there are workloads that are excellently suited to that, there are workloads that are suited well to serverless, but there will always be workloads that are suited to tin that is within a mile of your house. I think it’s a spectrum now. It’s like, stop trying to make a round peg fit in a square hole. Not everything has to be deployed into containers, which is exactly to your point. It’s often the easiest thing, the thing that is used by the most people, that turns out to be the best decision.
I feel like the cloud is becoming more like the programming language industry; you have to introduce a new product or a new abstraction layer in order to get the attention of people. Maybe I feel like partially why we had so many different solutions was just because people wanna make some noise about it.
Yeah, you couldn’t be more right… And it’s interesting how the simple clouds are doing better, are growing quickly now. It’s something that we actually see in our business, because we compete with commodity load balancing in clouds, with ELB, or LB, or Azure’s Gateways, or whatever it might be; every cloud’s got a load balancer, so we compete with them. But what’s interesting is that people wanna be cloud-neutral now. So they wanna be able to say “Yeah, I’m in GCP now, but I can shift that to Azure etc” and they actually wanna use less and less of the commodity, proprietary cloud stuff, and try and stay neutral. So they’re delivering more and more features to keep on everyone’s tongues and keep talking about it, but I think people are steering more and more away from using one specific infrastructure provider’s solution.
Hey, new features bring new sales, man… You know, you’ve gotta factor that in. [laughs]
Look, we add new features all the time. I feel you. [laughter] Do what I say, not what I do.
[01:07:13.03] [laughs] Awesome, awesome. So in general then, we can say that orchestration and scaling have generally become easier. Telemetry is one of the hard problems still remaining, that a lot of people are trying to solve, your company included… And as we explored today with you, Dave, we know that there’s no one-size-fits-all. But at the very least, it’s advisable that everybody starts with something; that should be considered – having some form of telemetry that provides some form of insight into your workloads is at minimum required to be considered production-ready, to some degree.
Yeah. I mean, when you start working, if you put in a comment that says “To-do: add telemetry to this, because it’s a bottleneck”, then that’s fine, too. Just start thinking about it, and the rest will come.
Nice. Hopefully, you do a little more than just think about it, but… [laughs] Yeah, indeed. Thank you for joining us, Dave, and thank you to my co-hosts, Jon and Jaana. I am Johnny Boursiquot, and I’ll catch you in the next Go Time.
Our transcripts are open source on GitHub. Improvements are welcome. 💚