The Changelog – Episode #356

Observability is for your unknown unknowns

with Christine Yen

Christine Yen (co-founder and CEO of Honeycomb) joined the show to talk about her upcoming talk at Strange Loop titled “Observability: Superpowers for Developers.” We talk practically about observability and how it delivers on these superpowers. We also cover the biggest hurdles to observability, the cultural shifts needed in teams to implement observability, and even the gains the entire organization can enjoy when you deliver high-quality code and you’re able to respond to system failure with resilience.

Sponsors

DigitalOcean – The simplest cloud platform for developers and teams. Whether you’re running one virtual machine or ten thousand, DigitalOcean makes managing your infrastructure too easy. Get started for free with a $50 credit. Learn more at do.co/changelog.

GoCD + Kubernetes – With GoCD running on Kubernetes, you define your build workflow and let GoCD provision and scale build infrastructure on the fly. GoCD installs as a Kubernetes native application. Scale your build infrastructure elastically. Learn more at gocd.org/kubernetes

CrossBrowserTesting – The ONLY all-in-one testing platform that can run automated, visual, and manual UI tests – on thousands of real desktops and mobile browsers.

Strange Loop – A conference for software developers in St. Louis, MO. covering programming languages, databases, distributed systems, security, machine learning, creativity, and more! Sep 12-14, 2019 / Oct 1-3, 2020 / Sep 30-Oct 2, 2021

Notes & Links

“testing is for known knowns, monitoring is for known unknowns, observability is for unknown unknowns” – Jez Humble

Transcript

Christine, in your Twitter bio it says that you miss writing software, and that makes me kind of sad…

I am kind of sad about it.

Write some software. You've gotta write some software.

I know, I know. At this point it's like "Okay, what weekend art projects can I have some excuse to do that for?"

I think this is the nature of early-stage startups. The sorts of things that made sense when you were two people and you needed to get something off the ground no longer make sense when you're 25, and you need to start thinking about how business is doing, and… You know, delegation is a skill.

It's a learned skill, lots of times.

It absolutely is a learned skill.

It does not come naturally, especially to perfectionists, or… A lot of times people writing software are like "It's my code, it's my software, it's my thing." It's tough to let go of that and trust other people.

One of the best arguments - our director of engineering was like "Christine, you need to stop rogue-fixing bugs at night." I was like "But why?!" She was like "Because when you do it, it doesn't let us improve our process to make sure that things that you care about get fixed." That is a great argument that I can get behind.

Successful teams are often built on successful systems and processes, so you definitely have to give room for that to take place… Otherwise, if you're just fixing the problems and the problems aren't fixed by the team, then it's kind of hard to build a strong culture, which I'm sure is important in those contexts.

Definitely.

We're excited for your talk at Strange Loop… We're excited for Strange Loop, because we've been trying to get to Strange Loop for a couple of years. I think I off-handedly say that to you a lot, Adam… Like, "Hey, let's do Strange Loop this year."

For like four years now.

Yeah, I think there was an OSCON conversation we had a couple years back on the show, where I even said "We should go to Strange Loop", and then we still haven't gotten there… But we're gonna be there. We're gonna be there this year, September 12th through the 14th - Adam, check my work there; I believe that's correct on the dates - and we are working with Strange Loop to invite everybody out.

[00:04:07.22] It looks like an excellent conference; we'll be there, come see us, come say hi. I was looking at some of the sessions, and yours jumped out to me, Christine - Observability: Super-powers for Developers. That just makes you stop and think, "Hey, I want super-powers." Who doesn't want super-powers, right?

Yeah, I had a lot of fun coming up with that title, because it really seemed to capture a lot of the thoughts that have been kicking around my head and our Honeycomb life for the last two years… Namely in that observability is this thing that so many people associate with ops people, or SREs, the hardcore people that carry pagers. And it's true, observability is something that those folks care about, but it's actually more powerful for the folks who are writing the code, and the people who are sometimes in ops folks' minds causing the problems… But it's taking this thing that is associated with like fighting fires, and being like "What if you bring it earlier in the process, and what can it supercharge? What could people do that they couldn't do before, because now they have the ability to see into their systems?"

It helps that I am a huge Marvel and superhero genre fan… I've never actually done a talk before where I was able to pull in so many pop culture references…

Right…?

…and end up – what would have taken normally an hour to sit down and work on something would end up taking like two hours, because I'd get side-tracked on Wikipedia, and Google Image search rabbit holes, looking for that perfect image… Anyway, it should be fun.

Let's get to the super-powers in a minute, because you mentioned something there which we've read is something you care deeply about, and something that Adam and I have talked about a little bit and touched on, but haven't gone deep into… You talk about the ops folks and the dev folks, and how those are different folks, lots of times. And there's a cultural divide - it seems; maybe not always, but generally speaking - between those sets of folks on teams or inside of businesses because of one of the things you said there, which is like the devs are causing the problems that the ops people have to deal with, and one group is on pager duty and the other group isn't necessarily… And there's a gap there, which is something that needs to be addressed, because that's not a good way to be on a team; that's like us versus them. So this is something you care about… Will you share thoughts on that divide and what we can do about it?

Absolutely. To zoom out a little bit, I'm working on a company called Honeycomb with my friend and co-founder, Charity Majors… And in many ways, she embodies the ops stereotype; she's been ops for many years, she says she's carried a pager (she's been on call) since she was 17…

That's a long time.

Long time… Whereas I have much more of a product development background, where I'm like "I wanna build stuff that users touch, and feel, and improve their life…" And before Honeycomb we actually worked together at a company called Parse. It was a mobile backend as a service, and one of the tenets of our engineering culture was that everyone did a day of support. We'd rotate through, and no matter what, you were the one in the email inbox, answering questions for customers.

And what this meant is the people who were writing the software, like me, were always super-aware of ways our software sucked, or ways it was unclear, things that were confusing… And there was a really tight feedback loop between what users saw and the work that we did, whether we were the ones actively writing the code or the folks maintaining the systems.

[00:08:05.00] And even there though, there was this element of – I was building the analytics part of the product, writing things, and this new exciting part of the system, and we talked ops and sort of planned out how we wanted to scale… And it'd go live, and inevitably people would come knocking at my door, being like "Hey Christine, something happened to the write throughput on our Mongo cluster or Cassandra cluster. What do you know about it?" And I'd kind of look around and be like, "Um… I don't know. Write throughput on Mongo… Hm. That's a great question." Eventually, we'd work together and track down what had happened in the code.

And I think through that experience again, because we had this feeling of always being on the same side and kind of working to support our customers, we became very aware of the different types of skills that go into building a system that is resilient for your customers, and how much better things were when we were looking at the same information.

I think of those days of seeing write throughput is up as sort of the past, and actually our post-acquisition time as the future, in that when we got acquired by Facebook, we were exposed to an internal tool called Scuba, which was the predecessor of Honeycomb… Which allowed for a lot more flexibility in interpreting impact on the system in the terms that I as a developer understood. So instead of "Hey Christine, WTF? Something you did changed the write throughput on this database", it would be "Hey Christine, latency for serving this particular type of request went up on this endpoint for our largest customer. Does this sound familiar?" And those are the entities, those are the nouns that made it a lot easier for me to understand how the code that I wrote impacted production… And really that's the sort of thing that ops folks have innately, that developers have to almost learn, especially in a world where boundaries between dev and ops are blurring.

Developers can't start to adopt that ops sensibility until they see cause and effect. "Oh, whenever [unintelligible 00:10:37.27] of this code for this type of production workload, this is what happens. These are the signals to look for. These are the things I can start to work to prevent or watch out for in my code."

One of the taglines that we've played with, or one of the phrases that we've liked, especially in this realm of observability, is that it allows you to test in production… Which I know means a lot of things for–

Yeah… [laughter]

[unintelligible 00:11:00.28] Feature flag folks are using that, but… I like it in the context of observability, because - what are you doing when you test? You compare actual versus expected. And a lot of ops folks, with their monitoring setups, that's what they're doing. "I expect CPU to be within this threshold. Actually, it's over here."

And the more those signals can be framed – in the same way it'd be like "I expect latency to be here, and I expect to be able to handle 2,000 requests per second for this customer", compared to actual, and tie it back to the code that I write… Boy, that's a really good feedback loop, and a really virtuous cycle for developers being able to ship better code in the first place.

[00:11:53.18] Are there a lot of devs out there that aren't in the know? Is it common for developers to just not see that side of things?

I think it's so easy. I think it's so easy to write code based on what you think is normal, or what should be true, without actually verifying it. Adam, I know you said you have a background in product management, and this is – to say this nicely, being able to verify for yourself what is happening in production sometimes almost lets developers side-step that product management intuition, or it lets you develop your own intuition based on reality, or it helps you supplement the more qualitative research product management perspective with "But this is actually happening. This largest customer is actually sending us this volume of data", or "We assume that people send us payloads of this type, but they are instead sending us payloads of another."

Even just talking to folks at various tech conferences, there's lots of developers who are like "Oh, I write code according to spec, and I write my tests, and I ship it. When things go wrong, it's just something in the infrastructure, not my code." I think that's a mindset that's slowly changing over time.

So that's coming at it from the developer's perspective. That's a technological solution, in terms of observability into the way that this will perform into production, or the way it does perform in production in real life, allowing them to tie back to their code. What about from the ops perspective? Because you're bringing basically the developer closer to the ops side… Is there any effort to bringing the ops people closer to the code, in terms of "Why can't the infrastructure person go back to the lines of code that are affecting this and analyze that?" Is there movement in that direction, or am I in left field?

I think there's some movement in that direction. I actually think of the movement from ops over to dev as being something that is almost part of the broader dev ops transformation movement/migration. Getting ops folks to get more comfortable with automation and code as a way to do their work is something that I feel like has been happening over the last 5-10 years already… And to some extent, that makes ops folks/SREs more willing to get their hands into the code itself…

But on teams of a certain size there's always gonna be folks who are a little bit more comfortable – or there are gonna be folks who are largely producing the code, versus the folks who sometimes stick their hands in to make sure there's instrumentation in place, or to test something… And certainly from my perspective, I'm more interested in pulling the folks who are focused on shipping to be like "Okay, ship faster, but also be aware of what you're shipping, and how what you're shipping is behaving."

Can we actually break down what observability means? It's like this buzzword…

I've got log files, right? Everybody has log files; there you go. You look through your logs… Done.

Right, exactly. [laughter] Just look at your log files. What exactly is observability in the context of these super-powers, and Honeycomb, and this context?

I define observability as the ability to ask new questions of your systems, ideally without deploying new code. I'll break that down… Being able to ask new questions. What this means is if you look at traditional monitoring systems, often you are defining some sort of dashboard and you're saying "I want to know what the average latency of my system is, or the total throughput of requests." So you take that and you put it in the dashboard and you put it on your wall, and it just stays there… And that is a question that you have asked, and that is the answer to your question, and it very rarely changes.

[00:15:47.01] Part of the reason observability has grown in popularity the last five years is really that our systems are now evolving to a point where you can't just predict the one or two questions that will be important, and put them on a wall, and have that be enough. You need to be able to ask questions like "Okay, have average latency up there, but what is the p95 of latency for customers fitting this profile? Or that one customer over there. Or what is the average latency if I remove requests that touch this database I know is slow?" The ability to ask these freeform questions is becoming more and more critical to being able to support these more and more complex systems we've been building.
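To make the "freeform questions" above concrete, here is a minimal sketch in Python. It assumes each request has been captured as a small dictionary of fields; the field names (`plan`, `db`, `latency_ms`), the sample data, and the helper functions are all hypothetical and are not Honeycomb's data model or API. The point is only that the same stored events can answer a question nobody planned for, by changing a filter rather than deploying a new metric.

```python
import math

# Hypothetical request events; every field name here is an assumption.
events = [
    {"customer": "acme", "plan": "enterprise", "db": "primary", "latency_ms": 100},
    {"customer": "acme", "plan": "enterprise", "db": "legacy", "latency_ms": 900},
    {"customer": "acme", "plan": "enterprise", "db": "primary", "latency_ms": 120},
    {"customer": "bob", "plan": "free", "db": "primary", "latency_ms": 80},
    {"customer": "bob", "plan": "free", "db": "legacy", "latency_ms": 700},
]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def latencies(evts, predicate=lambda e: True):
    """Latencies of the events matching an arbitrary, ad hoc predicate."""
    return [e["latency_ms"] for e in evts if predicate(e)]


# "What is the p95 of latency for customers fitting this profile?"
p95_enterprise = percentile(latencies(events, lambda e: e["plan"] == "enterprise"), 95)

# "What is the average latency if I remove requests that touch the slow database?"
fast_only = latencies(events, lambda e: e["db"] != "legacy")
avg_without_slow_db = sum(fast_only) / len(fast_only)
```

A real system would run these filters inside a query engine rather than over an in-memory list, but the shape of the question is the same.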

And the reason we've found ourselves drawn towards this new word is that there is almost a split between the things that are stable enough to monitor… CPU utilization - maybe it's nice to know, but it's not gonna change that much. That's the sort of question you can monitor; you can put it on the dashboard, whatever. Things like "Well, what's happening for this customer? Why do our servers look down for them?" That's a much broader question, it's much fuzzier, where the answer to that might be different whether I'm looking today, tomorrow, or next week.

Someone came up with a phrase that I really like; I can't remember who it is right now, and I'll get you notes afterwards… But what they've said is "If testing is for known knowns, where you're trying to capture known behavior and immortalize it, and monitoring is for known unknowns - you know you might care about CPU, but you don't know what it is at this point - observability is for unknown unknowns." And I love that, because this idea of unknown unknowns really does, again, provide the perfect flipside to testing (a known known). With observability you're like "Well, something will go wrong in my system, I just have no idea what it is or where to start looking, and I need a tool that will work with that uncertainty and work with that flexibility, rather than hemming me in to the questions that I thought to ask ahead of time."

That last part of the definition of observability where I tacked on a "without deploying new code" is important to include… Because lots of folks can say "Well, I can ask any question I want of my monitoring system. You just add a new metric, and then deploy it, and then it's there." But that whole act of having to add that new metric and deploy it…

It's too late.

It's too late, and sometimes it's not even scalable, right? Say you have 100,000 customers; you just can't track 100,000 metrics easily. Caveats - you throw money, or hardware, or something at it, and maybe it'll work… But there's an element of "Okay, something is happening now, and I need to sort it out now", that I think we really now are able to capture, and this concept of observability is an ability to do this thing. Not the type of data, not a specific tool.

So is it just collect all the data, all the time, kind of thing? Or is it collect all the things and then ask questions because you've collected all the data, essentially? …you've monitored, you've logged every possible thing to enable yourself to ask those questions, the unknown unknowns of the future.

I think it is a lot in line with "collect all the data, all the time", but we being engineers, we know that that's a recipe for something that is itself unfeasible and unscalable… Something we at Honeycomb like to talk about is "Capture the data that you think will be important to your business. Capture the data that are going to be helpful in tracking down the issue." There's a couple things here, and I'll break that down…

First, when I say "Capture all the data" or "Capture data that is necessary", I mean capture all the context around things that are happening in your system. This is, again, in contrast to more traditional metrics and monitoring. In metrics and monitoring it's very common to be like "Okay, let's just increment this counter when requests come through."

[00:20:00.29] From the observability perspective we say "Oh man, if you're only capturing a counter, you're losing all this interesting context and useful metadata around what sort of requests they were, and who issued them, and what the requests were trying to do, and how long they took, and then how long they've spent in the database, and how long they've spent rendering, and how long they've spent doing these other things."
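The contrast between incrementing a counter and capturing context can be sketched as follows. This is an illustration only: `run_query`, `render`, the event field names, and the `emit` callback are hypothetical stand-ins, not part of any particular vendor's SDK.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


# Hypothetical stand-ins for the real database and rendering work.
def run_query(request):
    return [{"id": 1}, {"id": 2}]


def render(rows):
    return "<ul>" + "".join(f"<li>{row['id']}</li>" for row in rows) + "</ul>"


counters = {"requests": 0}


def handle_counter_only(request):
    """Metrics style: we learn that a request happened, and nothing else."""
    counters["requests"] += 1
    return render(run_query(request))


def handle_with_event(request, emit):
    """Event style: one record per request, carrying the surrounding context."""
    event = {
        "endpoint": request["endpoint"],
        "customer_id": request["customer_id"],
    }
    start = time.perf_counter()
    rows, event["db_ms"] = timed(run_query, request)
    body, event["render_ms"] = timed(render, rows)
    event["status"] = 200
    event["duration_ms"] = (time.perf_counter() - start) * 1000.0
    emit(event)  # in practice this would hand the event to a telemetry pipeline
    return body


captured = []
handle_with_event({"endpoint": "/query", "customer_id": "acme"}, captured.append)
```

Once a record like this exists, new fields can be added to the dictionary later without changing how it is emitted, which is what lets the set of answerable questions grow over time.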

So context plays a big part because those are the bits that are going to be necessary for the unknown unknowns, for tracking down the things that went wrong.

Another dimension on the "capture everything all the time" - "all the time" does not necessarily mean you should be capturing information about every single request. I think for many folks, especially the folks who come from the logging world, sampling is a little bit of a dirty word… Like "Oh no, you can't sample! How am I ever gonna capture the low frequency events that are important? You're asking me to throw away data? No, I can't." And while yes, storage has gotten much cheaper, and we could store everything if we wanted, ultimately the model of using logs to capture a historical record of everything that happened made sense when logs were human scale, or our software systems were human scale; it made sense to have a human with their eyeballs, reading through log lines of what happens.

One of the truths is that our systems are no longer like that. Logs are no longer human scale, they're machine scale, and as a result, we can start to do things like sample intelligently and capture just enough to gain a sketch of what's happening in our system in real time. For example, if you care a lot more about errors, then capture 1% of all successful requests and 100% of anything that hit a 400 or 500.

Maybe you have certain customers that you care about, or certain customers that you know you don't care about, because they're high-volume - great, sample that down. All of our tools now are capable of doing this sort of statistical analysis and statistical compensation for these more complex sampling rules, and they can allow us to manage the volume of overall data while not having to miss out on that rich context that actually allows us to answer questions and solve problems in our system.
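A rough sketch of the sampling rule described above (keep 1% of successful requests and 100% of errors), together with the statistical compensation that makes the sampled data usable. The function names and the `sample_rate` field are assumptions made for illustration; real tools implement this in their own ways. Each kept event records how many original events it stands for, so counts can be re-weighted at query time.

```python
import random


def sample_rate_for(event):
    """Hypothetical rule: keep every 4xx/5xx, keep 1 in 100 successes."""
    return 1 if event["status"] >= 400 else 100


def maybe_sample(event, rng=random):
    """Return a copy of the event tagged with its sample rate, or None if dropped."""
    rate = sample_rate_for(event)
    if rng.randrange(rate) == 0:  # true with probability 1/rate
        return {**event, "sample_rate": rate}
    return None


def estimated_count(kept, predicate=lambda e: True):
    """Estimate the original event count by weighting each kept event."""
    return sum(e["sample_rate"] for e in kept if predicate(e))


rng = random.Random(42)  # seeded only so the sketch is repeatable
traffic = [{"status": 200}] * 9900 + [{"status": 500}] * 100

kept = []
for event in traffic:
    sampled = maybe_sample(event, rng)
    if sampled is not None:
        kept.append(sampled)

# Errors are never dropped, so this estimate is exact.
estimated_errors = estimated_count(kept, lambda e: e["status"] >= 400)
# Successes are estimated: each kept success stands for 100 originals.
estimated_total = estimated_count(kept)
```

The estimate for successful requests will vary from run to run around the true value, while the rare, interesting events survive in full.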

Break

[00:22:39.11] So you explained that sampling is logical; it also is counter-intuitive, because you have all the people who are like "Well, if I sample the wrong thing, I'm gonna miss something…" And as you described, observability is for the unknown unknowns; well, that's the hardest thing to know about, right? Because you don't know about it. De facto, you do not know what you don't know.

Totally.

So what are some of the heuristics or ways that you can decide what's important and what's not important? Because like you said in the first segment, tracking all the things doesn't really scale well for most businesses, so these decisions have to be made… And yet, you don't wanna miss something that you may need. You mentioned maybe an important customer, or maybe an error you wanna track more… But tell us more on these decisions and help folks decide "What do I need to observe and what don't I care about?"

This is a great question. One of the principles I really like to have in my head is that with any of these data tools, the data tool is only gonna be as good as the data that you're getting into it. Put garbage in, you're gonna get garbage out. So these questions around "But what do I sample? Where do I capture data from?" are so important to always be aware of.

I think that there's a perception – well, first, if observability strategy is the high-level thing that you're working towards, instrumentation and figuring out where to capture data from is the tactic to get right. And a lot of people think about instrumentation and they're like "Oh my gosh, this seems like so much work… Having to go in and say that I wanna capture data from this… Don't you just have an integration I can plug in out of the box and have it work? All of my APM tools just work out of the box."

I think that it is awesome when things work out of the box, but ultimately you know your system best. You know your system best, you know what your business cares about, you know what tends to go wrong in your infrastructure, you know what is even bound to the application; those APM vendors may not. So out of the box, getting something up and running might be helpful for making sure you don't miss any of the common bits… But ultimately, thinking through "What are the sort of entities I care about when breaking things down for my business?"

I like to talk about Intercom, one of Honeycomb's longest and oldest customers - for a long time before they had Honeycomb, they were not able to break down by app. Being a b2b company, they needed to be able to say "Well, this customer or this app is doing this thing, and this other customer is doing this other thing." That was just something that was important to their business, that previously had not been able to be translated to their engineering tools. And that's the sort of thing that only your engineering team is going to be able to go in and be like "Oh, okay, here's this entity; I'm gonna shove this into our metadata, our data tools, so that I can ask questions that incorporate this piece of metadata."

When we talk to folks about getting started with observability or doing that first pass of instrumentation, there tend to be a lot of these questions about "What matters to your business?" [unintelligible 00:26:54.00] We use Kafka pretty heavily, and it tends to matter which partition things get written to, so that's a piece of metadata that gets captured in all of our dogfood instrumentation.

Back to what I said earlier, there's this perception that instrumentation is this big lift, big thing that you have to get right, and it's a lot of work… And to that, I say "It doesn't have to be." It's something that's iterative, it's something that evolves along with the code that you're writing. The same way documentation and comments tend to evolve, or tests evolve as the logic underneath changes, so should your instrumentation. With that frame of thinking, it's almost like you start off capturing a baseline of things that you think will be useful.

If you have a basic web server, you probably care about "Handle this request" and it returns this HTTP status, and maybe came in from this user or customer ID… And as your understanding of this system evolves and as your understanding of the questions that you might want to ask evolve, you can just add new fields, add new pieces of metadata. The schemas that you're capturing, or the bits of data that you have to work with end up changing, and growing, and sometimes shrinking if you're pulling out stale fields.

[00:28:23.24] A lot of people don't like this answer, because it requires some thinking. It requires something like sitting and being like "Well, what does matter to me?" And no one likes to be told–

What do you tell those people? What if I said "Yeah, I don't like that answer."

It depends on whether I'm wearing my Honeycomb hat or not. If I'm wearing my Honeycomb hat, the answer is usually "Cool. Well, good luck. Talk to you in a couple months."

Right. So take your Honeycomb hat off and answer that question then.

With the Honeycomb hat off it's a little bit more like "How much have your underlying system technologies changed? Are you playing with microservices? Are you playing with containers and orchestration?" If yes, chances are your practices around supporting that are going to have to change also.

The idea that we can change how we deliver and host software without changing our thought patterns about how we ensure that those pieces of new technology are working the way that we expect is kind of mind-blowing. Logging tools and metrics tools really came into being like 25 years ago, when we only had grep and we had counters; APM tools came into being at some point along that path in order to bridge the gap between "Okay, I want these greps, but then I also want some flexibility in being able to get down into more data." Those tools are struggling - especially the ones that have been around for a while - to keep up with the containerized world; things that rely on stable hostings tend to not be so happy when you have 100 nodes that you've spun up and spun down three times over the course of the day as you're experimenting with something.

This increased attention being paid to "Am I capturing the information that I need from this more complex system, to answer these more complex questions" I think is a good thing. And there are lots of patterns and good practices that you can use to minimize the amount of work that you have to do, and to make sure that you're on the right path, but ultimately, all of the custom logic, all of the things that matter to your business bottom line are things that are only gonna be inside your head.

It seems that as the trends in software architecture move towards microservices and towards serverless components, observability trends alongside those, moving from a place where it's a subset of context in which it's worth the effort to instrument the correct things - I was about to say "instrument all the things", but not almost all the things - and set up these circumstances in which you can ask questions about your unknown unknowns, towards a place where it's more broadly like "Everyone's going to need this" if we're going to continue to move into this more nebulous, cloudy (I apologize for the pun) circumstance of serverless and microservices… Because we just aren't as close to the "metal" as we used to be. Like you said, when we used to have just grep and counters… Things are changing; as we move in that direction, it seems like observability becomes more and more paramount.

[00:31:52.09] Yeah. I think that serverless is a great part of this also. Again, instrumentation doesn't have to be this big whole heavy lift; it's just a question of "Well, what actually matters?" For Honeycomb it's that our customers, if they write a payload, they can query it in under a second. Oh, okay, so let's start our instrumentation in order to capture what the user is seeing; let's find a way to capture at the API layer and the query layer in order to ensure this experience, and then as we need to, we can go deeper into the stack, we can go deeper into the code, add the instrumentation for what happened at the merge step inside our query engine etc. But when you're first starting out, leave that level of detail out, until you know that you need it.

Some things it sounds like would be hard to observe, and some things it seems like would be easy to observe… So if you take our completely self-centered circumstances - there's certain things about podcasting where it's hard to observe. Our listeners, for example - we don't know very much about them. Adam and I happen to not care too much about that; but as an unknown unknown, perhaps we might want to know something about that. Or more on the infrastructure side of the question… That's more on like maybe the advertising. But on the infrastructure - how fast are they able to download all of our episodes? How do we observe those things? That's a little bit easier for us to track.

So what are some things that are traditionally hard to observe, or maybe people think they are hard to observe, and they really aren't that hard; or on the converse, what are some things that people think are easy and actually are hard? Slice and dice that question however you like.

This is an interesting question. I might come at it from another angle. There are really interesting parallels between this burgeoning observability trend in the ops and engineering and [unintelligible 00:33:48.22] space, and business intelligence folks, and almost data science. Honeycomb will go out there and be like "Oh, you can do these things with your data. You can answer these questions", and there's someone out there, sitting on their giant Tableau instance, being like "Pfft… I've been able to do that since I don't know how long." [laughter]

"The most interesting man in the world…" It reminds me of those commercials.

[laughs] Right. I'm gonna take the actual differences between Honeycomb and Tableau aside, set them on a shelf, won't get into them here, and just point out how silly it is sometimes that there are these divisions between closely-related disciplines. Business intelligence folks and data scientists have been dealing with unknown unknowns forever; they've been dealing with this question of like "Oh man, why did profits go up last quarter?"

From a completely different context though, not from the ops perspective.

Yeah, totally.

Exactly.

But it's almost the same actions. Thinking about this observability movement, it's exciting to me because it means that maybe engineers and operators and technical folks will be able to not purely think about "Well, I have this data. What can I do with the data?", but instead start thinking about "What are the questions that I need to answer in order to ensure a good experience for our users?"

What if you found out tomorrow that the Changelog wasn't accessible to anyone in France; a whole geo which is unable to access it because of something in the infrastructure. These are the sorts of things that because we are technical folks, because we are engineers, we're so accustomed to looking at what we have to work with and then figuring out what we can do with it, rather than starting to think about what our tools can help us achieve, and then setting up the data that we need to achieve those goals.

It's almost literally like a hacker, where a hacker has to think about how to infiltrate and circumvent a system. You almost have to dream how your system will fail, or problems that will come up, or things that will essentially [unintelligible 00:35:57.03] your user experience that you desire, whether it's throughput speed etc. You almost have to dream of like what could happen and then monitor the data from that.

[00:36:12.01] Kind of. I think of that as the middle step. But you can even go higher and be like "What would get you out of bed?"

Coffee. [laughter]

Well, in the middle of the night [unintelligible 00:36:14.16] What would make you go get the coffee? And if you are Shopify, it might be when a user isn't able to check out. "Oh, crap!" That is the problem. Okay, now let's think about "What are the ways things might go wrong? What are the pieces of metadata that we might need in order to quickly get from users who aren't able to check out, to users who aren't able to check out because they aren't able to talk to that database?"

Being able to think of it from the perspective of whether your customers are able to achieve their goals is how frankly all software should be written or thought about… Which is a little bit of a harder sell, so we tend to focus on observability and the technical things that can be achieved.

Is that a starting point though, the "what would get you out of bed"? Is that how you approach the necessary pieces you would wanna capture to query the unknown unknowns of the future? Is it just simply that question, or are there other questions? Because to me, that's a great question to ask, "What would get you out of bed to fix something, pager duty etc.?"

Yeah. I think that that specific question - people have associated that too tightly with what is true in their present day. There's some people who would be like "I would get out of bed if disk space is over 90%", which is certainly an answer, but doesn't quite carry the same end user impact that we want [unintelligible 00:37:48.09]. I think that it's more "How do you know that something is actually broken, or is actually impacting your business or your customers? What are they experiencing?" Set alerts, or set your pager on that.
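As a rough illustration of paging on what users experience rather than on a machine-level signal like disk space, here is a minimal Python sketch. Everything in it is an assumption made for the example: the event shape (`action`, `succeeded`), the function names, and the thresholds are hypothetical and do not come from Honeycomb, Shopify, or any particular alerting tool.

```python
def checkout_failure_rate(events):
    """Fraction of checkout attempts in the window that failed for the user."""
    attempts = [e for e in events if e["action"] == "checkout"]
    if not attempts:
        return 0.0
    failures = sum(1 for e in attempts if not e["succeeded"])
    return failures / len(attempts)


def should_page(events, max_failure_rate=0.05, min_attempts=20):
    """Page only when enough users tried to check out and too many of them could not.

    The thresholds are illustrative defaults, not recommendations.
    """
    attempts = sum(1 for e in events if e["action"] == "checkout")
    if attempts < min_attempts:
        return False  # too little traffic in this window to judge
    return checkout_failure_rate(events) > max_failure_rate


# Hypothetical five-minute window: 24 checkout attempts, 6 of which failed.
window = [{"action": "checkout", "succeeded": i >= 6} for i in range(24)]
window.append({"action": "browse", "succeeded": True})  # ignored by the check
print(checkout_failure_rate(window), should_page(window))
```

The point of the sketch is only the choice of signal: the alert is defined in terms of whether customers could complete the thing they came to do.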

I can see this conversation forking, and there's a whole path that could go into reducing alert fatigue and burnout and over-monitoring that I will not go down over the course of this podcast, but… There are a lot of smarter folks who have said things on that front, where - again, asking the right question or thinking about the signals that actually matter is something that can really improve an engineering team's lives, culture etc. on a whole bunch of different levels. Observability is just a really great opportunity to start asking those questions.

So if there's somebody out there that's like "Great! Sold! Observability rocks. I wanna implement it. I wanna bring it into my organization", what are the steps? Who has to be convinced or sold the idea of it and what are the tooling? …obviously, Honeycomb is one of them, but you mentioned APM earlier, you mentioned other tooling out there. What kind of tools or steps would somebody go through or take to start to chisel away at observability for their organization?

I think that tools are a catalyst for conversation, but rarely that first step. I think that first step is always going to have to be "Oh man, let's take a step back and think about whether we can answer the questions that our organization needs. Do our current tools/practices support looking at this from the customer's perspective? Do they support being able to break down by app ID or shopping cart ID if those are the most important things?"

[00:39:43.14] From there, folks can then start to try things, like "Okay, we have this data tool. We don't really wanna swap it out, but I want to add this new field. Or I want to add the ability to compare this customer versus that customer. Great! Let's try that." As technical people, we want technical answers for "Oh, just use this technology. Buy a Kubernetes, and then it'll fix your problems."

I was hoping there was an easy way, but it seems like there's not.

[laughs] But I think starting these conversations can at least keep a lot of this at a human level, and identifying those questions and those pieces of information that you want to be able to interact with in your data tool is the first step. From there, then it's a question of "Okay, can your tools support that? Does your toolchain, or the way you're instrumenting, support answering these questions?" If not, then that core set of questions works well for both "Let's take this set of questions and go figure out which tool makes sense for us", as well as arguing upwards, saying "Hey, these are important questions to the business. We need to be able to ask these questions. Hey, Mr./Mrs. VP, I want a little bit of time or budget to explore this better way that my team can support the software."

These are very abstract things - on Honeycomb's site we have a white paper section in particular where my co-founder Charity and Liz Fong-Jones have recently published a framework towards an observability maturity model that provides a number of these questions, around "Can your team do this? Can your team do that? These are signs of your tools not being able to help you minimize tech debt." I think that document in particular provides a great way to start thinking about and evaluating your organization's current observability practices, or to start mapping out a way to improve them.

Break

[00:42:03.12]

Christine, let's imagine a software developer, and this person is interested in super-powers… And you have promised said developer super-powers if they will just adopt observability. So give us that pitch - what does the super-power look like in this context? What do I get out of it from the dev side? …and I'm going to adopt the concepts, and try to get the metrics going, and I wanna observe my system - what do I get out of that? What are some super-powers?

Great. First, let's think about the sorts of things that a developer has to do throughout the software development cycle. Maybe you are deciding first what to build, either because something is broken and you need to fix it, or because a product manager is handing you a spec; you need to figure out how you're going to build it, the architecture review, the feasibility assessment, then you need to make sure that it works (local testing), you need to make sure ideally that it works in a broader sense [00:44:00.29] sometimes you're pushing something behind a feature flag… And then often you're responsible for the maintenance - making sure that it doesn't throw exceptions in production, or what have you.

[00:44:17.09] My thesis is that observability can impact all of these. It can improve your ability and super-charge your ability to do any of these, not just that last one. I'll throw a couple of stories and examples at you. I think my favorite one is the how to build something… Because a lot of people are like "Okay, I have a spec. How do I do it? Let me just come up with something that I think will work locally. Let me come up with something that–" if you're a TDD [unintelligible 00:44:42.17] maybe you write your tests first, and then you're like "Well, now I just have to write code that will satisfy this use case." How do you even know that that's the right use case? How do you even know that that use case is representative of what your code will encounter in production? The way that observability comes in is it lets you actually verify your assumptions. "Okay, I think that my code will have to handle workloads of this sort, payloads of this size, things like that." It will actually let me make sure that the code that I'm writing will behave well.

An example from our very early days - at its core, Honeycomb has an API that accepts a whole bunch of JSON, and we were trying to decide… We had this ticket that was like "Okay, well we should unroll nested JSON. Flatten it." Okay, great. The correct thing to do is obviously to do this by default, so that folks get this better experience. The engineer who was working on it was like "Wait a minute… Let's double-check this first. Let's find out who would be impacted, and let's make sure that if we do this, it'll have the intended effect, which is our users being happier rather than being unhappier."

So what that engineer did - his name is Ben - is he made the two-line code that would have unrolled the JSON, or figured out how deep the JSON blob was, and instead of deploying the change right away to actually do the unrolling, captured something in our instrumentation that said "If we had unrolled, it would have added these new fields as a result of the JSON blob being nested with a depth of 3 or 5." And he was able to find out that something like a third of our customers were actually relying on things not being unrolled. Thus, the correct thing to do is to have that be an option.

That is the sort of thing where if he hadn't checked it ahead of time, if he hadn't actually verified in production that a third of the customers were relying on a certain type of behavior, he could have just blindly shipped this "improvement", and made a bunch of people unhappy, and maybe caused some incidents down the road. That is an example of how even just the "how to build something" can be improved.

How did he check that in production? I might have missed that… The metrics were already in place to check how much…? If you remember the details…

No, the metrics weren't in place. But as he was writing it, he was like "Well, while I work on the full pull request and while I write the tests for the code that I would want to ship, I'm gonna prepare a smaller PR to just look at a payload as it comes in, and alongside our "Oh, I'm handling this request", capture a bit that tells us how deep a JSON payload is."
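A small Python sketch of what that kind of "measure first, change later" instrumentation could look like. This is not Honeycomb's actual code: the field names (`payload.json_depth`, `payload.would_unroll_field_count`), the request path, and the plain-dict event are all hypothetical stand-ins chosen for illustration.

```python
import json


def json_depth(value):
    """Return the nesting depth of a decoded JSON value (scalars have depth 0)."""
    if isinstance(value, dict):
        return 1 + max((json_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((json_depth(v) for v in value), default=0)
    return 0


def flattened_keys(obj, prefix=""):
    """List the dotted field names that unrolling a nested object would produce."""
    keys = []
    for key, val in obj.items():
        name = f"{prefix}{key}"
        if isinstance(val, dict):
            keys.extend(flattened_keys(val, prefix=name + "."))
        else:
            keys.append(name)
    return keys


def instrument_payload(raw_body, event):
    """Record what unrolling WOULD have done; the payload itself is left untouched."""
    payload = json.loads(raw_body)
    event["payload.json_depth"] = json_depth(payload)
    if isinstance(payload, dict):
        new_fields = [k for k in flattened_keys(payload) if "." in k]
        event["payload.would_unroll_field_count"] = len(new_fields)
    return payload


# Hypothetical request event that already exists for each incoming call.
event = {"request.path": "/events"}
instrument_payload('{"user": {"id": 7, "plan": {"tier": "pro"}}, "ok": true}', event)
print(event)
```

Because the extra fields ride along on an event that is already being sent, the question "who would be affected if we unrolled by default?" can be answered by querying those fields per customer before any behavior changes.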

Gotcha. So he made it observable.

Yeah, he made it observable. And I feel like one of the keys to working with a tool that supports this whole workflow is having the tool be – not even just tolerant, but have the tool be totally fine and love new fields being added as necessary.

[00:48:07.01] One of the principles we try to build Honeycomb to is adding a new line of instrumentation should feel like adding a comment to your code. It should be lightweight; it should be something that developers do because they have this new question and they want to see what happens, rather than some big, hairy process that involves lots of ops people stroking their chins to figure out whether they should do it or not. So the developer in this case was able to just ask this question almost in parallel with the code that he was writing.

A more concrete and more fully baked version of this - there's a company called Geckoboard in London, and they are a very data-driven company. At one point they wanted to build a new feature that part of it reduced down to the bin-packing problem, NP-complete problem. Their engineers probably could have spent quite a bit of time coming up with the perfect implementation of this NP-complete problem… And their PM was like "Well, let's just test a couple of quick implementations against our real production workload, capture the results, don't expose it to customers, run it for a day, and then we can see which implementation of this algorithm performed the best." They did it, and they were able to pick one and throw the other two away, and then move forward.

By running this sort of experiment in production, by making production not feel like "Oh, that's what happens when the code is fully baked", but is instead part of the development process, they were able to move much faster and be more confident that the implementation they eventually went with is one that would serve their needs.
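To make the idea of comparing implementations against a real workload concrete, here is a hedged Python sketch. The three bin-packing heuristics are standard textbook ones picked for illustration; the transcript does not say which implementations Geckoboard actually tried, and the event field names are hypothetical. The sketch assumes every item fits in a single bin.

```python
import time


def first_fit(sizes, capacity):
    """Place each item in the first bin with room; return the number of bins used."""
    free = []  # remaining capacity per open bin
    for size in sizes:
        for i, room in enumerate(free):
            if size <= room:
                free[i] = room - size
                break
        else:
            free.append(capacity - size)  # no bin had room, open a new one
    return len(free)


def first_fit_decreasing(sizes, capacity):
    """First-fit after sorting items from largest to smallest."""
    return first_fit(sorted(sizes, reverse=True), capacity)


def best_fit(sizes, capacity):
    """Place each item in the fullest bin that still has room for it."""
    free = []
    for size in sizes:
        candidates = [i for i, room in enumerate(free) if size <= room]
        if candidates:
            tightest = min(candidates, key=lambda i: free[i])
            free[tightest] -= size
        else:
            free.append(capacity - size)
    return len(free)


CANDIDATES = {
    "first_fit": first_fit,
    "first_fit_decreasing": first_fit_decreasing,
    "best_fit": best_fit,
}


def shadow_compare(sizes, capacity):
    """Run every candidate on the same workload and record results; nothing is shown to users."""
    event = {"workload.item_count": len(sizes)}
    for name, pack in CANDIDATES.items():
        started = time.perf_counter()
        event[f"{name}.bins_used"] = pack(sizes, capacity)
        event[f"{name}.duration_ms"] = (time.perf_counter() - started) * 1000
    return event


print(shadow_compare([5, 6, 4, 5], capacity=10))
```

Collected over a day of real requests, events like this would let a team compare bins used and run time per implementation and then keep one.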

It seems too, as you draw back, that observability - the super-power is having more eyes on the data (or in this case an experiment) around assumptions. You're no longer a lone ranger, isolated. You now have your entire team's eyes on the same dataset, and you no longer are alone.

Yeah. It's more eyes, it's smarter eyes, it's eyes that can see deeper into the code… Previously, rewind five or ten years, you had the ops people watching graphs and the developers shipping code. And with these efforts around observability, around making these tools able to talk about the terms that developers care about, you're able to invite developers over to this part of the room, invite developers to watch and think and be like "Oh, I noticed this thing. I as a developer have context that you as an ops person don't. Great. Now, we can improve this. Now we can react faster, or know better, or take this learning [unintelligible 00:50:53.21] from production and feed it into our whole development team."

One of the things that at a simple level a lot of monitoring tools can't handle well is being able to break metrics down by build ID. As a developer, knowing for sure whether my change was included in a specific change or drop in a graph - that is the most useful thing, because that tells me whether I need to care or not.

"Don't git-blame me, my commit wasn't in there."

Yeah. [laughter] That's the "Not my problem" version of it. But that's how you start to really directly attribute, like "Oh, okay, what I did had an impact", and often that's a good thing. Like "Oh, great. The performance thing I shipped did do what I expected it to do." Otherwise you're like, "Did the build go out at this time? I think times line up, or all the machines on this new build…" There's some uncertainty there.

[00:51:59.26] It's about being able to see and understand really what your code is doing, rather than just guessing at abstract signals, and hoping that they tie back to the code that we shipped.
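Breaking a metric down by build ID can be sketched in a few lines of Python, assuming each request event already carries the build that served it. The field names `build_id` and `duration_ms` and the sample values are hypothetical, and a real tool would run this as a query rather than in application code.

```python
from collections import defaultdict
from statistics import median


def latency_by_build(events):
    """Group request durations by the build that served them and summarize each group."""
    durations = defaultdict(list)
    for e in events:
        durations[e["build_id"]].append(e["duration_ms"])
    return {
        build: {"count": len(values), "median_ms": median(values)}
        for build, values in durations.items()
    }


# Hypothetical events from two builds running side by side during a deploy.
events = [
    {"build_id": "build-101", "duration_ms": 120},
    {"build_id": "build-101", "duration_ms": 140},
    {"build_id": "build-102", "duration_ms": 60},
    {"build_id": "build-102", "duration_ms": 80},
    {"build_id": "build-102", "duration_ms": 70},
]
for build, stats in latency_by_build(events).items():
    print(build, stats)
```

With the build attached to every event, "did my change cause this?" becomes a comparison between groups instead of a guess about whether deploy times line up with a graph.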

If I put on my product manager hat from years ago, I often sat in the middle of businesses' desires and our team's ability to execute on those desires, and potentially even create something that can make money. It's a multi-faceted job. I might think that observability might even be a super-power for a product manager, or somebody in charge of engineering, because you now have more resilient code, you have less issues, and that means it's more cost-effective to actually run your team and your code. So it's a business problem more than just simply a developer's super-power.

This is very true. That particular angle tends to go across less well at developer conferences…

[laughs]

…but certainly, that's the appeal, right? It's recognizing that… And this is why I get so excited about framing the question from the perspective of "How does this impact customers? What is the business impact of it?", because that's what gets other people in the organization looking and paying attention and supporting it.

Product managers, product analytics are their whole own thing, and product managers need very advanced tools to make distinctions between funnels, or retention, or all those sorts of things… But there is so much power in them being able to share and understand and ask questions in the same playground and using the same tools that the engineers do.

At Honeycomb admittedly we're a small team, but our support folks use the same tools that engineering does to verify "Oh, yeah, this customer is saying they're seeing this thing. They ARE seeing this thing. It looks like this. Oh, hey, engineering, this thing is happening", and now that hand-off is able to be a lot more informed and educated.

Our product managers are able to ask questions like "Okay, if we make this improvement, which customers is it going to impact right away?" I think that there are a lot of things that I think of as something that really benefits engineers. Running queries, being able to feel fast and iterative - those qualities really benefit anyone who's adjacent to the product development process, whether you're a product manager or a support person.

The ability to ask questions of your systems in production is not constrained to engineering disciplines at all. It's people who care about how that software is behaving.

We have a couple of non-profits who are using us, where they use Honeycomb to spit out some graphs that their chief donation officer cares about… Because they just happen to be able to incorporate the entities that the chief donation officer cares about - donors or donation amounts - in with the same data that they use to assess operational stability. Can you imagine? …if you're running a donation platform and you can say things like "We were able to tease apart some correlation between donations that were slow and donations that are large"? You can literally quantify the business value immediately of the engineering work that you're doing. That sort of thing I feel like is the holy grail of different parts of an engineering organization being able to really understand their impact, rather than just "Oh, I made this thing faster because I wanted to."

Right. There was actually some true effect on the business, and now I'd even dare say the users too, because they obviously got more excited about whatever they're doing in terms of donating, and they were able to do it.

I'm pulling a little quote from your white paper that you referenced earlier, the white paper on this framework… It says "The acceleration of complexity in production systems means that it's not a matter of IF your organization will need to invest in building your observability practice, but WHEN and HOW."

[00:56:09.12] Systems are getting more and more complex, and as we just said before, the business case value of some of this instrumentation to be in place, to capture this data and provide this ability for more than just one set of eyes to see a problem is not a matter of if, it's a matter of when… Because most things are moving to cloud, most of the things are becoming more and more distributed…

Absolutely.

There doesn't seem to be a downside in regards to the data collection like there is on the business intelligence side… Just thinking back to that dichotomy of like "We're doing the same things in different areas." Whereas on the business intelligence side you have the creepy factor of tracking people and doing too much. Maybe there is even on the observability side, on the infrastructure; maybe you can speak to that, Christine… But it seems like, aside from the scalability problems of collecting too much data, you don't have the privacy and security concerns that you would on the front-end. Do you think that's a fair statement, or are there still concerns with regards to privacy and security of your customers, the server-side analytics maybe, that happen here with observability?

I think that there's still some risk there. If you're Stripe, or something, at some point in the code you probably do have some variable that holds some sensitive PII, and I think that there are a number of different laws, as well as internal practices that allow people to protect that data.

Certainly, with great power comes great responsibility, and when you make it very easy for developers to capture the metadata they might find interesting, that tends to be something that organizations need to keep an eye on as well. "Hey, let's make sure not to send personal addresses in plain text", as it were.
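One way a team might keep an eye on this is to scrub known-sensitive fields before an event ever leaves the process. The Python sketch below is only an illustration under stated assumptions: the field list, the `scrub` helper, and the key handling are hypothetical, and a keyed hash is a pseudonym, not a guarantee of anonymity or of compliance with any particular law.

```python
import hashlib
import hmac

# Hypothetical list of fields this team has decided must never be sent as-is.
SENSITIVE_FIELDS = {"email", "street_address", "card_number"}


def scrub(event, secret_key):
    """Return a copy of the event with sensitive values replaced by keyed-hash tokens.

    The same input yields the same token, so events can still be grouped by it,
    but the original value is not included in what gets sent.
    """
    clean = {}
    for field, value in event.items():
        if field in SENSITIVE_FIELDS:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            clean[field] = "redacted:" + digest.hexdigest()[:12]
        else:
            clean[field] = value
    return clean


raw = {"email": "pat@example.com", "street_address": "1 Main St", "status": 200}
print(scrub(raw, secret_key=b"illustrative-key-only"))
```

Keeping the scrubbing in one place means adding a new field stays lightweight for developers, while the list of what counts as sensitive is reviewed separately.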

Yeah, that's fair.

Well, coming at a Strange Loop near you, right? All this and more.

I'm super-excited about it.

I'm excited about this. You mentioned this is your first time at Strange Loop, this is Jerod and I's first time at Strange Loop… We have something tentative on the stage, we're still not sure what that is, but if you're listening to this and you're going to Strange Loop, then hey, you might see us on stage and you can see something live. We're thinking about some sort of fireside chats… We're still working through the details, but it's a lot of fun, so…

You can definitely see Christine live, as she gives her talk "Observability: Superpowers for developers." Let's finish on a really tough question - favorite superhero, roundtable style.

Oh, boy.

We'll let Christine go first, you're the guest. Favorite superhero?

It's gotta be Storm.

Storm… Nice. Halle Berry version, or comic book version?

I think comic book version. I like Halle Berry, but Storm in the new Reboot generation is pretty cool; she's got her Mohawk… I'll get behind that maybe.

Nice. Adam, what about you? I've known you for a long time but I've never asked you this very personal question - favorite superhero?

I'm gonna go super-OG, super-obscure, and I'm gonna say Spawn.

Ooh…

And the reason I'm gonna say Spawn is because I'm a huge fan of Todd McFarlane. He was responsible for the reigniting of Spider-Man. He once drew for Marvel, so a lot of the modern look of Spider-Man can be attributed to Todd. And there was this sort of revolution, so to speak, in the comic world, and he and some others from Marvel broke off and created Image. Image was the brand under which Spawn was, and I just love Todd's art. He's amazing.

[01:00:04.27] Yeah, the art style of Spawn is awesome.

So Spawn isn't really the best character, but I think he was well-done. The first 20 issues of Spawn were amazing. And as a matter of fact, I own all of them. [unintelligible 01:00:12.23] all that good stuff.

What do you do with them? Do you have them up on a bookshelf somewhere? Do you observe them?

[laughs]

I went back into observability here… [laughter]

I have some observability on them… No, they're actually just in a shoebox, tucked away in a closet, dark, away from all the elements. But I've got that and plus a ton of other comics that I used to collect… But Spawn is my favorite.

Awesome. I never knew that about you, I'm glad I asked. Well, I'll go super-boring/super-mainstream/Superman. Sorry. I love Superman, I always have. I think it's probably just the first superhero that I ever learned about as a child… And he's got all the skills, he's got everything. And yet, somehow he still injects drama into the shows and into the stories, because he's gotta choose; he's always gotta choose who he's gonna save. I also like Batman quite a bit, so I'm pretty boring, but… Superman.

Very cool.

Well, since you said Batman, I can say that I'm a huge fan of the most recent trilogy. I think that was probably the best of all Batman, in my opinion.

Well, we may be able to save that conversation for an episode of Backstage… [laughter] As we're now completely ignoring our guest and we're just talking about movies.

No, since we're going down that road, if either of you or anyone listening to this hasn't watched the Spider-Man: Into the Spider-Verse…

Oh, my gosh…

Loved it!

What a great translation of comics to movie… What a great way to tell a story that I was not excited to watch again, because I'm like "How many Spider-Mans do we need?"

Yes…

But literally, they took that and they played with it, and that was a lot of fun… And in many ways, the inspiration for the title of this talk.

A hundred percent.

I'll plus-one that, but I'll also add to it…

Plus two.

…because this is append-only, not takeaway… [laughter] I'll add… Because you said you weren't excited about seeing another Spider-Man - I will agree, until I watched Spider-Man: Homecoming. It was actually really good. I loved the fact that…

I liked that one, too.

This is super-Backstage, so this is extended, but whatever - I loved the fact that they kind of remade the story with a Peter Parker that was a part of the Avengers… And I might be spoiling some of it, but just this whole new aspect that sort of brought it into the Avengers story, and kind of gave it more of the bigger universe Spider-Man appeal than just simply Spider-Man alone.

It's crazy to me… Whenever I meet someone who just hasn't been following the MCU – you know, with MCU you're either all-in or all-out. The way that they tie the stories together - they've made it so rewarding for people who have watched all the movies.

I was less of a fan of Homecoming I think than you are, but definitely [unintelligible 01:03:03.12] appreciated it.

Yeah. That's what I mean, too. It wasn't like "Woo-hoo! I'm so glad it is…" It was just definitely good.

But it's kind of weird that we were all kind of over Spider-Man, and then they released back-to-back Spider-Mans, both of which were good, and one of which - I think it was the Sony production, the Multi-Verse one - to me was groundbreaking. It was like "This is so amazing."

Very impressive.

Well, now that we've officially turned into Backstage and out of the Changelog - Christine, thank you so much for sharing your wisdom here, and for the work you're doing at Honeycomb. I can't wait to see you at Strange Loop, looking forward to the talk. Thanks for sharing your time here today, we appreciate it.

Thanks so much for having me.

Changelog

Our transcripts are open source on GitHub. Improvements are welcome. 💚
