Changelog Interviews – Episode #527
What it takes to scale engineering
with Rachel Potvin, former VP of Engineering at GitHub
This week we’re talking to Rachel Potvin, former VP of Engineering at GitHub, about what it takes to scale engineering. Rachel says it’s a game-changer when engineering scales beyond 100 people. So we asked her to share everything she has learned in her career of leading and scaling engineering.
Sponsors
Sentry – Session Replay! Rewind and replay every step of the user’s journey before and after they encountered an issue. Eliminate the guesswork and get to the root cause of an issue, faster. Use the code CHANGELOG and get the team plan free for three months.
Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com
Notes & Links
- Rachel adores EngFlow (investor and advisor)
- Rachel’s 2016 paper on Google’s developer infrastructure
- Harvard Business Review on Psychological Safety
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | This week on The Changelog | 00:55 |
2 | 00:55 | Sponsor: Sentry | 01:51 |
3 | 02:46 | Start the show! | 10:11 |
4 | 12:57 | Requested by name | 07:38 |
5 | 20:35 | How did you know or learn what to do? | 05:04 |
6 | 25:39 | Understanding constraints | 01:45 |
7 | 27:25 | 100 people is a game-changing threshold | 10:08 |
8 | 37:33 | Lead by caring for people | 02:23 |
9 | 39:56 | What's worth building now? | 07:46 |
10 | 47:42 | Using decision logs | 03:58 |
11 | 51:40 | Choosing the right tools for communication | 04:44 |
12 | 56:24 | This is NOT my problem | 07:48 |
13 | 1:04:12 | What is psychological safety? | 05:55 |
14 | 1:10:07 | Scaling, but maintaining code health | 11:22 |
15 | 1:21:29 | Wrapping up | 02:46 |
16 | 1:24:15 | It's been fun | 00:20 |
17 | 1:24:35 | Outro | 02:25 |
Transcript
So we’re here with Rachel Potvin, former VP of engineering at GitHub… But Rachel, you’ve done some amazing work at Google, engineering manager, engineering leader… Your previous role at GitHub has been amazing. You’ve been the VP of engineering at GitHub, making sure that lots of people can collaborate on code, which is just the most amazing thing, right? So of course, welcome to the show.
Thank you so much for having me. I’m glad to be here.
Well, this is a – I guess it’s kind of a good time to be just leaving GitHub, or being at GitHub, because you guys have just done so many amazing things; you’ve got Copilot out there, you’ve got all sorts of things happening… Actions is just amazing… But let’s talk about some of your – I guess some of your history there. What are some of the amazing things you’ve done? You’ve done some cool stuff, but I don’t want to say what you’ve done. You tell us what you’ve done.
Thanks, Adam. Yeah, it’s just been just a real privilege, and it’s so wonderful to get to work at GitHub. It’s really an incredible company, doing some really, really great things for developers around the world… So it’s easy to talk about so many great accomplishments… I had the great privilege of leading a large swath of the product engineering team; in fact, most of it. And so there were so many things that happened within my team that I’m just so happy to have seen get out to developers.
Yeah.
So for instance, I got to form the team that created GitHub’s Advanced Security product area; this came from nothing, and with a fantastic acquisition from a company called Semmle. We built up that product area to over 100 million ARR in under three years, which was a really exciting journey, and really fun to work with all sorts of folks on that.
Like you said, my team launched Copilot, we launched Codespaces… A personal favorite of mine, we launched the new GitHub code search and navigation experiences, which I think is just phenomenal for developer productivity… You know, I got to bring lots of renewed focus to the core productivity experiences, even around repos, and issues, and projects, and PRs. Really investing in the scalability and sustainability of that legacy codebase. But honestly, I would say that my favorite work - and this is kind of on-brand for me, I guess - is less about specific product milestones, though those are always really, really exciting… But I really get a lot of happiness from building healthy engineering practices, and a strong engineering culture, that really can sustain these product launches and these features and this growth, and of course, all of the excellent people involved. In my role, I used to always say I’m like 50% focused on the product areas that I’m managing, and 50% focused on all of engineering and what needs to happen to keep our engineering teams happy and healthy.
[05:55] Yeah. I’d love to examine the other 50%, to some degree, because I feel like there’s a lot of personal details, I guess, the relationship that comes into business, or into leading teams that just sort of kind of goes somewhat by the wayside when describing accomplishments. Like, I’m so glad that you said how you divide that up, because so often it’s that we did this, we launched that, it was amazing, this is how it scaled, this is what was the impact… But at the same time, you kept people healthy, employed, not crazy, showing up to work, keeping their fitness, keeping their self-care going, their marriages and relationships going… You know, not just shipping, right?
Yeah, one thing I was really proud of was the level of attrition within my team during the pandemic was quite a lot lower than a lot of other areas… And I think that’s because of the focus on healthy engineering teams. And look, the size of my team grew a lot during my three and a half years at GitHub. And GitHub engineering actually tripled in size during my tenure. So that’s a huge amount of growth, right? And the product area expanded so much as well. But I can tell you a little story, and I think maybe – maybe I was teed up for thinking about culture early because of the way my team first came together.
So when I first joined GitHub, I mentioned this company Semmle that we had just acquired; it was a really, really great company that formed the basis of GitHub’s Advanced Security. In my second month at GitHub - you know, Microsoft had already acquired GitHub, and Microsoft kind of realized that there was this Azure DevOps group doing great work in the DevOps space, and they were effectively competitive with GitHub, right? So someone realized, “Hey, we should like merge these teams, right?” And so a whole bunch of Microsoft people were asked if they wanted to move over to GitHub, and a lot of them did. So a whole bunch of Microsoft people joined my team… So then right out of the gate, my team was kind of like 1/3 this ex-Semmle group, and these were sort of a lot of PhD/academic types, and mostly in Europe… I remember on their onboarding, multiple people said to me, like “These Americans who are doing our onboarding are too excited about everything.” And if you’re excited about everything, you’re excited about nothing, and we’re getting exhausted; this is like a different culture, right?
And then 1/3 of my team was the sort of original hubbers, who had been on the startup journey, many of them, with GitHub, and they sort of had this more scrappy, get it done, ship to learn type, open source-first culture… And then a third of my people were from Microsoft, which - they had much more big company experience, they had expectations around the way things should work, established process and expectations… They were very good at thinking about enterprise, and had much more of an enterprise-focused culture.
And so these were all fantastic people, with very different backgrounds, experience and expectations… And I remember realizing that the first thing I really needed to do was just to bring these folks together with a common culture, and a feeling that, “Hey, we’re all hubbers. A rising tide lifts all boats. I mean, we’re all in this together; there needs to be no us versus them, because that can very quickly become toxic. But rather, we need to have a shared vision and understanding that to be successful, we all really need to work together.”
So I started with that, and like “What am I going to do to make this common culture?” We talked a lot about focusing on the experience of all developers; so not just open source, not just enterprise, but because all developers are people, and they all deserve great productivity experiences, and fulfilling lives, and so on… And then again, beyond my team, I wanted GitHub engineering to share a common culture as well. And I know from experience that setting clear, shared expectations is really key for establishing and promoting healthy, productive teams.
[09:57] So one of the first things I did when I came in, and which I never expected I would be doing, was I wrote and published and socialized career ladders, for both managers and individual contributors in engineering. I worked really closely with HR on that, which was, of course, super-important. But I also – you know, you introduce culture by what sorts of things you reward and what expectations you set. So one thing I did in that process was I introduced a new, more technically-focused career path for managers, which I think lots of people should think about doing… Because previously at GitHub it had sort of been you can be a senior manager, and then if you want a promotion, you have to be a director. And director is a different job, right? A director is managing managers; but you should be able to have career growth as a manager of individual contributors, and so with these new ladders, we rolled out this concept of staff manager, and principal manager, and sort of this technical path for managers to take, which I think is really important for sustaining strong technical teams.
So there’s lots of stuff like this… I established a design review process, and expectations around what kind of things needed to go to design review. And design review is often a communication tool as much as it is a way to get specific feedback on your design; it lets a team over here, in one area of the business, understand and know what’s happening in another area of the business.
I created something called our Principal Council, which I’d love to talk about maybe a little later when we get into healthy things to do to scale engineering teams… This group was eventually renamed to the Architects Group, but what it did was it really helped support the difficult cross-engineering technical decision-making that needs to happen, and that had really been stalling at GitHub. I set up – you know, it turns into a laundry list, right? …but I set up a developer satisfaction survey…
It does…
[laughs] …like an internal-facing survey to find out from all the engineers at GitHub, like “What are your biggest pain points? What are the things that are slowing you down? What are you dissatisfied about? What’s hurting? And what’s good? Where can we celebrate progress?”, so that we can really understand and track over time, like “This is the experience that our developers are having… And if only we could focus on fixing some of these things, then we’d have happier people.”
I also set up operational reviews, I rolled out an engineering-wide strategy that talked a lot about balancing technical debt, developer experience work, privacy and security work, along with feature work, to make sure that it was clear that these things were valued, and that highly impactful work that’s not directly tied to feature launches is also recognized and valued. So I could go on and on, but this is all a lot of culture work that helped GitHub manage that scale and that growth over time.
That’s a lot of things…
It’s a lot of stuff…
Yeah, it is. If we’ve got a good person on the show to discuss this topic, I think all doubts should be removed; you obviously have a wealth of knowledge in this space… And one of the reasons we’re doing this show, Rachel, and we’re happy to have you here, is because our audience requested you, not just in type, but also by name. They want to hear more shows, not just – you know, we talk about scaling software a lot… Maybe not a lot, but we talk about that. Scaling teams to scale software, we talk about less. I think our audience has been clamoring for more leadership style episodes, and scaling style episodes… And they got going one day in Slack, in our Slack community, and said Rachel Potvin is the one to get on the show… And so shout-out to all of our people in Slack who gave us your name.
I’m so flattered. Thank you!
I’m glad to have you here. There’s so many things we could dig into, of the different things that you did in order to succeed; I don’t know where to start… I do want to ask about just that manager bit… And if we just might like pull that out, and then maybe just set it aside and move on… But these technical manager distinctions - is this published work? Is this something somebody else could follow? Like, how do you distinguish between these different managerial tiers, and what differentiates these different roles that are non-VP, or non – you know, they’re still manager roles.
[14:13] You know, something I’ve never been good at is blogging and publishing. I wish I would have written more about this while I was still at GitHub, because I think it’s really important. I think it’s easy to fall into the trap of saying there’s two separate careers. There’s a manager career, and then there’s an individual contributor career, and they’re different. And sure, they are different jobs, but they share a lot of commonality… And one thing I really believe is that there’s a spectrum from the deepest technical person to the most strategic thinking, sort of high-level vision thinker. And that spectrum can exist both on the individual contributor ladder, and on the manager ladder. So you want to be able to give opportunity, and job, and take advantage of that skill set with the different individuals, where they are.
One thing I did at GitHub, I also developed the promotion process for engineering, and I talked a lot about that staff engineer promo, as well… And I know there’s lots of writing out there and so on about staff engineering, but you always have to be careful when you make your career ladders. It’s never a checklist, right? It’s always – it takes some interpretation. You can’t be too subjective, but it takes some interpretation to say like “This style of person is having impact in this way.” And the common currency really, at various levels, is impact. So at more junior levels, you’re taking direction really well, you know when to ask questions… By the way, that applies all the way up the ladder. But you’re really good at getting things done in a constrained way. And maybe the next step up, maybe you’re figuring out what the way is to answer a problem, and maybe at a higher level you’re actually figuring out what the problem is that we should be talking about. But I think it can be an anti-pattern to really pigeonhole managers to say, “These are people managers and coaches, and not technical individuals as well, who can understand the depth of what their team is doing.”
If a manager of ICs can’t jump in and help coach their individual - maybe you don’t need to be the deepest domain expert on everything, but you at least have to be able to understand the work that’s happening on your team, and be able to give good coaching advice, or hook your person up with someone who can give them good technical advice, great code review of what they’re working on…
And then I love seeing – I’d love to see at GitHub a distinguished manager; just that someone who’s got a small team of people working with them on the hardest problem that GitHub has. Some people call that maybe the surgeon model, right? You’re the tech lead, but you’re also working so closely with this group of people that you’re the right person to be the manager.
Yeah. What is the difference then when you go from senior engineer, to staff, to principal? What are some of the differences between those three opportunities, I suppose?
Yeah, I think it’s like I was saying… It’s how much agency and accountability are you taking. I remember having a great discussion with a principal engineer who reported to me at GitHub. His opinion was - and I fully agree with this - there’s a little bit of confidence that comes with those levels, too. So if you’re going to be a principal engineer, imagine GitHub’s down, and you’re in the Slack channel with all the people who are working on the problem. Are you willing to be the person who says “We’re going to roll back”? Or “Actually, we’re going to turn off GitHub Actions and impact only that set of our customers, so that we can bring the rest of GitHub back up”?
And so there is that experience and confidence that comes into these levels, but then it’s also sort of the nature of the type of problems that you’re taking on, and how much agency and accountability you’re taking for the solutions yourself.
[17:59] I guess the importance is to not move away from tech even further. Like you had said before, moving to, say, a director role, which is, like you said, a completely different thing; you still keep them closer to the technical problems. It’s kind of similar to – since we kind of know about what you’re doing now, after GitHub, it’s kind of like the ability to keep advising. Rather than move from senior engineer and continue up your own career ladder into director, out of, say, a more technical role, you get to sort of keep leading and advising within, but keeping your technical skill set within your career path, versus simply going into management, which sort of moves some of that away. You obviously leverage that experience, but you don’t get to put it into practice on a daily basis.
Yeah, I’ve gotta tell you, I’m talking to a lot of startups now, and there’s a good group of unhappy CTOs out there who are kind of turning into people managers for the largest teams they’ve ever run, and their joy is actually from doing the hands-on technical stuff… And so I’ve been talking to a lot of those folks who are trying to find a way to get back to their joy, and really being hands-on. And then, it is a different job to be leading a large organization of people, and that’s a big responsibility, and it’s a different role, too.
Is that advice generalizable at all? Like, can you say to a typical CTO of a growing or hyper-growing company, like “Here’s how you accomplish that”, or “Here’s the highest-impact things you can do”? Or is it always specific to this person, in this place?
Look, if I see an individual who’s in a management role, and they’re really unhappy… You know, we all have a certain amount of agency in our own lives, and I think we all have one life to live, and it’s okay to take one for the team for a while, if let’s say you’re a co-founder, or a founder, and you’re going to be the CTO of your company, and that means growing an engineering team, and you’re gonna do that for a while. But at a certain point, if you’re feeling unhappy on a day to day basis, look at what you can do and see if you can change.
There’s a lot of great managers out there, and so finding a really good partnership between an engineering leader and a CTO, or the top technical ICs in the company - that’s a partnership that really needs to form. So I think – I encourage people to find their happiness. That’s what I’m trying to do… [laughs]
Right, find your happiness.
For sure. Well, if we go back to your laundry list of things that you did - and I don’t mean to call it a laundry list like dirty laundry, but an epic list of things that you did… Where does the wherewithal, or the knowledge – like, how did you know what to do in that circumstance? And where does your experience kind of – and surely, some of it was probably explored and discovered as you went, but where do you get the knowledge to say, “I’m going to do these seven things in order to bring these three teams together in a way that scales and establishes culture?” What’s your background that brought you to that place where you could be the one that got that done inside of GitHub?
Hey, I’ve been grinding in tech for 25 years… [laughter] So I have a lot of experience –
Grinding for sure, right?
Yeah… I’ve seen a lot of ways that things didn’t work. Sometimes seeing a counter-example is just as good as seeing a good example, and even sometimes more effective; and I’ve tried things that didn’t work. But I’ve seen several common patterns in scaling my own teams over many years… I’ve brought multiple teams to over 100 people throughout my career. At Google I worked in developer infrastructure for a long time, and I brought those teams to over 100 people working in an organization of 2,000 people, with the amazing Melody Meckfessel, who is now CEO of a company called Observable. I worked for her for many, many years, and learned a lot of great lessons from her, for example.
[22:01] Then at Google I also led the cloud platform and recommendations platform in Google Cloud, and scaled that team from something like 30 people to well over 100 people. And then within GitHub as well, I’ve scaled multiple sub-teams within my organization to over 100 people, while my team itself, when I went to leave, was over 500 people. And so hopefully you learn from experience, right? I mean, I certainly think I did. And like I said, that being thrown in the fire at the beginning of my GitHub experience, where – you know, there were a lot of things that were really surprising to me, in terms of how siloed GitHub was. There were a lot of things in terms of how decision-making was happening that I could tell didn’t work…
I can give you a quick story, which is when I first joined GitHub, a fantastic team came to me… And you know, I joined two months before GitHub Universe, which is the big developer-facing conference every year. And this great team came to me, and they were working on a language feature, and they said to me, “Rachel, we have this great new language feature, and we want to announce it and release it at GitHub Universe. As our new VP, can you tell us - should we launch it for JavaScript, or should we launch it for TypeScript, Java, Python?” …you know, the four next popular languages. And I was like “Okay, hang on; this seems like a great feature. Do we need more research? Are we not confident? Why are we just like targeting one population versus another?” And this great team said to me, “Well, okay, here’s the thing… When we first started this project over a year ago, it was easier for us to get CapEx budget approval (that’s like hardware) instead of OpEx budget approval (that’s cloud capacity). And so we ordered a bunch of machines, and we got them racked in our data center, and we were running a MySQL backend… And we have space for the index for JavaScript, or the next four popular languages, but not both, and it takes 12 weeks to order new machines. And GitHub Universe is less than 12 weeks away, and so we’ve got to pick.” And for me, coming from Google, my brain was melting a little bit, because “On-prem what? Like, isn’t it all cloud?!”
Right…
I didn’t know that that still existed, right? I had a lot of learning to do when I came to GitHub. And by the way, that team did nothing wrong, because that was the way things worked. I immediately said, “We’re moving to cloud. This is not going to work”, and they had a year of pain, actually, where they couldn’t scale the product that they had made, and occasionally the scale of GitHub’s codebase overwhelmed them, and they’d have to pull back features, or turn things off, and stuff like that. So ultimately, it had to be a cloud-based product, and they did successfully move to using Azure Blob Store. But that was sort of the awakening I had when I came to GitHub, where I saw “Oh, okay, there’s trouble-making, maybe – like, these decisions that are happening in silos way too much. Like, there’s local optimization, I think, really happening in terms of the way teams are making decisions, and there needs to be sort of – absolutely, the first step of scaling is that teams have focus and agency to make their own decisions.” But then there’s a next step, where you’ve grown beyond that, and there’s certain decisions you need to know that you need to take to another level, and there needs to be the ability that’s not strictly product-focused to make those kinds of decisions coherently for the entire organization. So I felt like I had a lot of learning to do when I came to GitHub.
Understanding constraints in that case was probably key, right? Because if you didn’t ask that question, you just thought, “Well, both, of course”, but you had to understand the fact that they were on-prem, and they had… You know, if you hadn’t gotten to that part, you might have just made a premature decision, or an incorrect decision, to say, “Of course, let’s do both, because they’re all popular, and these are the directions to go.” But once you understood their constraints, you were able to sort of understand more clearly their challenges, right? Constraints equal challenges.
[26:06] Yeah, absolutely. And I’m really happy that – and again, it’s a great team, great people. They didn’t do anything wrong. That was the environment that they were in. But it also highlighted very early in my GitHub tenure, “Oh, interesting… This is how this is happening.” And then I spoke to a whole bunch of teams, actually… And remember, GitHub had been acquired by Microsoft… And I started asking, “Is anyone running anything on Azure? We have a lot of AWS, we have this on-prem, I see we have some Google Cloud… But are we running anything on Azure?” And the answer was no.
I was asking around and trying to figure out, “Do we plan to migrate to Azure? What are we going to do here?” And it became really clear that because the product teams were so siloed, every product team was thinking of its own feature sets, and there wasn’t really anyone thinking about that bigger picture of, “No, we’re going to do the investigative work, and it’s going to take time, and whatever needs to happen.” To figure out how to move to Azure, any one product team would have to throw their entire product roadmap under the bus in order to be able to work that out. And so you need that higher level of thinking to be like “Well, wait a minute… This is something we have to prioritize. We have to be able to have the flexibility to not be so constrained to these product areas, and be able to fund things like this that are going to be for the greater good.”
So when it comes to scaling these teams, one thing I’ve read from you is that you think that 100 people is kind of this threshold of engineers, where it’s like the game changes… I’m wondering if that’s just experientially what you’ve seen, or is that a magic number? And what changes, and why, in your experience?
Yeah, absolutely. It is experiential; like, that is what I’ve seen myself. But I’ve also spent the last several months talking to a whole bunch of startups, which has been really a lot of fun… So many great people out there doing interesting, novel things… And it’s held up, this 100-person threshold. And it may be slightly different for different teams, in different companies; it matters the amount of complexity there is in your product space, how many different sort of customer bases you’re serving, how many different product areas maybe you have in your organization… So 100 is not the absolute exact moment, but definitely, it starts to be hard, and things need to change at that threshold.
So I’ll talk first about kind of what I’ve seen. One of the main things is that eventually, you hit this scale where it becomes impossible for one individual to hold context for everything that’s happening, both in the product, but especially implementation-wise in their head, right? And so certainly, the individuals who are on the product teams will have lost that thread a long time ago; they won’t know what all their peer teams are doing. But, you know, maybe one person until 100 is kind of hanging on, and having a good sense of the various challenges that all the teams are feeling… But eventually, that stops being humanly possible. Work will start happening that doesn’t align well, decisions will start happening that don’t align well.
Life is certainly easy when you have - let’s say, it’s a founder, or founding engineers, or a senior technical person who can effectively make final decisions for teams when they’re stuck. But now you’re hitting this scale where there isn’t necessarily that individual who can do that. And obviously, we’ll talk about the fact that decision-making has to be delegated to teams, right? Like, that’s the first step of scale; you go from having a single team where everyone is working together, to splitting out into focus… And I can give you also lots of examples of where delegation doesn’t happen well enough, and teams are hampered because they can’t make their own decisions where they really should be… And this is exacerbated when you have timezones coming into play, and folks working on different schedules, getting stuck, and so on. So you don’t want that… You need individual teams to be able to make their own decisions.
[30:00] But then there’s these decisions that go beyond team boundaries, and they start to spin. So if two teams are invested in a decision, they can probably hash it out. But it’s these cross-engineering things, big investments… In many cases, you start to see these important technical decisions really stalling. And that’s just a danger zone, when important decisions that need to be made aren’t being made because no one feels empowered, or maybe attentive enough… And probably your eng leader is running the biggest team they’ve ever managed, and maybe they don’t even realize that these decisions aren’t being made.
I can give you an example from GitHub… And of course, GitHub is even at that next level of scaling, with the 1000+ person engineering team… And these problems get exacerbated at every order of magnitude, for sure. But an example from GitHub is it really took us too long to decide that we’re going to be moving to React in the frontend. And some teams started using React, but they were doing so in inconsistent ways, and like “Are we going to be building within the GitHub monolith? Are we building services outside the monolith? What standards are we using? What’s our sort of feel on ‘Do we want GitHub to get more of like an app-like feel? Do we want sort of like a more static web page?’” I mean, there was a lot of inconsistency in how various teams were approaching this.
On top of that, Microsoft was giving us some pressure about accessibility, and making sure that GitHub respected accessibility standards, which is really important… Is React going to be the means to doing that, or are we going to have some other UI policies?
And so that’s something that took investment, experimentation, investigation, but then ultimately, GitHub was able to say, “Yes, this is the North Star. This is the direction we’re going to go.” So then that gives a roadmap to every team when they’re starting to think about a refresh of their frontend - well, now they know; they don’t have to guess and evaluate multiple technologies, and so on.
But there’s lots of other things that start to happen at that 100-person threshold as well. I would say also like the technical impact of scale may start to be catching up with you… So process and implementation that was like good enough at a smaller scale may start to become problematic. I have some examples of that I can talk about… You know, with so many engineers, a manual deploy process stops working, and then you end up with all sorts of terrible side effects to that, where people are writing bigger changes, that are harder to code review, and then you end up having more outages… And maybe some people who originally authored the codebase are no longer around, and maybe you don’t have clear code ownership for some things that were written once, and aren’t scaling now… And so you know, outages start to happen, maybe confidence is low in terms of what needs to be done to address stability…
I sort of mentioned this already, but beyond that, I think you see a lot of industry leaders who are starting to run the largest human organization they’ve ever run. And they’re probably, like we were just talking about, no longer touching the code day to day, and they might be feeling insecure that they’re not on top of all the details, right? Maybe they know that important decisions aren’t being made, but they’re not sure that they still have the right level of insight to even make those decisions.
Right.
Maybe they’re working with a CEO who is super-focused on customer-facing progress, who doesn’t want to hear or doesn’t think about infrastructure tech debt, developer experience etc, and so that starts getting less prioritized on the team… Or I’ve definitely talked to startups where the CEO was the one who wrote the first version of the code, and they’re opinionated, but also, their knowledge is stale. And so it’s just a super-hard job for these individuals who are trying to maintain that balancing act.
And so these are all things that I think start really getting exacerbated at that 100-person scale. And the good news is, there’s a lot of things you can do, but it’s interesting to see how prevalent it is.
[34:00] Yeah, for sure. How then do you get that person or persons that has that – I guess you can kind of say it was confidence in one way, but the ability to see that there’s a problem there, and then start to enact change? You’d mentioned they wouldn’t see the problem anymore, they were too far away from it… How then from a VP level do you start to give people that agency to make those changes, or to see more clearly and make choices and decisions? Because it seems like when you get to a 100+ organization engineering-wise, like you had said, one individual can’t hold all that in their personal RAM; it begins to be divided, and whatnot. How do you get to that point to give people more clear access to what needs to actually happen?
Isn’t there some quote or something that like recognizing the problem is half the battle, or…? I’m terrible at quotes, so…
Sure. That’s GI Joe, I believe.
I think it’s GI Joe, “Knowledge is half the battle”, or something like that. Yeah.
Oh, it’s GI Joe? My goodness… [laughs] We’re going way back there then.
Yeah.
Wow… Okay.
Well, half the battle, I believe, is from GI Joe. Everything else is from something else. I think it was a combined –
It’s a remix.
Either way. Yeah, either way.
Let’s just say it’s a Rachel original then maybe? I don’t know…
Sure, why not?
But no, I’m sure it’s not. I’m sure it’s not.
There we go… I think you’ve just coined it.
[laughs] But recognizing that things are changing, and that you have to work differently, and that the way things have gone before will no longer continue to work… It is something that people realize; and whether they realize it sooner or later, they will realize it, because again, you’re going to hit one of these problems where you have a massive outage, and you don’t feel equipped to handle it… Or you’ll realize that “Wow, we’ve been spinning on this decision for a really long time, and we haven’t made this decision. How come we haven’t made this decision?” So it will be noticeable eventually, it’s just sort of “How soon do you notice, and how much do you put in place while it’s easy, so that when you get to that level, you can kind of sail through it?” Definitely, a lot of things that can be done. You can do work to avoid technical scaling bottlenecks early by focusing on code health, and having best practices in place. You can proactively invest in your developer experience before your developers are screaming that they can’t deploy anything. You can set up individuals who are directly responsible for different product areas, and different technical domains to give them agency and accountability in decision-making… And there’s a lot of things you can do with culture to really make sure you’re valuing different types of work, right?
A failure mode I see a lot of companies get into is being way too user-facing-focused. And it’s great to celebrate launches and product launches and great feature launches and so on… At Google there was an expression, “Landings, not launches”, which I really liked… Because – you know, I was talking to the Copilot team about this a year ago, where I said, “I actually don’t care about getting to GA with Copilot. I care one year from now do we have a healthy team that can maintain the thing that people are depending on?” Just getting something out the door is not what you have to worry about. You really have to worry about what happens next. And so culture has really a lot to do with that.
So yeah, I mean, I think people will always hit that pain eventually… And so I’d love to help people notice it sooner, and be ready to address it sooner.
It seems the somewhat secret sauce might be the concern and care for actual people in the mix, right? Like, one thing is clarity and expectation… This is something you’ve said several times, and part of the way you lead is–
Very clearly.
Yes, exactly.
[laughs]
But it seems like this desire to care for individuals - it’s different whenever you lead with, like you had said, a launch, not a landing. A landing is safe, intentional, or at least it’s desired to be safe and intentional.
Yeah, not always…
[38:01] If you’re landing, it’s like “Let’s make it soft. Let’s not make it abrupt. Let’s not damage our knees.” I’m thinking airborne for the Army, for example - when you come out of an airplane and you’ve got a parachute on, it’s easy to damage your knees if you don’t land properly. So landings are intentional, they’re safe, they have some sort of circumstances around it, you have some care for individuals… It seems like that’s a somewhat unknown secret sauce to how you lead?
Well, I would say also the way I define landings is you achieved what you wanted to achieve, right?
Right.
So you can launch, you can get top of Hacker News, whatever, and that’s cool… But six months after launch, have you got the usage that you wanted to see? Do you have the retention that you wanted to see? Are you perhaps generating the revenue, if it’s a revenue-generating product, that you wanted to see? Do you see people using the product the way you expected them to be using it? And so before you go to any launch, you should have at least as clear as possible a hypothesis and a target of where you want to be, and what you want to achieve… And that’s something that I think launches are hard, but they’re easier in some ways than sustaining, right? Sustaining - you have to have SLOs in place, you have to have a good on-call rotation, with good playbooks, you have to understand what’s the cost of keeping the lights on for this service, how do we handle customer escalations and user escalations, how do we triage work, how do we prioritize? Is this scaling? What scaling bottlenecks are we going to hit? Sometimes success is a double-edged sword, because suddenly the way you wrote this thing is no longer going to work, or your number of machines that you have in your MySQL or on-prem backend are not going to be able to fit what you’re trying to do… And so to me, that’s what a landing is - it’s really like “We have something that people can depend on, that’s reliable, that’s sustainable”, and so on.
One of the challenges that I am seeing is these competing concerns with – I don’t know, just like our propensity to build the wrong thing, or to yak shave… We have YAGNI, which - when it comes to scaling, a lot of us aren’t going to need some of the scaling things. But when we do, we really do need them. And then there’s also things that we should be building right away. So you can’t bolt on security, for instance. So when it comes to like engineering something, security you should be thinking about from the beginning. But a lot of us, in trying to prepare for the possibility of scale, never get the launch done, because we are setting up our CI/CD, right? We picked Kubernetes when we may never need it. Or we spent all this time developing things that we didn’t need, and then it came time for us to need something, and we didn’t develop that thing. Like “Oh, I wish I would have had this incentive system in place.” Right?
So it’s difficult to like fake what’s worth building upfront, because some of these things you said can - if you’re prepared to scale, if you rolled out a Kubernetes cluster from the beginning, and it turned out that you had this huge launch, and now you’re scaling, and “Wow, it’s amazing. We can just get more nodes, or whatever” and it worked, as opposed to like an on prem MySQL server that just hit a wall, and you’re done. And so especially now that you’re talking to startups, who may or may not have to scale, are there ways you can help people, help us think about these things, where it’s like “What’s worth building now, and what is premature optimization that’s going to be completely a waste of my time, and never push my business forward?”
Such a good point, Jerod, because I’ve said over and over again, to my teams and to various folks that I’m advising and coaching - everything’s a trade-off, right? And it’s not obvious. You have to assess the cost and the benefits. And a lot of times for startups, being first to market really matters. I think you want to be really intentional sometimes about accruing technical debt, and that’s perfectly fine, because you’re eager to get something in the hands of customers and see, “Do we have product-market fit, or do we not?”
[42:06] And so being able to be thoughtful and intentional and make those decisions, I think a lot of the times – like, definitely don’t try to over-engineer something if you don’t even know if you have product-market fit. Get something lightweight out there; get a prototype out there and see what kind of reaction you get, and learn from your users.
GitHub has – one of the sort of philosophies, I guess, is called ship to learn. And I like it, and I hate it. I kind of wanted to burn it down, but I also appreciate it, right? But it’s like, what I want to do is add nuance to it, which is ship to learn the things you should ship to learn, and be really deliberate about the things you need to be really deliberate about, if that makes any kind of sense. And so what kind of decisions can you unwind quickly, right?
I love ship to learn for like UI features and UI changes. I think that’s really healthy and good, and where you can iterate quickly. But then there’s changes where like “This is gonna be really hard to back out of.” I’m writing this data schema, and it’s gonna be difficult to undo this. Or I’m adopting this new infrastructure - I’m not gonna ship to learn it. Let’s have a design doc, let’s talk about it, let’s really get the right set of eyes on it…
I’ll tell you, I set up this engineering-wide design review process at GitHub. It’s really good. Half of it is a communication tool, right? Sure, people got really good feedback on their design docs. And by the way, not every little thing needs to go to engineering-wide design review, right? There’s layers. We think about how broadly impacting is this change I’m making; if it’s just on my team, then let’s just do a design for my team. And actually, maybe it’s just something that I’m going to ship to learn and we don’t even need [unintelligible 00:43:50.05] But for certain things - I’ll give you an example: the issues team at GitHub wanted to start using CosmosDB, because we’ve been a very MySQL-backend company, and we have these more sort of NoSQL use cases cropping up for storing issue hierarchy… CosmosDB seemed like a good fit, and so - bring it to engineering-wide design review, and then all the various teams who are thinking “Oh shoot, MySQL is not really working for me either” can come and be like “Oh, here’s the use case I have”, and it’s a communication tool and you talk about it, you get it out in the open, and then you get some good feedback, and so on.
And so yeah, everything in life is a trade-off decision, and so I would never advocate for always building for scale from the start, always addressing your technical debt immediately. No, there’s very legitimate reasons to make concerted decisions there.
I think the challenge I see is – I’ve definitely talked to some startups recently who maybe were intentional about saying, “Hey, look, we’re gonna just not worry about this technical debt. We’re gonna hack together this feature and get it out quickly.” But then do you lose track of that technical debt? Did you forget about it? And does it show up six months later in an outage, and actually now it’s a bigger deal because various other things happened that built upon it?
And so I’d always advocate for being intentional about the choices you’re making, and having a way to track decisions and understand where you have things that you’re probably going to have to look at later.
Hm. Wow.
And also, by the way, thinking about what scaling, sort of throttling type limits can you put into your product initially, so that – you know, I can’t tell you the number of times it’s happened where I wasn’t paying attention to that API, and suddenly, like “Oh my gosh, a bunch of people have used it for this really expensive use case that we sort of never imagined.” The GitHub code search API - people were using it to like count all instances of their API being called through all GitHub codebases ever, and it’s like “That’s a super-expensive query. It’s not really what GitHub code search is about.” But there were no limits on the API, and so customers - of course, humans will do things the easy way, and if they find a way… And so do think about how your product might be used, do put in place user limits, throttling, anticipate the things you might want to be alerted about when you hit certain thresholds and certain scales, right?
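To make the throttling point concrete, here is a minimal sketch of per-client limits using a token bucket, with a simple count of rejected requests that could feed an alert. This is one common way to do it, not a description of how GitHub limits any API; the class names, the `cost` parameter, and the `alert_after` threshold are all assumptions made up for illustration.

```python
import time
from typing import Dict, List, Optional


class TokenBucket:
    """Per-client budget: starts full and refills at a steady rate."""

    def __init__(self, capacity: float, refill_per_second: float,
                 now: Optional[float] = None) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Spend `cost` tokens if available; expensive queries can cost more."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


class Throttle:
    """Hypothetical wrapper: one bucket per client, plus rejection counts."""

    def __init__(self, capacity: float, refill_per_second: float,
                 alert_after: int = 3) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.alert_after = alert_after
        self.buckets: Dict[str, TokenBucket] = {}
        self.rejections: Dict[str, int] = {}

    def check(self, client_id: str, cost: float = 1.0,
              now: Optional[float] = None) -> bool:
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_second, now=now)
            self.buckets[client_id] = bucket
        if bucket.allow(cost, now=now):
            return True
        # Track who keeps hitting the limit so someone can be alerted.
        self.rejections[client_id] = self.rejections.get(client_id, 0) + 1
        return False

    def clients_to_alert_on(self) -> List[str]:
        return sorted(c for c, n in self.rejections.items()
                      if n >= self.alert_after)
```

A caller would run `throttle.check(client_id, cost=...)` before serving a request and periodically look at `throttle.clients_to_alert_on()`. Charging a higher `cost` for expensive queries is one way to keep an unanticipated use case from consuming the whole service.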
[46:32] I’ll tell you one that is a personal sort of concern of mine that I’ve seen at GitHub. GitHub has about 40 repositories that go into the GitHub platform, and it’s sort of a lot of the newer product areas are in their own repos, and are separate services… But there’s also the GitHub monolith, which is a Ruby on Rails application, which is issues, and PRs, and projects, and sort of all the core functionality of GitHub as a code hosting site is really in that monolith. And we’ve had a lot of scaling problems at GitHub with deployments, partially because of the way the Active Record paradigm works in Ruby on Rails, where the data layer is too tightly coupled to the logic, and so people are making database changes all the time. And if you only have a few people working, that’s manageable, but that starts to become unmanageable pretty quickly with the number of engineers… You know, beyond that 100-person threshold, there’s certainly more than 100 people who touch the GitHub monolith. And so that’s created a lot of complexity for deployment and a lot of bottlenecks that need to be addressed.
I can definitely imagine that. Going back to the decision-making, do you use and/or advocate for like a decision log, or some sort of place of record? I’ve never done this, but I imagine at scale you’d want to have like “Here’s the decision - we want CosmosDB for this product. Here’s the analysis we did, here’s the decision we made, here’s the constraints we are working under, or the assumptions, and this is why we’ve picked it.” I’ve heard people say you’ve got to have one of those, because the short-term memory of an org - especially in the software world, we churn so much, right? People move on, and switch roles often, and so you don’t have that institutional domain knowledge stick around very long… So I’ve heard decision logs are a great tool for that kind of knowledge. Your thoughts?
Yeah. Look, any tool like that is as good as it is findable, and as good as it is clear. And part of the culture - and I’ll give you an example at Google… You may have heard of GoLinks. GoLinks is a company I think that was created based on the way linking worked at Google, where basically if you knew a product area, you could type go/ that product name, and you would land on their documentation. It was just fantastic, because everyone used it. But that’s a cultural thing, because everyone knew where to look.
I’ve talked to someone who was working at DuckDuckGo recently, and they use Asana for everything, and they do decision logs, and they have just a very clear process, and everyone knows to look there, and everyone does it. So you can’t just have a decision log without the culture to go along with it.
Right. You’ve gotta have buy-in.
You’ve gotta have buy-in, and you show people that this works, and that it’s usable, and then it becomes advantageous, and then people buy into the culture. I spoke recently to the CEO from a company called Dream Team, and they have a project called Cata that I’m keeping an eye on, because it looks really good in terms of this sort of project management… They do integration with Slack, integration with GitHub, integration with JIRA… And again, it provides that functionality of everyone knows where to look. So you can set up a decision log in that product, and type on Slack the right keyword decision, and it’ll end up there, and then people don’t need to look around.
I think one of the challenges I’ve often seen is like “Yeah, let’s document this decision in a Google Doc. Or maybe this one’s in a repo, or maybe this one is somewhere in Slack.” And that’s cool, but if it’s not findable, it sort of doesn’t matter.
[50:15] So to answer your original question - yeah, I’m a fan of lightweight decision logs. I’m a fan of design documents also, and chances are your design document points to your decision… But even more so is that culture you need around “How are we doing things, and where are things found?” It could be a really big challenge.
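As a rough illustration of what a lightweight decision log entry might hold, here is a minimal sketch built from the fields mentioned in the question (the decision, the analysis, the constraints or assumptions) with a keyword search, since findability is the point. The `Decision` and `DecisionLog` names, the `owner` and `tags` fields, and the example values are hypothetical placeholders, not a record of any real GitHub decision.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass(frozen=True)
class Decision:
    title: str
    decision: str
    analysis: str                 # or a pointer to the design doc
    constraints: List[str]        # constraints and assumptions at the time
    owner: str                    # hypothetical: who to ask about it later
    tags: List[str] = field(default_factory=list)
    decided_on: date = field(default_factory=date.today)


class DecisionLog:
    """One shared place to record decisions and look them up again."""

    def __init__(self) -> None:
        self._entries: List[Decision] = []

    def record(self, entry: Decision) -> None:
        self._entries.append(entry)

    def find(self, keyword: str) -> List[Decision]:
        """Case-insensitive match on title, decision text, or an exact tag."""
        needle = keyword.lower()
        return [
            e for e in self._entries
            if needle in e.title.lower()
            or needle in e.decision.lower()
            or any(needle == t.lower() for t in e.tags)
        ]


log = DecisionLog()
log.record(Decision(
    title="Datastore for issue hierarchy",
    decision="Use CosmosDB for this product area",
    analysis="(placeholder: link to the design doc)",
    constraints=["Existing backend is MySQL"],
    owner="issues-team",          # placeholder value
    tags=["datastore", "cosmosdb"],
))

for entry in log.find("cosmosdb"):
    print(entry.title, "->", entry.decision)
```

The structure matters less than everyone agreeing on the single place where entries live, whichever tool that turns out to be.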
I’ll say even – even an org chart, right? At GitHub there wasn’t a great org chart, and one of the engineering directors on my team wrote a new org chart; it’s the org chart we use now, and I was like “Oh, Harry, thank you so much for doing this”, because even just being able to find who’s working on what, what person should I talk to… You really have to be careful - and again, this comes back to the 100-person scale - around informal networks and needing to know someone who knows someone to find out the information you need. As much as possible, when you get this information into systems, then you can find the answer on your own, and it’s easy and quick.
I think when you have that informal culture of network, and “Oh, I’ll just ask so-and-so who will know”, then you propagate meetings… In this remote culture, it’s never just a five-minute question. You always book a 30-minute meeting with someone to ask them maybe the one question that you’ve had… And so then you’re sucking all your time into meetings. Whereas if you have clarity of where to find information, that can really go a long way.
Yeah, I’m kind of glad you went that direction, because Jerod, I was thinking that same thing, but your question was slightly different than how I would have asked it… It was more like “How do you choose the tools to communicate?” Because it seems like you’re a clear communicator. If you can find it – like you had said, when you have access to information, you don’t have to have so many meetings, and you rely less on your network, because you have to know somebody who knows somebody to get access to the information… But when you’re in hundreds, and then to thousands… You know, I’m not asking you to use Slack over JIRA, or to use this over that, but how do you organizationally choose what becomes culture, the tools you use to communicate? How do you do that? Do you build your own tools? Is it that “invented here” kind of situation? Because even at small organizations like ours, which is a very small organization in comparison to yours, we still don’t have a clear culture of “If you want this information, go here to find it” in lots of cases in code; and we can go find it in our GitHub repo, of course… But if it’s written, there’s probably three different places we may have used over the last five years. So our culture has not been adopt one tool, use it heavily; it’s been fractured across many tools, never consolidated. So how do you, at that scale, hundreds of thousands –
Well, don’t feel bad…
Okay…
Yeah, don’t feel bad, because that is super-common, and–
Yeah… We’re also early adopters, so we try out every new thing, and so that’s part of what we do. So there’s some of that culture, like “We’re gonna try the new thing and see if it works for us.” So we have knowledge bases spread amongst startups that are alive over the years, for sure… But go ahead, Rachel.
[laughs] That’s fantastic. Well, I mean, if that’s what you’re doing, that’s okay; that makes sense for you. But it is a cultural challenge. Do you build your own thing? I would say ideally not.
Probably not, right?
I mean, project management is not the core competence of – unless that is your business. This kind of product that I was talking about - that is their core business, so they should use it, and they should build their own thing, and they should make it amazing, so that everyone else can use it. But it doesn’t matter, right? Like, is it Asana? Is it a GitHub project? Is it Google Docs that are well organized? Pick your battle. I think a lot of things can work, but with lack of clarity, every team in your organization will do something different. And that’s when you get into trouble.
Yes…
So it’s just standards and consistency. And you don’t want to – I mean, we can go back to everything’s a trade-off. You know, you don’t want to be too heavy-handed about things, and be like “You must work this way…”
I was gonna just ask that… Like, do you just dictate it? Yeah…
[54:09] Yeah… But there’s certain things where it’s – it’s a virtuous cycle, I think, where you say, “This is where we put design docs. Everyone do it, because then you’ll find the design docs you want to find, and that’s a good thing. So please do this.” And you can – as a leader, I can actually go and say, “Why didn’t you do this? I need you to do this next time.” But the best is when people see “Well, okay, this is helping me, and so it’s logical.” It’s not process for the sake of process. I think you have to be extremely careful about rolling out half-baked process, where it’s going to introduce friction for teams.
And another thing I can talk about, which we touched on in decision-making, is different types of decisions hold different weight, and can be undone or fixed or changed more easily or less easily. Well, different types of teams are working on different types of projects. And so I’ve definitely seen the pattern where a leader will come to me and say, “Why is Team A moving so quickly, and Team B is moving so slowly?” Oh, well, Team A is iterating on a UI for something; just like important and hard work, but the pace of that change is different than Team B, that’s building infrastructure.
So I also never want to say like “Well, Team B, you should be having a burndown chart that looks just like Team A, and I want to see the same amount of velocity and–” No. Team B probably has to do more prototyping, more research, there’s going to be some dead ends in terms of maybe what they’re investigating… Maybe they have a “buy or build” decision to make that’s going to require some research that won’t end up in a milestone deliverable, right? Other than a decision. And so like keeping that in mind, I never want to be too heavy-handed with process. The right amount of handedness, if that makes sense… [laughter] Everyone has to figure out what that means for their organization.
An adequate amount, that’s my favorite thing. My wife says, “How much do you want?”, when it’s like food, or… It’s like, “An adequate amount. I don’t want too much, or too little… I just want an adequate amount. Right in the middle there.”
When it comes to, I guess, “Not my problem” - not that this is a good attitude to have; like, you can say, “This is not my problem” when it comes to decision-making… How do you deal with who owns certain problems? Obviously, you’ve got a senior engineer in place, or a tech lead, or somebody that’s in charge, but how do you solve for that responsibility layer?
Yeah. I mean, this is where – so when I talk about the things you can do to effectively scale, I think I put them into pretty much three buckets. So there’s a lot going on in code health, there’s a lot of advice I have for teams around code health and developer experience, and so on; there’s a lot of advice I have for teams around how to think about decision-making, and then the final one is culture. And culture encompasses all those things, and more. But it’s fine, sometimes something isn’t a team’s problem, right? Sometimes you want your team focused on the product area they’re working on. They should have a mechanism to surface - maybe something’s come up, maybe we’ve noticed something… Where do you bring those problems? Is there an obvious place? Is there a spot where you document, like “Hey, this thing isn’t working. I don’t think it’s for me to fix, but someone should know”, right?
The thing I set up at GitHub, which - you know, it was a learning process; all this stuff is a learning process. I think you’re never done. You never say like “Okay, I set everything up that I need to do, and now my organization is humming perfectly, and I can just drink a Margarita”, and whatever, right? But the Principal Council, which was renamed to the Architects Group, had a backlog where any engineer in the company could add an issue, saying “Hey, I think someone should think about this.” And not everything would get touched, but the Principal Council was effectively the most senior engineers, individual contributors in the company, coupled with me and my two peers, who were the engineering leaders.
[58:13] And so the most senior ICs had hands in the code on a daily basis, were deeply familiar with how things worked, and represented different product domains and infrastructure within the company. And me and my peers held the responsibility for cross-eng prioritization, and funding, and were able to move people around from different teams… I think one thing you want to be careful about is that people don’t develop too tight of an identity to the thing they’re working on, and that you don’t get such siloed teams that it’s difficult to move people and say, “Hey, look, we really need help over here. Can your expertise and what you did in the past come into play over here?”
So me and my engineering counterparts, we were able to have conversations with people and say, “Hey, can you come work on this problem? We’re setting up a special virtual team to really address this thing. Let’s get this done.” I would always ask one of the most senior ICs to be champion for any decision that needed to happen, and they were responsible for communicating decisions around that specific area, and really - not necessarily being the lead implementer, but mentoring and coaching the people who were taking charge of the problem area. And so yes, it’s fair for people to say “This is not my problem”, but there should be a mechanism for important things to get surfaced. Does that answer your question?
For sure. I mean, the fact that you have some sort of garbage collection, essentially, which is what that is… It’s almost like “How would you write a program, or a compiler, or something like that?” It’s like “Well, you need garbage collection.” That’s kind of what that is. Like “This is not my problem, but it is a problem, and somebody should know about it”, and you’ve got some sort of organized body willing to have an inbox for that, big or small, and then find ways to communicate that back to you and others who are leading the organization at a larger scale, to say “How do we deal with this?” in some way, shape or form. Because the “It’s not my problem” situations are really a challenge, because you might find that issue, but it’s like “Well, it’s not mine to fix”, as you said, “but somebody should know about this. Who do I tell? Oh, I’ll tell nobody. Let me get back to my job, climb my ladder, do my thing… Okay, cool.” No, we can’t have that.
And there’s also – like, the DevSat survey that I talked about, right? That’s a great way where you’re asking your internal engineering teams, anonymously, “Tell us, what are your biggest pain points? What are the things you’re most worried about? What are the things that are not working for you?” And it’s not just the squeaky wheel in that case who’s gonna get the attention. You can see aggregated over your entire group, “Hey, look, true story. Every single person is talking about how painful deployment is.”
That takes trust though, doesn’t it?
It does, it does.
You have to trust in that organization to say those things and not get the backlash, potentially. And then you have to have a frequency in some sort of case to get that feedback often enough, right?
I think you’re so right, trust is so important… And so all this stuff plays into culture. I will tell you, I did AMAs with my team fairly frequently. AMA is Ask Me Anything, right? And I was so happy when I would get really pointed, hard questions. I’d be like “I don’t love this question, but I’m glad you’re asking, because then I feel like you trust me that I’m actually asking you to ask me what’s on your mind.” And if you’re only getting softballs, if you’re only getting easy questions, then you really have to ask yourself as a leader, “Are people scared to say the right thing?”
Yeah. Is there freedom of speech here? Yeah.
Is there…? Yeah. And sometimes it’s like “Look, you’ve got to move on. You can disagree and commit on this. This is what the answer is. I know you don’t love it, but we’ve got to be able to move on.” But other times, there’ll be things that I’m not even aware about.
[01:02:00.01] I tried all sorts of experiments. I did one time an anonymous AMA, which is a really funny experience. I think it worked out well, but I had people anonymously submit questions - and I should have called you guys to interview me and say the questions, or something… But I did it by myself. So I did like a one-hour recording of myself by myself, answering these questions… And it was nice, because I was able to gather some data to answer some of the questions too, but there were some really hard questions. It was during the pandemic, and there were a lot of things that people were worried and insecure about… And I just thought, “I’m really happy that people felt safe enough to ask me these questions, and that I’ll be able to answer them.” I think that that is really important. That’s a cultural thing that you can’t undervalue.
And even in the DevSat survey, one of the questions that I would ask is about psychological safety. How decisions were made on your team… So there’s a lot of questions around the specific developer experience, but there are also culture questions on there, that then with that survey, I would give it, as a leadership survey; so I was interested in the broad trends across everything. But then it was a survey that each manager who had enough respondents would get, so they could specifically look on their own team, “Do I need to set–” We used OKRs - which are objectives and key results - every quarter, so set some goals… “Do I need to set some goals around psychological safety on my team? Or maybe around some other process that’s not working, or on-call?” On call was a big one. “People are really stressed out about on call, maybe we need to do more training…” So that was the use of the survey, too.
And then actually the third group that would benefit from this survey was specific product areas. So GitHub - we decided that the paved path for development at GitHub was going to be using Codespaces, and so when we rolled that out, of course we got lots of interesting feedback on that survey about the experience of using Codespaces. And so that was valuable feedback to the Codespaces team to be like “Okay, here are some things we can focus on.” We want to make our internal customers really happy, and that’s going to be important for them making our external customers, who we have less access to, happy as well.
I kind of know what you mean by this; this is sort of a question to kind of get deeper at it, but when you say “psychological safety”, what do you mean? How does that translate to actionable findings and details? What actually is that?
Yeah, because I have to say, you have to be careful about over-broadening terms like that, right? Psychological safety does not mean that no one can give you constructive feedback, right? And that’s really important. When I talk about, again, scaling eng teams and culture, this is one that’s coming to bite a whole bunch of startups, and I think it was a problem at GitHub as well, where people conflate it with kindness, and maybe pleasantness, or something like that… And so sometimes it can be really hard to make good decisions if people are too scared to say the real thing.
It’s actually – and I’ll get back to your psychological safety bit, but it was fascinating to me when I rolled out eng-wide design reviews, because the first design review happened; it was a topic – I’m trying to think of what it was… It was something around monitoring and alerting [unintelligible 01:05:18.13] And this is important; it was gonna affect all of engineering, right? So perfect thing for a design review. I’m hosting this session, and I’m getting all these DMs, right? And so the way I would set up a design review is people are supposed to be informed coming into the room – you want to make the high-bandwidth meeting as effective as possible. So everyone’s read the doc, you’ve put all your comments on the doc… The design review’s for resolving comments that can’t get resolved asynchronously, right?
[01:05:47.15] And so then we’re in the room, and I’m getting these DMs, and people are saying like “This thing won’t work. This thing they’re proposing - it’s never going to scale.” And I’m trying to host a meeting, but then I’m DMing back, “Can you say that?” Like “Yes, I agree with you. Can you say that?” [laughs] And people were like “Well, I don’t want to be a jerk.” And it’s like, well, it’s not being a jerk if you’re telling a team – you have very relevant experience. Look, you’ve done this before, you know – this team needs to hear what you have to say. Don’t just DM me and try to get me to say it. It’s gonna come better from you. You’ve done this before.
And so that was like a cultural barrier to overcome, where GitHub had come from this history of consensus building, which is problematic also, right? Like, consensus is great when you get it, but you can’t live by consensus, especially when you start to scale. You need directly responsible people who are accountable for decisions, who are going to make unpopular decisions. Not every decision you make can be popular, right?
And so I actually took over one design review just to talk about culture, and be like “Hey, how do we have these hard decisions where you’re not being mean to a person; you’re not saying mean things about that person. We need to be able to talk –” It’s the same thing about blameless post mortems; human error might have happened in an outage, and you have to be able to say that, and say, “Here’s some automation that we could put in place that would make it less likely for that to happen again.” It’s not an attack on the individual ever, but we have to be able to learn and grow.
So that’s a little aside, because I get nervous sometimes when we talk about psychological safety without that framing. But psychological safety to me is being able to say things that you’re worried about, things that are on your mind, things that you think are important, without fear of retaliation, or retribution. And that is invaluable. So I always want my teams to have psychological safety, so that they can ask me hard questions, so that I can realize, “Oh, I had no idea that this is such a problem for you. And by the way, the last 10 staff engineers that I spoke to told me the same thing. Wow. Now I’m going to do something about it, because clearly, this is a big problem.” And so if people don’t feel safe bringing things up, then you just don’t get the information you need. But that’s different than being too pleasant, or too kind, right? Empathy coupled with accountability…
What does this liberty do then for toxicity? Does it squash it completely? Does it just expose it further?
Hey, look, toxicity is something I’m never going to tolerate. And I think that’s a cultural thing as well. Like, what do you tolerate? I always say, how you reward and who you promote speaks more to your culture than anything you say. Right? And so when I would host training sessions on promoting, specifically for staff engineers, it’s like “Look, toxic behavior is not tolerated.” So that’s belittling someone, attacking someone, shouting at someone… All these things have happened to me in my career. We’re not going to–
Complaining though… My framing there was more complaining. Because you can freely complain and be toxic; you could be pleasantly toxic, too. [laughter] And I just wondered how that blends, you know what I mean?
So it comes back to this concept of knowing when to disagree and commit. If I tell someone, “Look, I’ve heard your point, maybe I empathize with it, but I’m sorry, we’re not doing anything about it”, and then you keep bringing it up - that’s being toxic, right?
Yeah. Okay.
And so complaining is not productive when the solution is not happening, or the situation is not changing, right? So I do expect people to be productive. I do also want to hear about the things that are bothering people, that are maybe not fixable, because maybe at some point in the future they will be fixable, or maybe there’s an opportunity to move someone to a different team, where that won’t be as much of an issue. So like everything, it’s a trade-off, and there’s judgments involved… But yeah, there’s definitely a time to stop.
Yes. It depends… Trade-offs… The classic answer.
[01:09:56.23] Is that my answer to everything? Sorry… [laughs]
No, no, no. It’s just what happens. It’s inevitable. It’s more like a defeatist position than anything.
So while we’re talking about trade-offs, you mentioned the three buckets of scaling engineering teams: code health, decision-making, and culture. We focus a lot on decision-making and culture. We talked about code health a little bit with regards to YAGNI, and premature optimization, things you can do now versus do later, and how we often trade off code health for speed, shipping etc. But when it comes to scaling in an engineering org, what are some things you can do with regard to maintaining the health of the code, which allows everything to actually move forward productively?
Yeah, great question. I feel like this is a podcast unto itself at some point, if we ever wanted to do that, because there’s so many things… And it’s overlapping with culture, as is everything; that’s going to be my answer for everything today too, but… An example where it overlaps with culture is like code review. I love the culture of prioritizing code review above your own work, right? It’s not always feasible. I’ve definitely had problematic situations where a poor engineer in Europe woke up with so many code reviews in their inbox, because all of the Americans, right before signing out, were like “Oh, he’ll be up soon”, and then that person would just be drowning in code review.
But in general, having code owners, and the ability to affect large-scale codebase evolution requires people doing effective code review. And a failure mode I’ve seen is where – you know, I had another principal engineer who was reporting to me at GitHub, who made a pretty simple change into – basically, to keep it simple, the way Go worked at GitHub. And so basically, everyone writing Go code at GitHub had to review his simple code review. And that should be fast and easy, right? But it wasn’t. I needed to get involved to escalate for teams outside of my area to say “Hey, after a month, you still haven’t prioritized this code review. You need to do it so that we can roll out this change.”
And so really having good code review tools… Again, we talked about design review - very important. And then developer experience, and like at what scale are you going to start thinking more about your developer experience is really important from a code health perspective. I’d love to tell you a little story about deployment at GitHub, because it really resonates with many of the startups that I’ve spoken to recently… GitHub got into trouble with its deployment strategy, and is on the right track now, thankfully, but it’s a surprisingly common story to see that in developer experience build and test times get longer, and there’s test suites running that don’t need to run, and so on… But like deployment is a particularly painful one, and I would say there are like three areas where it really hurt at GitHub. One was just the volume of changes got too high; too many people wanting to deploy. And so there, if we’re only considering GitHub’s primary deploy target, which is github.com, just the number of different people wanting to deploy changes on this fairly manual process that required human engagement, started creating friction.
GitHub has this kind of unusual “deploy, then merge” strategy. So for code changes, you actually deploy your code first, check that everything’s working, and then merge back into the main branch, so that main is always available for rollbacks. It’s kind of an unusual strategy that I wouldn’t necessarily recommend, because it’s part of the scaling challenge… But GitHub moved to using deploy trains to help with that volume of changes, and this is still very manual though… A conductor, who would be the first person who got on the train, would be responsible for shepherding the change.
[01:13:53.09] And then there’d be all sorts of gamification that happened. I had a teammate who was like “Why am I always the conductor every time I want to roll out a change to the monolith?” And it’s like “Well, because everyone was hanging back, waiting for someone to take that role, and then you jumped on, and you were the sucker who every time –” Yeah. [laughter] And so this is a bad experience…
And, then I started hearing from people too, like “Well, I won’t even try to deploy something after lunch, because if I ended up being responsible for that - who knows? I’m gonna be stuck till after dinner, waiting around… So I’m just gonna wait till tomorrow.” And so you can see the sort of like aggregation of friction there, and how much that slows down development. It’s just not acceptable.
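A minimal sketch of the deploy train idea described above, assuming a train is just an ordered batch of changes and that the first person to board becomes the conductor. The class name `DeployTrain`, the `capacity` limit, and the author and change names are hypothetical; nothing here reflects GitHub's actual tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DeployTrain:
    """Hypothetical model of one deploy train: a batch of changes shipped together."""

    capacity: int = 4  # assumed limit; the real number is not given
    changes: List[Tuple[str, str]] = field(default_factory=list)  # (author, change)

    @property
    def conductor(self) -> Optional[str]:
        # As described above, the first person to board shepherds the whole train.
        return self.changes[0][0] if self.changes else None

    def board(self, author: str, change: str) -> bool:
        """Add a change to the train. Returns False when the train is full."""
        if len(self.changes) >= self.capacity:
            return False
        self.changes.append((author, change))
        return True


train = DeployTrain(capacity=3)
train.board("alice", "fix-login-redirect")
train.board("bob", "tune-search-ranking")
print(train.conductor)  # alice
```

Because the conductor role falls to whoever boards first, the model also shows the incentive problem in the story: everyone does better by waiting for someone else to start the train.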
In DevSat - I mentioned the satisfaction survey - deployment came out as the highest friction. And then like all these other side effects that affect code health, like people writing bigger changes, code review becoming more difficult, changes being deployed become more risky… So an increasingly problematic situation. And that was just for .com.
And then - and this is a situation that happens at a lot of startups, too. github.com isn’t the only deploy target for GitHub. There’s GitHub Enterprise Server, which is an enterprise-focused product, where customers deploy GitHub Enterprise Server on-prem. And for them to do upgrades, they require downtime, right? And so the way this worked was they replay all the database changes, update the code… But database changes are unpredictable timing-wise. I already talked about how way too many database changes happen at GitHub because of partly Active Record, and sort of the way the monolith is sort of like not well componentized across the data layer… And so then GitHub Enterprise Server’s customers started having an unpredictable amount of downtime for their upgrades, which is a problem.
Also, most of the GitHub engineering teams were really focused on .com. So “I got my feature out to .com. I’m done. The ops team can deal with whatever.” So then this poor ops team is managing the upgrades for Apple, and IBM, and all these big customers, but also lots of small customers… Debugging becomes more difficult because “Is your feature in the enterprise server deployment, or is it not?” There’s a whole challenge with feature flags. We did a really fantastic tech debt cleanup actually around feature flags, where there have been so many feature flags at GitHub that were on permanently, or never been turned on, or on in the worst case scenarios that are like different configurations for different enterprise customers… And so that became problematic as well.
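A rough sketch of the kind of audit a feature flag cleanup like this might start from, assuming you can export a snapshot of each flag's state per deploy target. The data shape, the labels, and the flag and target names are all hypothetical, and a single snapshot only approximates "on permanently"; a real audit would also look at how long each flag has been in that state.

```python
from typing import Dict


def classify_flags(flag_states: Dict[str, Dict[str, bool]]) -> Dict[str, str]:
    """Label each flag by how it is set across deploy targets.

    flag_states maps flag name -> {deploy target -> enabled?}. Both the
    shape of this data and the labels are assumptions for illustration.
    """
    labels = {}
    for name, by_target in flag_states.items():
        values = set(by_target.values())
        if not values:
            labels[name] = "unknown"      # no data recorded for this flag
        elif values == {True}:
            labels[name] = "always_on"    # candidate: remove the flag, keep the new code path
        elif values == {False}:
            labels[name] = "never_on"     # candidate: remove the flag and the unused code path
        else:
            labels[name] = "divergent"    # differs per target; needs a human decision
    return labels


snapshot = {
    "new_diff_view": {"dotcom": True, "customer_a": True, "customer_b": True},
    "old_experiment": {"dotcom": False, "customer_a": False, "customer_b": False},
    "beta_search": {"dotcom": True, "customer_a": False, "customer_b": True},
}
for flag, label in classify_flags(snapshot).items():
    print(flag, label)
```

The "divergent" bucket corresponds to the worst case mentioned above, where different enterprise customers end up on different configurations.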
And then the third piece to the deployment puzzle at GitHub, which was really enough to say “Stop. We’ve gotta really invest in how we do deployment” was on-prem enterprise product is not the state of the art; it’s not where most companies want to be, and so GitHub really had to develop a cloud SaaS offering for enterprise customers. And this is something GitHub has been working on for years. There’s a lot of pressure on it. Obviously, downtime for upgrades in a multi-tenant SaaS product is not a thing, right? And so there has to be a way to propagate deployment to that endpoint in a healthy way as well.
There was lots of pressure from leadership to get this product out the door quickly, and so GitHub did try to take shortcuts, tried various strategies to replay changes from .com to the cloud, and never could work, never could scale. Especially the frequency and unpredictability of the time required for database changes just made that untenable; like, how do you interleave code changes and database changes with the right timing, with the right lead time? The enterprise product would always end up getting so far behind that it could never catch up to .com. So that just wasn’t working.
What an issue there… That sounds like a big headache, basically.
[01:17:48.25] But it’s funny, because I’ve talked to multiple startups who are in this situation as well, where they had maybe a community product, maybe an open source product where deployment is a little bit more straightforward, and then now they have an enterprise-specific product… And in most cases, the community product is a single deploy target, and the enterprise - it’s like multiple deploy targets. Maybe you have multiple different instances, right? And so this is like completely changing the game on how deployment works, and so you have to have a thoughtful, coherent strategy for doing that, for dealing with scale. And this is one of those ones that I feel like deployment is hitting everyone, and something that they need to be really thoughtful about.
And you know, historically, the deployment process at GitHub and at many, many startups just depends on so much information in humans’ heads, right? Like “I made this destructive database change, and I know I can’t make the associated code change until the backfill has finished… And - oh, see that that backfill has finished, so now I will make this code change.” And that much information in a human’s head can work okay for a single deploy target, but when you have N deploy targets, forget about it. You’re done. There’s too much complexity to manage. So yeah, it’s interesting…
Is that the state of deployment right now, -ish?
No…
Okay. So has a lot of this been solved then?
GitHub is doing really good work. I would say it is in good progress, but it took – this is one of those things where like “Oh, maybe thousand-plus person scale…”
Right.
…where you had to say, “Look, we can’t do this quickly.” There were efforts to say, “Quick, get this thing out the door”, and it was an example where it didn’t work. I’ll tell you, other sort of factors that happened were like – this is obviously an Azure cloud-based offering. You know, “We’re just gonna like follow Azure process.” Well, all of GitHub is using PagerDuty, and Datadog, and sort of like all the sort of tools you would expect, where Microsoft has all these custom alerting, monitoring frameworks, and it was like “Well, actually, I guess we need to like rewrite all our alerts in this other environment… And so now, developers are meant to be on call, and look at Datadog for this, but like this other system for this…” So that was just falling apart from a developer experience.
GitHub is doing really good work right now in this, and part of the key was a bunch of different strategies were tried using checkpoints… And you know, this is obviously something that - it’s a culture thing, too. I’m gonna say that every time, because – and one thing we didn’t talk about today, which we could talk about in another podcast is platform teams, and how you can’t expect magic platform teams to solve all your problems, because you really need to have product engineering involved in the work they’re doing, and how they work and so on… But every team is going to change how they do deployment at GitHub as part of this. And so it’s not just a magic platform team off in a corner who’s going to solve this… But the key for GitHub has been really decoupling database changes from code changes, and really seeing database changes through the entire system, before moving on to associated code changes. And so that slows velocity in some ways, and you have to work on the culture to say, “Okay, .com developers, maybe you’re going to be slowed down a little bit, but actually, this is for the greater good, and now your feature actually gets out to the enterprise product more smoothly, and so that’s a win for you.” So this is still in progress at GitHub. It’s not a solved problem, but I have a lot of confidence in the people who are working on it that they’re making great progress.
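One way to picture "seeing database changes through the entire system before moving on to associated code changes" is a gate that tracks the state of a database change per deploy target instead of in someone's head. This is only a sketch under assumptions: the `ChangeStatus` states, the function names, and the target names are hypothetical, not GitHub's actual process.

```python
from enum import Enum
from typing import Dict, List


class ChangeStatus(Enum):
    """Assumed lifecycle of one database change on one deploy target."""

    PENDING = "pending"
    BACKFILLING = "backfilling"
    COMPLETE = "complete"


def blocked_targets(status_by_target: Dict[str, ChangeStatus]) -> List[str]:
    """Targets where the database change has not fully landed yet."""
    return sorted(
        target
        for target, status in status_by_target.items()
        if status is not ChangeStatus.COMPLETE
    )


def can_ship_code_change(status_by_target: Dict[str, ChangeStatus]) -> bool:
    """Allow the dependent code change only once every target is complete.

    An empty mapping is treated as "not safe", since nothing has been recorded.
    """
    return bool(status_by_target) and not blocked_targets(status_by_target)


status = {
    "dotcom": ChangeStatus.COMPLETE,
    "enterprise-server": ChangeStatus.BACKFILLING,
    "enterprise-cloud": ChangeStatus.PENDING,
}
print(can_ship_code_change(status))  # False
print(blocked_targets(status))       # ['enterprise-cloud', 'enterprise-server']
```

The trade-off matches the one described above: the code change for .com waits on the slowest target, which costs some velocity but keeps every deploy target on the same sequence of changes.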
For sure. For sure. Well, a lot could be said, as you said just now, and we may have to do another podcast with you on more topics… Or have you back next year, or more frequently, now that we’ve had you on at least once. It has been great hearing all the behind-the-scenes and all the challenges that come with leading, but then also instilling the right culture, displaying the right clarity and expectation, the right documentation, the right kind of leadership… I think you truly are an example of that. And I’m so glad we had you on the show, because you got to put that on display, and that’s awesome.
[01:21:58.15] And now you’re on the next hierarchy of your career, advising, and doing fun things… I’ve gotta imagine that you have people reaching out, or there’s a way for folks to reach out. Is that something you’re advertising? And if so, feel free to advertise.
Oh, thanks. Yeah, I’m still figuring out what’s next for me, but I’m really enjoying getting to talk to a lot of different startups, and setting up some advisory roles, which has been really fulfilling. I will say, there’s one startup I’m working with that I just adore called EngFlow, and I’ve been an investor and advisor for them for a while. And they’re formed from two former colleagues at Google. Helen, the CEO, is a good friend, and it’s just incredible. And they were the folks responsible for bringing Bazel to the world, and now they’re doing amazing things for build and test optimization and developer experience. It’s so close to my heart. And they actually came in and did a hackathon in my basement last fall, and being able to be close to them and hear the excitement of everything they’re building was really part of what got me energized and thinking more about the startup world… So I have them to thank for motivating this change in my life as well.
But yeah, I’m really focused on developer and data productivity; those are passion areas for me, and I really feel like there’s a lot of exciting, important work happening in that space… So the companies I’ve been talking to are mostly in that space. And I do think I have some good insight in this 100-person-plus scale. So there’s a lot of eng leaders who are out there, who are struggling managing this scale for the first time, and I’d love to be able to help where I can.
And I’m enjoying my life quite a lot right now. I realized – I may have said to you, Adam, I felt like I’ve been grinding for 25 years, and I realized, “Gosh, I had never been away for more than one night with my husband since my 10-year-old was born.” And that’s embarrassing. [laughs] And so we’re fixing that, and just enjoying a little time…
Yes…
Yeah. It’s been really good. And so I’m definitely on a journey, living my one life, and trying to be happy, and still figuring out what’s next. So please do reach out to me if you want to talk.
There you go. Well, Rachel, it’s been an absolute pleasure hearing about your journey, and all the things you’ve learned, all the things you put into place as a leader, and we look forward to getting you back one day, someday soon maybe, for more. So thank you so much, Rachel. It’s been awesome.
Thank you so much. This was a lot of fun. I appreciate you both. Thank you.