This week we're talking to Rachel Potvin, former VP of Engineering at GitHub, about what it takes to scale engineering. Rachel says it's a game-changer when engineering scales beyond 100 people. So we asked her to share everything she has learned in her career of leading and scaling engineering.
Rachel Potvin: Yeah, great question. I feel like this is a podcast unto itself at some point, if we ever wanted to do that, because there's so many things… And it's overlapping with culture, as is everything; that's going to be my answer for everything today too, but… An example where it overlaps with culture is code review. I love the culture of prioritizing code review above your own work, right? It's not always feasible. I've definitely had problematic situations where a poor engineer in Europe woke up with so many code reviews in their inbox, because all of the Americans, right before signing out, were like "Oh, he'll be up soon", and then that person would just be drowning in code review.
But in general, having code owners and the ability to effect large-scale codebase evolution requires people doing effective code review. And a failure mode I've seen is where - you know, I had another principal engineer who was reporting to me at GitHub, who made a pretty simple change to - basically, to keep it simple - the way Go worked at GitHub. And so basically, everyone writing Go code at GitHub had to review his simple change. And that should be fast and easy, right? But it wasn't. I needed to get involved to escalate for teams outside of my area to say "Hey, after a month, you still haven't prioritized this code review. You need to do it so that we can roll out this change."
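For context, GitHub routes required reviewers through a CODEOWNERS file. The team names below are made up, but the pattern shows why a sweeping "simple" change hurts: when one platform-level change touches files across many owned directories, every matching team gets pulled in as a required reviewer, and the change only ships once all of them prioritize it.

```
# Hypothetical CODEOWNERS entries (org and team names are invented)
*.go             @example-org/go-platform
/app/billing/    @example-org/billing
/app/actions/    @example-org/actions
/app/packages/   @example-org/packages
```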
And so really having good code review tools… Again, we talked about design review - very important. And then developer experience, and at what scale you're going to start thinking more about your developer experience, is really important from a code health perspective. I'd love to tell you a little story about deployment at GitHub, because it really resonates with many of the startups that I've spoken to recently… GitHub got into trouble with its deployment strategy, and is on the right track now, thankfully, but it's a surprisingly common story to see in developer experience that build and test times get longer, there are test suites running that don't need to run, and so on… But deployment is a particularly painful one, and I would say there are three areas where it really hurt at GitHub. One was just that the volume of changes got too high; too many people wanting to deploy. And so there, if we're only considering GitHub's primary deploy target, which is github.com, just the number of different people wanting to deploy changes on this fairly manual process that required human engagement started creating friction.
GitHub has this kind of unusual "deploy, then merge" strategy. So for code changes, you actually deploy your code first, check that everything's working, and then merge back into the main branch, so that main is always available for rollbacks. It's kind of an unusual strategy that I wouldn't necessarily recommend, because it's part of the scaling challenge… But GitHub moved to using deploy trains to help with that volume of changes, and this is still very manual, though… A conductor, who would be the first person who got on the train, would be responsible for shepherding the change.
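To make the flow concrete, here's a minimal sketch - not GitHub's actual tooling, and all names are invented - of the "deploy, then merge" train described above: riders queue up behind a conductor, the batch is deployed and verified against production, and only then is each pull request merged, so main always matches what is known to be good.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PullRequest:
    number: int
    author: str


@dataclass
class DeployTrain:
    riders: list[PullRequest] = field(default_factory=list)

    @property
    def conductor(self) -> str | None:
        # The first person to board the train shepherds the whole batch.
        return self.riders[0].author if self.riders else None

    def board(self, pr: PullRequest) -> None:
        self.riders.append(pr)

    def depart(
        self,
        deploy: Callable[[list[PullRequest]], None],
        healthy: Callable[[], bool],
        merge: Callable[[PullRequest], None],
    ) -> None:
        # Deploy the combined changes first and verify production health.
        # Main has not been touched yet, so rolling back just means
        # redeploying main. Only after verification does each PR merge.
        deploy(self.riders)
        if not healthy():
            raise RuntimeError("deploy unhealthy - roll back to main and retry")
        for pr in self.riders:
            merge(pr)
```

The friction described next falls out of the manual parts of this loop: someone has to volunteer as conductor, watch the health checks, and stick around until the whole batch has merged.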
[01:13:53.09] And then there'd be all sorts of gamification that happened. I had a teammate who was like "Why am I always the conductor every time I want to roll out a change to the monolith?" And it's like "Well, because everyone was hanging back, waiting for someone to take that role, and then you jumped on, and you were the sucker who every time -" Yeah. [laughter] And so this is a bad experience…
And then I started hearing from people too, like "Well, I won't even try to deploy something after lunch, because if I ended up being responsible for that - who knows? I'm gonna be stuck till after dinner, waiting around… So I'm just gonna wait till tomorrow." And so you can see the sort of aggregation of friction there, and how much that slows down development. It's just not acceptable.
In DevSat - I mentioned the satisfaction survey - deployment came out as the highest friction. And then there were all these other side effects that affect code health, like people writing bigger changes, code review becoming more difficult, changes being deployed becoming more risky… So an increasingly problematic situation. And that was just for .com.
And then - and this is a situation that happens at a lot of startups, too - github.com isn't the only deploy target for GitHub. There's GitHub Enterprise Server, which is an enterprise-focused product, where customers deploy GitHub Enterprise Server on-prem. And for them to do upgrades, they require downtime, right? And so the way this worked was they replay all the database changes, update the code… But database changes are unpredictable timing-wise. I already talked about how way too many database changes happen at GitHub, partly because of Active Record, and sort of the way the monolith is not well componentized across the data layer… And so then GitHub Enterprise Server customers started having an unpredictable amount of downtime for their upgrades, which is a problem.
Also, most of the GitHub engineering teams were really focused on .com. So "I got my feature out to .com. I'm done. The ops team can deal with whatever." So then this poor ops team is managing the upgrades for Apple, and IBM, and all these big customers, but also lots of small customers… Debugging becomes more difficult because "Is your feature in the Enterprise Server deployment, or is it not?" There's a whole challenge with feature flags. We did a really fantastic tech debt cleanup actually around feature flags, where there had been so many feature flags at GitHub that were permanently on, or had never been turned on, or - in the worst-case scenarios - were on in different configurations for different enterprise customers… And so that became problematic as well.
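As a rough illustration - hypothetical flag, environment, and customer names, not GitHub's actual flag system - here's why long-lived flags with per-environment and per-customer overrides make "is this feature even in your deployment?" hard to answer:

```python
# A flag has a global default, may be forced on or off per environment
# (.com vs. Enterprise Server), and in the worst case is overridden per
# enterprise customer. Every stale entry is one more configuration to
# reason about while debugging.

FLAG_DEFAULTS = {"new_merge_experience": False}

ENVIRONMENT_OVERRIDES = {
    "new_merge_experience": {"dotcom": True},  # shipped to .com, never cleaned up
}

CUSTOMER_OVERRIDES = {
    "new_merge_experience": {"acme-corp": True, "globex": False},
}


def flag_enabled(flag: str, environment: str, customer: str | None = None) -> bool:
    # Most specific setting wins: customer override, then environment, then default.
    if customer is not None:
        override = CUSTOMER_OVERRIDES.get(flag, {}).get(customer)
        if override is not None:
            return override
    env_override = ENVIRONMENT_OVERRIDES.get(flag, {}).get(environment)
    if env_override is not None:
        return env_override
    return FLAG_DEFAULTS.get(flag, False)
```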
And then the third piece to the deployment puzzle at GitHub, which was really enough to say "Stop. We've gotta really invest in how we do deployment", was that an on-prem enterprise product is not the state of the art; it's not where most companies want to be, and so GitHub really had to develop a cloud SaaS offering for enterprise customers. And this is something GitHub has been working on for years. There's a lot of pressure on it. Obviously, downtime for upgrades in a multi-tenant SaaS product is not a thing, right? And so there has to be a way to propagate deployments to that endpoint in a healthy way as well.
There was lots of pressure from leadership to get this product out the door quickly, and so GitHub did try to take shortcuts - tried various strategies to replay changes from .com to the cloud - and they never could work, never could scale. Especially the frequency and unpredictability of the time required for database changes just made that untenable; like, how do you interleave code changes and database changes with the right timing, with the right lead time? The enterprise product would always end up getting so far behind that it could never catch up to .com. So that just wasn't working.