Go Time – Episode #257

How Pinterest delivers software at scale

with Nishant Roy, Engineering Manager at Pinterest Ads

Nishant Roy, Engineering Manager at Pinterest Ads, joins Johnny & Jon to detail how they’ve managed to continue shipping quality software from startup through hypergrowth all the way to IPO. Prepare to learn a lot about Pinterest’s integration and deployment pipeline, observability stack, Go-based services and more.

Sponsors

Square – Develop on the platform that sellers trust. There is a massive opportunity for developers to support Square sellers by building apps for today’s business needs. Learn more at changelog.com/square to dive into the docs, APIs, SDKs and to create your Square Developer account — tell them Changelog sent you.

FireHydrant – The reliability platform for every developer. Incidents impact everyone, not just SREs. FireHydrant gives teams the tools to maintain service catalogs, respond to incidents, communicate through status pages, and learn with retrospectives. Small teams up to 10 people can get started for free with all FireHydrant features included. No credit card required to sign up. Learn more at firehydrant.com/

Calhoun Black Friday – Go Time co-host Jon Calhoun is having a Black Friday sale on November 21st-29th. All paid courses will be 50% OFF. Learn more about Jon’s courses at calhoun.io/courses

Chapters

1 00:00 Opener 00:44
2 00:47 Sponsor: Square 00:54
3 01:41 It's Go Time! 00:49
4 02:30 Welcoming Nishant 01:06
5 03:37 Nishant's background at Pinterest 01:51
6 05:28 The work life of a Pinterest engineer 02:28
7 07:55 Pinterest's integration & deployment pipeline 07:27
8 15:23 Testing in production? 00:48
9 16:29 Sponsor: FireHydrant 01:18
10 17:57 Pre-submit tests vs unit tests 02:29
11 20:26 Pinterest's observability stack 01:32
12 21:58 Pinterest's Go-based services 02:29
13 24:27 On Go performance tuning 03:38
14 28:05 Maintaining velocity during hypergrowth 04:08
15 32:14 Tool selection criteria changes at scale 01:45
16 33:58 Standardizing tooling across multiple teams 02:16
17 36:15 Pinterest's documentation template 02:28
18 38:55 Sponsor: Calhoun Black Friday 00:53
19 40:03 Incidents help fill in the gaps 02:46
20 42:49 Mentoring junior devs 01:04
21 43:54 It's time for Unpopular Opinions! 00:32
22 44:26 Nishant's unpop 01:36
23 46:02 Opining on Nishant's unpop 06:06
24 52:08 It's too snowy for Jon to unpop 00:39
25 52:46 Time to Go! 00:25
26 53:18 Outro 01:01

Transcript

Alright, welcome, one and all. Today we have a special guest that is joining us to talk about scale. Scale stuff. But before I introduce him, I want to acknowledge my co-host who decided to join me last-minute, and I welcome him very much… Jon, welcome back.

Thanks for having me, Johnny.

So today’s guest is Mr. Nishant Roy. He is an engineering manager over at Pinterest, and Pinterest deals with a lot of scale, as you would imagine. There’s a lot going on over there, and we figured, hey, why don’t we bring Nishant over and talk about some of these things? And obviously, talk about the role Go plays in the mix. But obviously, I have to warn you, this is not going to be an all about Go kind of podcast. Obviously, Go plays a role in the kind of engineering they’re doing over there… But obviously, our conversation is going to be a bit broad, as far as things like CI and CD, and pipelines, and [unintelligible 00:03:28.29] that kind of scale is concerned, and hopefully that’ll be of interest to a lot of you out there who may be in a similar situation. So Nishant, why don’t you give us a brief intro, and before we get into it, what brought you to Pinterest, and what are you doing over there?

Yeah, thanks so much, Johnny. Nice to meet you, Jon, and Johnny as well. Hi, everyone on the stream. So yeah, I’m Nishant, I’ve been at Pinterest for just under five years now. I lead the ad serving platform team here. I started out as an intern actually, close to six years ago now, on the same team. The team at that point was just about ten people. My team itself now is 15. So talking about scale, the team has scaled tremendously as well with the product and everything around us.

What brought me to Pinterest is actually a little different than what I’d say most people’s answers are. I met some of the team on campus at Georgia Tech while they were there recruiting, and I really enjoyed talking to both the recruiting team, as well as some of the engineers and product managers who were there. As people, they seemed like people I would love to work with and hang around with, and also, just the problems they were talking about at that point - this is like late 2016 when I met them for my internship… Pinterest was just entering, I think, this hypergrowth phase from both a user perspective and a revenue/monetization perspective… So it just seemed like a really interesting place to go. And coming out of college as a new grad, I didn’t really know a whole lot, so I figured it’d be a good place to take what I did know, learn a lot more from those folks, and see what a company going through that hypergrowth period looks like.

Awesome, awesome. So you’ve been there for quite a while, so you’ve had a chance to see the organization - the engineering organization, at least the part of it you’re in - mature and grow as well, yeah?

Yeah. And the big change, I think, was obviously going from a pre-IPO company to going public, and seeing how various things change around that - not only day-to-day engineering work, but also communications that go out within the company, outside the company… It definitely did really impact how we operate as an engineering org. The biggest thing, obviously, is once you’re public, there’s a lot more compliance that you have to go through, and being on the ad side of that, we deal with that pretty frequently.

Okay. So the life of an engineer day to day at Pinterest, working in your group, looks like what, generally speaking?

Yeah, so back when I started, again, the team was much smaller; the overall ads organization was much smaller, so things were moving a lot faster. We had fewer checks in place for things like - if we’re talking about scale and CI/CD today - we didn’t have as robust a system as we do today. People were able to make changes a lot faster… However, that came at the cost of a lack of proper verification in place. So we had a much faster feedback loop in one sense, in that you were able to get changes out to production faster… However, the piece of feedback that was missing was, “Is my change actually gonna bring down production, or cause any major issues?” It was not completely missing, but it was definitely less sophisticated than it is today.

[06:13] So at that point, there was less process, essentially. If you had your own idea, go write the code, you essentially just needed one person to sign off on the code review… Less compliance and blocking reviews at that time. Go ahead, put your change out, present your metrics, your experiment, your A/B results essentially to your team leader, your org leader, and as long as you get approval from a couple of people, you were good to go. So that was what a day in the life looked like back then.

Now there’s a lot more process; not necessarily a bad thing. I know process has a pretty negative connotation… But what it means is we have way fewer severe incidents, at least, and that means more people at the org have an understanding and a say in what changes are going out and what reason they’re going out for. So essentially, like I mentioned earlier, you just wrote up this doc with your results, and got approval from a few folks, and you were good to go. Now there’s a more robust and involved process, where there’s a forum that comes together to review the changes, and there’s healthy debate around why a certain decision is being made, and what the rationale for that is… And ensuring that everyone from different [unintelligible 00:07:13.24] and the product side are on board before any major changes like that go out that may impact pinners, us internally, or our partners or advertisers.

Okay. I imagine that some changes require more scrutiny than others, right?

Definitely.

So if it’s a quick bug fix, you probably don’t go through that same extensive process every time, right?

Yes. So we have a more lightweight process for smaller bug fixes etc. And the criteria for that, essentially, is as long as there’s not a significant change to any of the 10 to 15 top-line metrics that the org has decided are vital to monitor, that change requires the old process, essentially; you get approval from one or two folks within your team or your org, and then you’re good to go.

Okay, so part of being able to deliver changes, big or small, is having some sort of an integration and delivery pipeline, right? So it’s the same process, I would assume, whether the change is small or big; you’re going through the same – from a technology standpoint, from an automated process standpoint, it’s the same thing. There’s no different way of doing things if it’s a small change or a different way if it’s a big change.

That’s right.

Okay. So with that said, what are the stages of this pipeline? So I’m thinking, generally speaking, most people only really have – they do their unit tests locally, and there might be some integration tests that happen in the cloud, and maybe there’s a staging environment involved… For most people, that’s sufficient. Right? So basically, what are the different stages in your pipeline, and why do you have all these different stages?

Yeah, and I’m happy to say that this is something that’s evolved a lot as we’ve scaled as well… So just backing up again to what I do - as part of leading the ad serving platform team, our responsibilities are to enable the ads team to continue to grow and deliver and launch more products, new algorithms to improve ad delivery and efficiency on the platform. We’re over a $2 billion company now; that all comes from ads, so that’s part of what we do.

The other side of that obviously is, again, going back to scale and compliance, making sure that our systems are scaling at an ideally sublinear cost rate. So for every dollar made, we don’t want to be spending an additional dollar; that’s one condition. And then also, as the number of engineers grows, and the number of changes in our system grows, we don’t want to be having as many, or as frequent, or as severe outages, at the very least. So given those requirements, having a robust CI/CD system, integration testing, staging environment, etc., and incremental rollout became extremely important.

So what this was like when I joined Pinterest was - I’m proud to say that the ads team was one of the first teams to have a continuous deployment process, so there was a pretty good foundation for me to build on top of. Essentially, when you committed your change, within let’s say the next 30 minutes it would hit a canary environment. That canary environment was essentially just, let’s say, one host, and the goal of that host was to ensure – and this is a Go service. So the goal of that host was to just make sure that your service is not panicking and crashing. If it did, it would trigger an alert to an on-caller, who would then go in, look at the cause for the panic, roll back the change, and ensure that it doesn’t go out to further stages etc.

[10:12] Once it passed that initial canary test, which is about a 10 to 15-minute test, it would go out to what we call staging - but let’s say a larger canary, essentially - which was at that point about 1% of the entire production cluster. This is all serving production traffic, so now on this 1% of the cluster we can actually monitor more application-level metrics, whether that is number of ads inserted, CPU usage, memory usage, disk usage etc. At that point I want to say we were monitoring probably about 70 to 80 metrics. If any of those metrics showed a significant regression - and we could tune the thresholds for each of those alerts - it would similarly trigger an alert to our team’s on-caller, who would have to go in and manually pause deploys, and the whole process of debugging kicks off from there.

What we’ve evolved a lot since is we no longer require the on-caller to go manually pause deploys, or roll back deploys. We have a system in place built on top of Spinnaker, which is an open source platform from Netflix, if I remember correctly… And essentially, your CD pipeline will automatically pause the deploy if any of those metrics show a significant regression. So that was a big win, because one, it reduces stress on our on-callers, and two, it actually reduced the number of incidents that could have just been prevented by someone going in and clicking the button at the right time.
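
For readers who want a concrete picture, here is a minimal Go sketch of the kind of automated canary gate being described: compare each monitored metric on the canary against a baseline and pause the deploy if any of them regress past a threshold. The metric names, thresholds, and structure are invented for illustration; this is not Pinterest’s actual Spinnaker setup.

```go
// Illustrative sketch only - not Pinterest's actual canary system.
package main

import "fmt"

// MetricCheck pairs a metric name with the maximum relative regression
// (e.g. 0.05 = 5%) the canary is allowed before deploys are paused.
type MetricCheck struct {
	Name      string
	Threshold float64
}

// regressed reports whether the canary value is worse than the baseline by
// more than the allowed threshold. "Worse" here means higher; a real system
// would also handle metrics where lower is worse.
func regressed(baseline, canary, threshold float64) bool {
	if baseline == 0 {
		return false
	}
	return (canary-baseline)/baseline > threshold
}

// evaluateCanary returns whether the deploy should be paused and which
// metrics triggered that decision.
func evaluateCanary(checks []MetricCheck, baseline, canary map[string]float64) (bool, []string) {
	var failed []string
	for _, c := range checks {
		if regressed(baseline[c.Name], canary[c.Name], c.Threshold) {
			failed = append(failed, c.Name)
		}
	}
	return len(failed) > 0, failed
}

func main() {
	checks := []MetricCheck{{"cpu_usage", 0.10}, {"p99_latency_ms", 0.05}}
	baseline := map[string]float64{"cpu_usage": 60, "p99_latency_ms": 80}
	canary := map[string]float64{"cpu_usage": 62, "p99_latency_ms": 95}

	if pause, why := evaluateCanary(checks, baseline, canary); pause {
		fmt.Println("pausing deploy; regressed metrics:", why)
	}
}
```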

Besides that, what we built in - you mentioned integration testing, and the fact that some people will sometimes run some local testing and push out their changes and wait for feedback in production… We realized that was actually one really big gap, because there was no great way of enforcing that people were actually running that local test… Because that essentially required the code reviewer to follow up and ensure that sufficient testing was performed. And again, like you said, depending on whether it’s a small change or a large change, the burden on the reviewer changes based on that as well.

So we wanted something more uniform, that applies to every single change, runs in a reasonable amount of time, doesn’t require developers to be sitting there for an hour while their tests run, and then gives us reasonable confidence that if this change goes out - we’re not saying that it’s definitely not going to cause an outage, but at least we’re not going to have a significant outage that’s going to bring down the service, bring down the site etc. So one thing we built is what we call a pre-submit test. That was actually one of my first projects when I joined the company. Essentially, every request that comes into Pinterest - or every request that comes into the ads system, rather - we log a certain sample of that to a Kafka topic. And then when we want to run this online integration test, essentially every time you put up a new PR, we package your changes, similar to how we would for a production build, and create an artifact that this pre-submit test framework then deploys for you to a couple of test hosts. So now that we have that logged traffic, we can essentially tail that Kafka log, get some number of requests, and send them to these hosts, simulating production traffic without actually affecting any users… That host then emits its own set of metrics, and we can essentially grab those metrics, compare them against a production cluster, or at least a different host that is running the latest version of your main branch… See if there’s any significant regression - and obviously, we’re loosening the thresholds for that, because we’re only running this test for anywhere from 3 to 10 minutes. And if there is a regression, then your commit is essentially blocked from landing until we resolve those discrepancies.
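
As a rough sketch of the replay half of that pre-submit idea, assuming a Kafka topic of sampled requests and using the segmentio/kafka-go client as one possible library; the broker, topic, and test-host URL below are invented, not Pinterest’s:

```go
// Rough sketch of the replay step; not Pinterest's actual framework.
package main

import (
	"bytes"
	"context"
	"log"
	"net/http"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers:   []string{"kafka-broker:9092"}, // hypothetical broker
		Topic:     "ads-sampled-requests",        // hypothetical topic of logged requests
		Partition: 0,
	})
	defer reader.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	const testHost = "http://presubmit-canary:8080/ads" // hypothetical PR-build host

	// Replay a fixed number of logged requests against the candidate build.
	for i := 0; i < 1000; i++ {
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			break // timeout or end of stream
		}
		resp, err := http.Post(testHost, "application/json", bytes.NewReader(msg.Value))
		if err != nil {
			log.Printf("replay request failed: %v", err)
			continue
		}
		resp.Body.Close()
	}
	// The framework would then scrape the test host's metrics and compare
	// them against a host running the latest main branch.
}
```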

We wanted to make sure, and I think we successfully achieved this, that this test is able to run in less than 10 to 15 minutes… Because one of the great parts of Go in our relatively smaller codebase at that point was that our entire repository, or entire service at least, was able to build an artifact in about, I want to say, 4 to 5 minutes. So if we suddenly introduced this test that ran for 30 minutes, developers would have hated us; our development cycle time would have just gone up almost 10x. So that was one of the things that we wanted to guarantee - that this test essentially ran in 10 minutes; you did not need to wait for more than 10 minutes for all your Jenkins builds and your unit tests to pass, and for this integration test to pass, before you were able to land your change.

[13:57] Those were the two key parts that came out of it: this pre-submit integration testing, which allowed us to actually define which metrics need to be monitored, and then secondly this automated canary analysis, which is, I think, an industry-wide practice at this point… But that really saved us a lot of headache of manual process, and reduced the stress on our on-callers. And on top of that, we’ve now seen a lot of teams actually adopting these frameworks and adding in their own metrics. So for instance, I mentioned earlier, when we had this sort of continuous deploy process in the early days, we monitored about 70 to 80 metrics. That number now - I haven’t looked at it lately, but it’s anywhere between 500 and 700 metrics, which obviously doesn’t all come from the infra team; it comes from a lot of product teams and machine learning teams being able to onboard their own metrics and have the confidence that these metrics are a) stable enough, b) actually protecting our systems.

Similarly with the pre-submit test themselves, when we first rolled it out, the infra team just configured 10 metrics, I think, which are mainly system performance-related. Now we’re up to about 90 metrics, added by all sorts of product and quality teams to ensure that their particular slice of the ads pie doesn’t go down. So for instance, if we’re – I mean, we could even do it to the granularity of checking the number of ads inserted for simulated dark users coming from Canada, or Japan, or something like that. We would obviously have to increase our sample size to get a meaningful number there, but this framework gives you that level of flexibility, which was widely adopted, and has really helped us a lot.

So it sounds like you’re kind of testing in production, pretty much…

Yes and no. So the ACA, the Automated Canary Analysis, is happening in production; our pre-submit testing is not happening in production. It’s essentially replicating production traffic and sending it to a couple of dark hosts, which aren’t serving any actual user traffic.

Okay, so you’re simulating what would have happened had it been hit by the actual traffic. So I’m assuming that traffic was basically captured from production - actual production environments - and then you just basically replay it against the canary.

That’s right. We essentially just log 1% of all production traffic to a Kafka topic, keep it around for about two or three days. And that’s constantly being refreshed, and we can constantly use that to replay and simulate what would happen, like you said, if that binary was being served to production users.
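
Here is a minimal sketch of that capture side, assuming an HTTP middleware that copies roughly 1% of request bodies to some sink - a Kafka producer in the setup described, just a logger here. All names, the port, and the sink interface are illustrative:

```go
// Illustrative 1% request sampling middleware; not Pinterest's actual code.
package main

import (
	"bytes"
	"io"
	"log"
	"math/rand"
	"net/http"
)

// requestSink is whatever stores sampled requests; in the setup described
// above it would be a Kafka producer, here it just logs the size.
type requestSink interface {
	Log(body []byte)
}

type logSink struct{}

func (logSink) Log(body []byte) { log.Printf("sampled request, %d bytes", len(body)) }

// sampleRequests wraps a handler and copies roughly `rate` of request
// bodies to the sink so they can be replayed against test hosts later.
func sampleRequests(next http.Handler, sink requestSink, rate float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if rand.Float64() < rate {
			if body, err := io.ReadAll(r.Body); err == nil {
				sink.Log(body)
				// Restore the body so the real handler can still read it.
				r.Body = io.NopCloser(bytes.NewReader(body))
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	ads := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.Handle("/ads", sampleRequests(ads, logSink{}, 0.01)) // ~1% sampling
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```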

So when people think about testing, I think a lot of the time the first thing that pops into their head is unit tests, and these smaller things that happen offline… Would you say that introducing all of this stuff has caused developers to focus less on that, because you have these basically production or production-simulated-type tests that are at least a lot more realistic-sounding? So would you say the unit tests get a little bit less focus then? Or how does that change the dynamic?

A little bit, yes. When we first rolled it out, especially in the first couple of years, we did see that. I mean, I personally saw a lot of PRs coming in where folks were just like “Hey, pre-submit tests passed. I’m not gonna write unit tests for this.” And to be honest, that was not the worst thing in the world for developer velocity, to some extent. Like, you’re guaranteed, or you have a fairly high rate of confidence that your change is safe, so why take that extra time to write unit tests, when you might be doing something, writing a new feature instead?

We did however – I’m trying to think of a few examples; I’ll come back to you if they come to me… But there were a few instances where things sort of did pass through our pre-submit test framework likely because they weren’t impacting something top-line, but something perhaps offline. For instance, if we didn’t have proper validation of the data that was being logged for - whether that’s offline analysis, or machine learning training jobs, or billing and reporting pipelines, all that stuff, those were things that may not necessarily be monitored by this framework, since at the start, like I said, it was mostly just monitoring ad insertions, and things like that. So we did realize that, while this is great, it’s not sufficient, and we do need at least some baseline level of test coverage, or at least local integration testing, to capture those things as well.

So seeing what you’re doing now and seeing how it works, at least from my perspective, this sounds like something where at scale this approach works very well, but if you were just starting up, from like a smaller business or something, this is one of those cases where if you tried to mimic what a big company was doing, it wouldn’t work at all, because you just don’t have the volume or scale… Or if you’re only getting 100 web requests a day, you can’t trust 1% of your production servers to actually give you any real information. So would you say that this is definitely one of those cases where as your company gets bigger, the approaches you can use sort of change and adapt based on your circumstances?

[20:06] Completely. Like you said, if you don’t have a large enough sample size, this sort of testing is not really going to give you meaningful results; it’s just going to be a coin flip at that point whether your test passes or fails. So having those more deterministic unit tests that actually test the app behavior on one single request is a lot more important at an early stage.

You haven’t mentioned what you’re using for your observability stack. So a lot of things are being captured - metrics, your teams are adding things based on features or products… Like, is this something home-grown, or are you using off-the-shelf software to provide that observability?

Yeah, so for general observability at Pinterest, metrics are essentially stored in an OpenTSDB backend, and we have an internal tool called Statsboard. I believe there’s a blog post about it; if not, I’m sure there will be at some point… It’s a super-great team, one of my favorite teams to work with. Statsboard is essentially very much like Prometheus - I haven’t used Prometheus a lot myself, but it’s a UI to visualize time-series metrics. It also gives you the ability to define alerts based on different thresholds etc., and that’s what we’re using for observability at a company-wide scale.

For pre-submit tests specifically we have a slightly more custom solution, because we didn’t want to go through that whole hassle. As I mentioned, every PR gets packaged into a deployable binary for testing and deployed to a single host, or just one or two hosts, so we needed a better way of essentially isolating metrics for those hosts, rather than needing to go through this UI and filtering for those two host names in particular to get the right set of metrics.

So for those hosts, we used essentially the expvar library from the Go standard library to expose all metrics through an HTTP endpoint, and essentially, we could then scrape that endpoint and get all the metrics that were generated on that host as one large JSON blob, essentially, parse those in our pre-submit test framework and use that for the actual metric regression analysis.
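
A small example of that approach using the standard library’s expvar package, which publishes registered variables (plus memstats) as one JSON blob at /debug/vars; the metric name and port here are illustrative, not Pinterest’s:

```go
// Minimal expvar example: publish a counter and expose it over HTTP so a
// test framework can scrape /debug/vars and compare metrics across hosts.
package main

import (
	"expvar"
	"log"
	"net/http"
)

var adsInserted = expvar.NewInt("ads_inserted") // illustrative metric name

func handleAds(w http.ResponseWriter, r *http.Request) {
	// ... serve the (simulated) ad request ...
	adsInserted.Add(1)
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/ads", handleAds)
	// Importing expvar registers /debug/vars on the default mux, so a test
	// harness can GET http://host:8080/debug/vars and decode the JSON to
	// read ads_inserted, memstats, and anything else published.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```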

Very cool. So how much of your stack is Go-based services?

Only the ads team uses Go heavily, and for online serving; so the ad delivery and ad logging systems are in Go. Pinterest broadly is more of a Python, Java and C++ based infrastructure. So our front-end, user-facing API is in Python, and a lot of our backends are in Java, unless there are low-latency, high-performance requirements, in which case they’re in C++.

The ads team - I don’t know the historical reason for this; this happened like seven, eight years ago, so like three years before I joined… But my understanding is it was a couple of things. I think we realized that the ads stack needed to be pretty low latency, but also required pretty high developer velocity… So essentially, my understanding is the team was debating between Java versus C++. Java had its latency concerns at that point; garbage collection wasn’t as advanced as it is today… C++ had developer velocity concerns at that point, and I think we had a couple of folks on the team at that point who felt very strongly about Go. And again, Jon, like you mentioned, things happen differently at an early-stage company versus when you’re later… So in 2013-2014 Pinterest was probably like four to five years old, I want to say. I don’t think we had as refined a process for choosing the frameworks and languages we develop in as we do now.

So that’s kind of how it happened, and we arrived at using Go… And it’s been pretty great. I think for the longest time we’ve hit the happy middleground between developer velocity and performance. And Johnny, when we met at the Baltimore Go meetup, we talked a little bit about the challenges we faced with Go efficiency, as well as on the garbage collection side… And more recently, I’ve been hearing through the Go Enterprise Advisory Board that this is becoming more of a known concern for large-scale companies.

Now, the Go team has put out a few flags recently to allow you to better tune the garbage collector. In Go 1.19, for instance, I think most recently they added the ability to tune your soft memory limit, as opposed to tuning just the GOGC value, which I think is a great way of allowing developers to essentially better control the impact of garbage collection on their systems. I don’t know if you guys have seen this, but I think Uber last year or so put out a blog post about how they dynamically tune the GOGC environment variable itself based on system metrics… Which worked really well for them, I think. If I remember correctly, they saved millions of dollars in cloud infra costs… But it’s a little hacky. It’s not the ideal way that we want to be managing our infra, so I’m glad that the Go team is always listening to those concerns and putting out new features to make it easier.
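
For reference, these are the two knobs being discussed, shown here set programmatically via runtime/debug; they can equally be set through the GOGC and GOMEMLIMIT environment variables. The values are examples only, not recommendations:

```go
// GC tuning knobs: the classic GOGC percentage, and the soft memory limit
// added in Go 1.19.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to GOGC=200: let the heap grow to 2x the live set between
	// collections, trading memory for fewer GC cycles.
	old := debug.SetGCPercent(200)

	// Equivalent to GOMEMLIMIT=4GiB (Go 1.19+): a soft ceiling the runtime
	// tries to respect, making GC more aggressive as the process nears it.
	debug.SetMemoryLimit(4 << 30)

	fmt.Println("previous GOGC:", old)
}
```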

[24:21] Right. Adding some official tooling for tweaking the garbage collection process.

Exactly.

The Java world is sort of notorious for all the flags and optimizations, and all the bells and whistles - all the buttons you have access to for GC tuning… So hopefully Go doesn’t necessitate that level of customizability. But again, as you say, the fact that the Go team is listening, especially for customers that have - basically, if you have to tweak GC settings, then you’re doing it at a scale that most people simply aren’t, right?

Right.

It’s not often where we need to basically tweak what Go does out of the box. So this may not be a concern for a lot of people, the vast majority of people, but for those that do need it, it’s good to have non-hacky ways of going about tweaking those things. So that’s pretty cool.

Exactly, yup. I spent about six months of my life, I think two years in, strictly just analyzing how GC was impacting our service, and how to improve it. I’ve written a couple of blog posts about that as well. It was really interesting, it was fascinating, but I know now with this new flag I could have saved at least four months of those six… So here we are.

It’s kind of nice that they took their time with it, at least a little bit, because when Go was first released… You have to imagine - at least in my mind - it’s young enough that they didn’t have enough customers using Go in high enough production environments that they really had enough data to decide what needs to be done, versus what necessarily doesn’t… At least that would be my perspective as an outsider. Maybe I’m wrong.

No, I agree with that.

So it’s nice they’ve taken their time and they’re trying to figure that out as people are using it at scale… But that’s something that in my mind would be very hard to have from the get-go, because it’s like trying to optimize a page when you have 10 users, for a million users. It’s like, I have no idea what my bottlenecks are actually going to be at that point.

Yup. And the good news is they did have sufficient tooling for you to understand your system usage, if needed. That’s something that came packaged in from – I guess I don’t know how early on, but things like pprof, and memstats, etc. that allow you to actually understand how your system is performing, analyze that data exactly how you need to understand the bottlenecks in your system.
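
A short example of that built-in tooling: exposing pprof’s HTTP endpoints and reading runtime.MemStats directly. Everything here is standard library; nothing Pinterest-specific.

```go
// Expose pprof profiles over HTTP and print a couple of MemStats fields.
package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
	"runtime"
)

func main() {
	go func() {
		// Profiles are then available via, e.g.:
		//   go tool pprof http://localhost:6060/debug/pprof/heap
		//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("heap in use: %d MiB, GC cycles: %d\n", m.HeapInuse>>20, m.NumGC)

	select {} // keep the process alive so the pprof endpoint stays up
}
```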

One thing I’d love to see, if the Go team is listening - if we can have an official guide on how to use these tools a little more; maybe some tutorials for people starting out who have never used flame graphs before, who don’t understand how profiling works, or how to best read it. I think that’d be a really great way to further the adoption of these tools, and make it much simpler for everyone to understand how to tune their systems.

To that point, there’s an interesting dichotomy, I think, within the Go community right now… Because we’ve had talks and blog posts and things written on performance tuning and optimization for Go in general, be it in terms of analyzing allocation, memory usage, how many threads you have running at any given time… Like, all that tooling exists. But we also tell developers “Well, don’t worry about premature optimization”, right? [laughs]

Right.

Which means like “Don’t really look under the hood that much, don’t run pprof, and don’t analyze what your application is doing, because that might be premature optimization”, right? “Only go looking for these tools if you suspect you have a problem with performance.” But at the same time, a lot of times you end up building things in a perhaps sub-optimal way, and that thing ends up receiving way more traffic than you anticipated. Or maybe you go through a growth period, kind of like Pinterest did, where once upon a time you were a small organization that was not public, and now you are public, and you’re getting a lot more traffic… Those things - you don’t go back and rewrite some of the things you wrote earlier on just because you now have more traffic, right?

[27:50] So you end up with this sort of legacy of built-in suboptimal performance - processes and tooling and services that are doing way more than they were designed to do. Now you kind of have to have an engineering effort to refactor things.

I’m curious if you’ve experienced this need to go back and change things and refactor things and make things faster, and obviously, how Go made that easier or harder?

Yeah, I think you’re right. I mean, it’s an impossible decision to make, right? Because on one hand, if you’re spending too much time on optimizing early on, you’re not going to make it as a startup, especially given this environment we’re heading into now… Speed is of the utmost importance. On the other hand, like you said, you end up with these legacy systems that don’t then support hypergrowth, or support new products.

I think one big thing that we’ve seen over the last probably five years to a decade or so is widespread adoption of video. So folks or companies who didn’t really optimize early on for, let’s say, content delivery for instance, suddenly you’re delivering video, which is obviously a lot more expensive, and people are having to reconsider how they built those systems, or in some cases even rebuilding those systems to operate at this new scale that customers now want.

For us, for my team in particular, the biggest scale challenge we were going through was the team growing. So like I said, my team specifically was ten people; now the entire ads infrastructure org is probably getting close to about 100, and the ads team is probably three to four times that number. So the question we were asking ourselves is, now that we’re going through this phase of growth, building a lot more products for a lot more users, for a lot more partners, how can we continue to keep that same level of velocity that our pinners are used to, our partners are used to, and we as engineers are used to, with this growing org, without causing more incidents or having people stepping on each other’s toes? And I think that is, to some extent, the eternal quest; we’re never going to really hit a perfect happy - I don’t know what the right word is, but middleground between all three of those factors. But that is something we’re constantly evaluating and tuning for - how do we build our systems in a way… Can we build better modularity, can we have more config-driven systems where folks can make changes without needing to understand how the rest of the system works?

For instance, for rolling out new experimental models, do I need to actually go and read exactly how those models are going to be chosen? Do I need to understand what features have been fed into my model? Do I need to understand how those scores have been used? Or can I just go in and make a simple JSON change or something, and say “For this subset of traffic, which may be country, or surface, or device, for iPhone users coming from Canada on the search page, I want you to use this model, with this percentage of traffic. And I don’t care how the rest of the system works.”
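
As a purely hypothetical illustration of that kind of config-driven rollout, a rule like the one described might look something like this in Go; the field names, JSON shape, and model name are all invented:

```go
// Hypothetical config-driven model selection: an experimenter edits a JSON
// rule, and serving code looks up the matching rule per request without the
// author needing to know how scoring works internally.
package main

import (
	"encoding/json"
	"fmt"
)

type ModelRule struct {
	Country        string  `json:"country"`
	Surface        string  `json:"surface"`
	Device         string  `json:"device"`
	Model          string  `json:"model"`
	TrafficPercent float64 `json:"traffic_percent"`
}

const rulesJSON = `[
  {"country": "CA", "surface": "search", "device": "iphone",
   "model": "ranking_model_v42", "traffic_percent": 5}
]`

func main() {
	var rules []ModelRule
	if err := json.Unmarshal([]byte(rulesJSON), &rules); err != nil {
		panic(err)
	}
	// The serving side only consults the rules; the experimenter only edits
	// the JSON above.
	fmt.Printf("%+v\n", rules[0])
}
```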

So those are the things that have been really on our minds constantly, and something we’re looking to continuously improve. In fact, the pre-submit test framework was one way that we did that - anyone can go and make a config change to add a new metric and add a new slice, without affecting how any of the rest of the system works.

Our actual ads delivery system isn’t fully there yet. We’re continuing to improve this for developers, making it easier for them to make changes, a) without bringing down the system, b) without blocking other folks or needing to understand how your change interacts with someone else’s. But like I said, it’s the eternal quest; as the product offerings from the sales side and the product teams get more and more diverse, we realize that some parts of our system just weren’t built with those sorts of products in mind.

Something as simple as if we need to serve - these are some of the problems that we’ve solved perhaps, but if we need to serve both video and image ads for the same request, do we have a good way of doing that? Not a particularly hard thing if you’re building a system from scratch, obviously, but once you’ve built it with a certain assumption, just these basic things sort of start to fall apart, and then you need to go and look into “Should I just be redesigning this whole thing, or is there a quick and easy way for me to get this off the ground, and then go in and redo it to unblock the next big thing?”

[31:48] And again, what ends up typically happening - I think this is probably true for most big companies - is you obviously need to get that MVP out, so you do something a little hacky to start out with, and then you retroactively go in and ensure that a) it’s not gonna break anything, and b) how do we make this a more pleasant experience for developers and product managers alike? So that’s the wheel that’s always spinning, and we’re trying to stay ahead of it, but we’re usually playing catch-up.

Johnny had asked you about existing services where you didn’t take all that stuff into consideration… Do you find yourself now, when you’re building something new, actually looking at things like flame graphs, and garbage collection, and trying to optimize those things upfront, now that you know you’re at the scale where that stuff actually matters?

Yes, and the reason it’s become a lot easier is because our amazing Pinterest-wide infra team has actually provided better tooling for us to do those things more automagically, out of the box. So rather than us needing to write a manual script to go run a profile on our Go system, we essentially have a central system now that we can enable profiling through. So I think those are the sort of investments, once again, that become a lot easier once the company hits a certain critical mass and we have a team that we can afford to invest in building tools like that. But till that point, I’d say no; until we had this company-wide tool, it was still up to us as the ads infra org, owning our own services, to decide when and where we needed to do this.

And honestly, a large part of it was driven in a voluntary manner; folks who were interested in thinking about performance, and saving on infra costs, saving on latency, would sort of just on their own time go and run these profiles and identify hotspots. Alternatively, we’d do it if we hit certain system limitations. So let’s say modeling team X wants to launch this new model that is suddenly a lot more expensive… Expensive in terms of either dollars, or milliseconds of system latency that could be affecting the end user experience. It’s only at that point that we would then go in and look at, “Okay, you want to add 15 milliseconds of latency to the end-to-end system. We only have 120 total. Can we save this somewhere else, or can we optimize your request in a way that brings that number down to minimize our impact?” But this is something that we’re now looking to be more proactive about.

Did you suffer from – like, during this growth phase, did you suffer from team silos? Basically, a team out of necessity having to create their own tooling, adopt their own ways of doing things, and then now trying to do this company-wide having to disentangle teams from the way they know how to do things, because they built things from scratch, or they bought something and have been using it? How challenging was that to sort of rise up to a common level of tooling for everybody?

Yeah, to some extent. So definitely for us being on the ads team, like I said, we were the only major online service in Go at that point, so we were definitely building our own tools, we owned our own deploy systems, our own testing systems etc. There has been a divergence there, in how different teams on ads vs. non-ads do things.

Within ads itself, I think one thing we saw is that because of the lack of that centralized framework from our side, from the ad-serving infra side, different teams sort of built their own products, even within the ads stack; they wrote their own code. And that caused some divergence in a few different ways. So the code quality is not always at the mark that we would like it to be, there’s not exactly the same level of test coverage everywhere, or things might just be done differently. So for something as simple as sending a batched request to a different service, or a data backend, or a model inference backend, that logic might be implemented differently by different teams, which then makes it really hard when those folks leave, when things break, to go in and understand exactly what broke and why. So one thing we’re really trying to do as the ads infra team now is standardize those frameworks and tools for at least all the different ads product teams to use.

So going back to batching as an example, my former manager essentially just took one day and wrote a library that made it really easy to run things in batches in Go - taking away the requirement, but not the ability, for engineers to understand how goroutines work, and to do so in a way that is thread-safe. Building that batching library essentially allowed them to use this interface and be guaranteed that there was proper panic recovery, there was essentially an easier way to enable locks if needed, and you didn’t need to write your own concurrency code yourself. Things like that have really helped standardize practices across the different verticals on ads itself, and made it much easier for us to maintain and grow the system going forward.
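
A simplified sketch of what such a batching helper could look like - callers pass items and a function, and the library handles the goroutines, locking, and panic recovery. This is an illustration of the idea, not Pinterest’s actual library:

```go
// Package batch is a toy batching helper: fan batches out across goroutines
// with panic recovery, so callers never touch goroutines or locks directly.
package batch

import (
	"fmt"
	"sync"
)

// Run splits items into batches of batchSize and calls fn on each batch
// concurrently. A panic in one batch is recovered and reported as an error
// instead of crashing the whole service.
func Run[T any](items []T, batchSize int, fn func(batch []T) error) []error {
	if batchSize <= 0 {
		batchSize = 1
	}
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}
		b := items[start:end]

		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() {
				if r := recover(); r != nil {
					mu.Lock()
					errs = append(errs, fmt.Errorf("batch panicked: %v", r))
					mu.Unlock()
				}
			}()
			if err := fn(b); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return errs
}
```

A product team would then just call something like batch.Run(requests, 50, sendToBackend) and never write concurrency code of its own.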

[36:13] How do you record and communicate infrastructure decisions within your teams, or even to broader teams, other teams that might learn from them?

That’s a great question. Overall, Pinterest now has a much better standardized documentation template for these large new services or new changes that are being rolled out… And all of that information essentially is saved permanently on our internal drives etc. For smaller things… So if my team were to go change how we do a certain thing in the ads stack, I think there’s a couple of different ways. There’s no good central place, I’d say, where you can go and find all the large infra decisions, most likely. What we do have is a Production Readiness forum, essentially, within ads itself. That involves some senior leadership from the ads team, help from the SRE org, and then representatives from the various infra teams, who will essentially evaluate your change. So you’re required to come prepared to this meeting with what you’re changing, what the stack of your change is, what are the critical metrics that you’re monitoring, what services you might be affecting, what failure scenarios you foresee, what the mitigation strategies are, etc. And then that one-hour Production Readiness meeting essentially becomes a forum where everyone can sort of test you, to some extent, and test your plan, to see whether your change is - at least do you have enough of a plan in mind to make sure your change is safe to go out in production. And then that becomes a way to make enough people aware of these large decisions. That’s sort of worked well for us so far.

The gap there is obviously in an environment where all those folks suddenly leave at a similar period of time. Hopefully, that doesn’t happen. It would require somebody to go in and read those Production Readiness documents again, to understand the reasoning behind everything.

For larger product launches, we have an email alias within the company where essentially every time you want to launch an ads experiment that is moving some significant top-line metrics, you need to send out an email to this alias before you get launch approval. And at least that serves to provide everyone in the monetization org some visibility into the changes that were made and why. Those emails are then further linked to very, very detailed documents - 50 to 100 pages long, in some cases - going through all the various steps of analysis that went into it. They’re usually about 15-20 pages, but I just saw one that was like 93. It’s mostly just graphs and tables, but still. All that knowledge is recorded somewhere.

I think what’s worthwhile is probably this behavior of just publicizing it to everyone once the decision is going out, so that even if the top 10% of the leadership team leaves, there’s probably at least enough people who know where to find the rationale behind that decision. So I assume that’s what’s worked pretty well.

What does an environment like this look like for a junior developer? If you’re looking to make an impact within this system, which on its face seems very intimidating if you’re not familiar with any such process - so maybe you’re just out of school, maybe you’ve only spent a couple of years in industry… And here you are, faced with this process. How do you shepherd a more junior member of the team through this? How do you go about that?

Yeah, so having been there myself, one thing that worked really well for me, and that I encourage a lot of folks to do, within Pinterest and other companies, is this: the biggest opportunity you have for understanding gaps in the system, whether that’s performance-related or especially stability-related, are incidents. So get involved when there’s an incident - you don’t actually have to do anything. You just see that something’s broken, you join the channel, or the Google Meet, or whatever it is where that issue is being discussed… You sit there quietly, listen to what people are saying… They’re going to talk about the system. After the incident is done, someone’s going to write up a very detailed description of how the system is expected to work, why it did not work that way, and what we’re going to do about it. That is one of the most valuable sources of knowledge, because it’s usually the system owner or an expert on the system doing that piece… And that shows you what the gaps are. And I think that benefits you in two ways - in terms of impact, and to some extent visibility. Certain issues reoccur. Even though we have remediation tasks to make sure that issues don’t reoccur, some things just happen over and over again.

About six months ago I saw an incident on the ads side that I remember happening four or five years ago, when I had been here only like six or seven months. When I saw some of those metrics, it just sort of jogged my memory - like, “Hey, this happened a long time ago, but it has happened before. Let me just go in and see what we did that time, and we can try and redo that and see if it works.”

So I’ve seen that work for several folks who are new to the company. It’s like “Hey, I saw this happen two weeks ago. Here’s what Nishant did to fix it that time. Let’s try this again.” And then suddenly, you’re the hero that saved the day. And even if that’s not necessarily what happens, as part of the follow-ups, taking on some of those remediation items is often something that is not glorious work necessarily, but can sometimes have really big payoffs. It’s like, I made this one change that prevented a malicious user request, or a bad user request, or any corrupt data from bringing the system down for the rest of time, potentially, right? And that teaches you a lot about understanding faults in the system, understanding how we analyze those faults in the system, and how to plug those gaps.

So I’d say that’s probably one of the easiest ways… If you have absolutely no knowledge of a system, just get into these rooms with people who are discussing everything about the system, and hypothesizing about all the reasons why something might break. It’s just one of the quickest ways to learn a lot about different parts of the company’s tech stack.

And if you have a suggestion, big or small, do you usually pair up with somebody more senior, who can help you through the vetting process?

Yeah, so for my team, and I think a lot of teams at Pinterest usually, everyone new especially is assigned an onboarding mentor, who’s typically the more senior person on the team. So that’s usually a good sounding board to start with, like “Hey, I have this idea. What do you think?”

For my team specifically, we have a bi-weekly team meeting where essentially for 20 minutes every two weeks everyone is encouraged to - there are other parts to that meeting, it’s an hour-long meeting, but these 20 minutes are like an open floor; just share anything that you’ve found interesting, any problem that you faced, or any question that you have. Like, “I saw this thing happen, and I have no idea how it works the way it works.”

One example for me early on was understanding how the RPC library in Go works, for instance. It’s a little bit of magic how those functions get registered… So bringing those questions up and giving people the opportunity to get answers to those questions, and just bringing up those suggestions and how they can improve systems, team processes etc. has worked really well for us to essentially get visibility and get answers.
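
For anyone curious about that “magic”, the standard library’s net/rpc shows the pattern: rpc.Register uses reflection to pick up every exported method with the right shape, so endpoints never have to be listed explicitly. Pinterest’s internal RPC stack may well differ; this is just the stdlib version.

```go
// Standard library net/rpc: methods of the form
//   func (t *T) Method(args ArgType, reply *ReplyType) error
// are discovered via reflection by rpc.Register.
package main

import (
	"log"
	"net"
	"net/http"
	"net/rpc"
)

type Args struct{ A, B int }

type Arith struct{}

// Multiply is exposed automatically because its signature matches.
func (t *Arith) Multiply(args Args, reply *int) error {
	*reply = args.A * args.B
	return nil
}

func main() {
	if err := rpc.Register(new(Arith)); err != nil { // reflection-based registration
		log.Fatal(err)
	}
	rpc.HandleHTTP() // serves RPCs on /_goRPC_ via the default mux

	l, err := net.Listen("tcp", ":1234")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.Serve(l, nil))
}
```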

Awesome. Jon, have you got anything before we switch it over to unpopular opinions? Because Nishant said he brought the heat, so I’m looking for it… [laughs]

Nothing else comes to mind. I’m ready to hear this unpopular opinion.

Nice, nice.

Alright, let’s hear it, Nishant.

This might actually be a really popular opinion for this pod… But my unpopular opinion is that working with non-typed languages, non-compiled languages can be a nightmare, especially as your codebase grows extremely large. I will say, shout-out to the Pinterest team. I think they’ve done a fantastic job of making our Pythonic API easy to work with, easy to read, great documentation, great testing… However, still, if I just want to go in and read one bit of the code, it’s really hard to understand exactly how it works, because I don’t know – for lack of typing, I don’t know what’s going into this function, unless things are really well named, which at this scale becomes hard to enforce… It becomes a little hard to test; you really need solid unit tests for everything, down to like input parameter validation, which is something you don’t need in a typed language. And I think it becomes really easy to like accidentally introduce some bugs.

I know for a fact that one of my friends who worked at Pinterest in the past introduced a bug where they accidentally reused a variable that was defined elsewhere - overwrote that variable’s name, because this file was like 900 lines long - and Python didn’t throw up any flags for using this variable, like Go would… And it essentially caused a - I’m not gonna go into the issues, but it caused an issue with the content users were seeing. And that’s something we didn’t realize till the change went out and was live for a few days. I mean, it was restricted to internal employees only, but it was something where I know Go would have saved us like three days of wasted time, essentially. So that is my unpopular opinion; industry-wide it may not be an unpopular opinion for folks listening to this pod, or with you two…
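
As a tiny illustration of the class of bug being described: Go’s compiler rejects unused variables outright, and the shadow analyzer from golang.org/x/tools can flag accidental redeclaration, whereas Python accepts both silently. The names here are made up for the example.

```go
package main

import "fmt"

func main() {
	userID := 42
	if true {
		userID := 7 // shadows the outer userID; the shadow analyzer from
		_ = userID  // golang.org/x/tools can flag this kind of accidental reuse
	}
	fmt.Println(userID) // still prints 42

	// unused := "oops" // uncommenting this line fails to compile:
	//                  // "declared and not used: unused"
}
```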

Good. Jon, what do you think?

I mean, I feel like what he’s saying has a lot of truth. What makes those languages great at small scale also makes them challenging at large scale. But I would assume most people getting into those things kind of understand those trade-offs… But I don’t know, maybe some people completely disagree.

My opinion is that there’s always going to be somebody who is of a different opinion… But I think opinions - and perhaps, well, this is not an unpopular opinion, but I think opinions should be allowed to change over time. Right? Once upon a time I was a die-hard statically-typed language person, then I discovered dynamic typing, and I was like “Oh man, Ruby’s the greatest thing since sliced bread. Python is awesome” etc. And then I swung back around, and now I’m a Go die-hard, because I know what my types are, and I know what to expect in places.

So I think it’s okay to swing back and forth, but I think the better approach, which I think is harder than most people give it credit for, is to have a measured approach to how you make decisions about things, to how you judge things to be good or bad. There’s no such thing as absolutely good, or absolutely bad. You kind of have to say, “Okay, well, what problem am I solving?”

So today I might use dynamic typing because I’m able to move a little faster, because maybe I’m building a prototype, maybe I’m building something that doesn’t need to be statically-typed for reason X, Y, and Z, and you know what the decision is… Which is why I love decision records, right? Including why you pick a framework, or why you pick a language, or what velocity you think you’re going to gain from having made these decisions, right? Because if you look back, three months, six months, six years from now, and look at, “Okay, this is why this decision or these decisions were made. This is why we picked Rust. This is why we picked Lua. This is why – whatever”, and you can explain, you can provide the context within which you were making that decision, the information you had at the time, then I think it’s perfectly fine to argue for and make the case for using anything you want to use, right? As long as you have the rationale for it, and your teammates at the time understand what your reasoning is, and they’re behind it.

[48:06] The disservice we do ourselves is by basically being fanatics of one thing. And I know sometimes I myself sound like a Go fanatic… Because I love Go. I do it every day. It’s my favorite language. Hopefully, I’ll be writing Go into my retirement. But I know that Go is not always the right tool for the job, so I have to check myself whenever I see somebody coming in with Ruby, or Java, or whatever it is, to not say, “Oh, we don’t want to use those languages, because they’re not Go.” That kind of ego I think should be left out of technology decision-making… Because technology is all about trade-offs, and you’re gonna experience trade-offs, whether it be a language, or a framework, or software you buy off the shelf, or software you build yourself, how you run your software, how you’re building infrastructure… Everything is a trade-off. So I think we have to be more gracious with our teammates who may like something different than us, and we have to be level-headed, and understand that there’s context for every decision made.

I mean, I can definitely agree with that… There’s always trade-offs with all that stuff. It’s also interesting - as Nishant was talking about it, I guess, I was also thinking about the fact that depending on when you join a project, I think it can sometimes skew your opinion of whether something’s more complicated. Like, if I’m working on a Ruby project that I’ve worked on my entire life, like 10 years or something, and I know really well, I don’t find it very complicated, even if it’s gotten very large… But that’s because I know it like the back of my hand at that point. But I could completely imagine cases where somebody new comes in, like “I want to make a change”, and then looks at it and is like “I have no idea what’s–” Like, it’s much trickier. Whereas in that case, I think a statically-typed language like Go is sometimes a lot easier to jump into with no context, no idea of what everything does, because it’s a lot more explicit… But I mean, I completely agree, Johnny, that generally speaking, as long as there’s good reasoning behind whatever you choose, it’s not that one’s better than the other. But I could definitely see the argument at this point being “Okay, we’re a large company. At this point, it seems like the statically-typed ones are more stable, they’re easier to bring new developers onto, so even if we think it’s gonna be a little quicker in Python, maybe it makes more sense to use the statically-typed language.”

Right. And it’s so dependent on the stage that you’re at… Because if tomorrow I was starting a company and needed to hire five folks, how easy or hard would it be for me to find five great Go developers, versus five great Python developers? It’s just night and day. I know I could find five Python developers probably in a week, versus Go - one, it’s actually hard to evaluate, too, right? Like, if I want to evaluate how well a person writes Go, it’s not as easy as one quick interview, versus Python is probably a little easier, because even if you’re not well-versed in everything, it’s going to be easier for you to get started and do 80% of your job, and then ramp up on some of the internals as needed.

The question then becomes “Is there ever a right stage for a company going from 10 engineers to 100 engineers to 1000 engineers - when and if do you decide to go from Python to Java, to Go, to C++, whatever that is? And is that a good idea? Is that a bad idea?” I don’t think there’s enough data that exists in the industry to truly answer that question.

There are a couple of pieces floating around about rewrites essentially not always being the best idea in the world, but I’ve seen cases where rewrites have actually paid off tremendously. I think Snap rewriting their Android app was one example, where it was a tremendous win for them. So it’s very dependent, you’re right. And Johnny, that was very well said. I do agree with you.

It kind of reminds me of the discussion around like microservices or anything like that, when people are like “Is it the right time to switch to them?” And you’ll see a million stories online about how it was successful, and other stories where it’s like “It was the biggest problem we’ve ever had because we’ve made that switch.” There isn’t enough data to really say “This is the way you should do it.”

But if you are starting a new company, I guarantee you can find 10 people who want to learn Go. They might not be experts in it.

That’s true.

That is true. That is true. Yeah, I think that is way more likely than to find actually fully-grown, so to speak, Go developers. That’s true. That’s true. Jon, did you bring an unpop?

I did not.

I’m too depressed with the snow outside… [laughter]

Maybe that is your unpop. There shouldn’t be snow at this time of year for you.

There really shouldn’t be snow at this time of the year. It’s not even Thanksgiving yet. Come on.

It’s a little early…

My wife now wants to get Christmas decorations out, there’s snow outside… It’s really hard to say no.

Right, right, right…

It’s basically Christmas once it starts snowing, right? That’s how it works?

Sort of… Honestly, the biggest reason why I say no right now is because I’m like “I’m too busy right now.” After Thanksgiving it usually dies down a little bit for me, so I’m like “Then I can help with Christmas stuff if you need.” I’m not trying to ruin the party.

This was a very nice episode. Thank you, Nishant, for coming and talking about the kinds of things you’re working on over at Pinterest, and the challenges the team has experienced, and your role in it… I wish you continued success and growth in your endeavor. It was awesome having you.

And Jon, thank you so much for being my co-host last-minute, always good having you. With that, we will say goodbye to the internet, and listener, we will catch you on the next one.

Our transcripts are open source on GitHub. Improvements are welcome. 💚
