The massive bug at the heart of npm featuring Darcy Clarke (JS Party #282)

Darcy Clarke, former GitHub Staff Engineering Manager and founder of vlt, joins us to discuss a major bug in the npm ecosystem that he recently disclosed. We cover the bug’s timeline, nuances, and impact, all while setting some important context on npm packages, clients, and registries. Tune in to learn how to protect your codebase and gain a deeper understanding of this crucial part of the JavaScript ecosystem.

63 minutes
Recorded Jun 30, 2023
Published Jul 7, 2023

Featuring

Darcy Clarke
Amal Hussein
Feross Aboukhadijeh

Chapters

Chapter Number	Chapter Start Time	Chapter Title	Chapter Duration
1	00:00	It's party time, y'all	00:40
2	00:40	Welcoming Darcy	02:16
3	02:56	A massive bug	02:08
4	05:04	Ecosystem overview	04:25
5	09:30	But why?	04:29
6	13:58	Verdaccio	02:48
7	16:46	Why is this so broken	10:52
8	27:38	Timeline of the bug	14:02
9	41:40	Blog post feedback	02:05
10	43:45	Why, GitHub, why?!	01:26
11	45:12	Sponsor: Changelog News	01:33
12	46:44	How do we dig ourselves out	06:30
13	53:14	What the early days were like	01:49
14	55:03	What's next for Darcy	02:22
15	57:25	vlt (Volt)	02:20
16	59:45	Closing time!	02:05
17	1:01:57	Next up on the pod	01:00

Transcript

Hello, JS Party listeners. It’s Amal Hussein here. We’re back with a super-important and very timely show. We’ve got a hot topic, and some just incredible guests, and incredible co-panelists to help me unpack this… So with me on the panel today is Feross. Hello, Feross.

Feross Aboukhadijeh

Hey, Amal. How’s it going? It’s good to be back on the show after a little while.

Yeah, yeah. Like I said before we recorded, I was like “You’re taking a break from saving the internet security to come record a podcast”, so we really appreciate you joining us, Feross… And we have a very, very special guest with us today, Darcy Clarke. Hello. Welcome, Darcy.

Thank you. Thank you for having me.

Yeah, we’re pumped. So Darcy and I spent some time working together when I was at npm; we had some overlap, he stayed on after the acquisition, and continued shepherding the npm CLI and the community ecosystem… Darcy’s just – yeah, a longtime contributor to the JavaScript ecosystem, has been really focused on developer tooling for many years… We’re just thrilled to have him on the show. But Darcy, I’m not doing your introduction any justice, so please, why don’t you tell us a little bit about yourself?

No, I think you covered most of it. I think I’ve been writing code for almost two decades, and doing something in the midst of product strategy or engineering work. And most recently, I was at GitHub for the last two and a half years, helping manage the npm CLI team. Before that I was part of npm, Inc., prior to that acquisition… And so really heads down and deep in package management, supply chain, and JavaScript’s ecosystem. Everything open source as well; our team supported over 100 projects, and roughly 3 billion installs a month, which is crazy. So yeah…

Yeah, that’s awesome. Thank you so much for sharing your background, Darcy. And honestly, you’re just also one of the most passionate human beings I know and I’ve ever worked with; you’re extremely passionate about the developer ecosystem, really passionate about community work… You did a ton of really cool stuff at npm, just starting to reengage with the community, whether it was kickstarting the RFC process, and all kinds of stuff… So I just wanted to say thank you for all your contributions.

What we’re here to talk to you about today is a pretty massive bug that you reported on June 27th of 2023, just a few days ago. You reported a pretty big vulnerability at the heart of the npm ecosystem. Can you tell us a little bit about what that is? We’re going to set a bunch of context, but can you just give us like a few words for what the problem is?

Yeah, so I’ve coined it manifest confusion. I’m not sure if that’s the best way to interpret it. Feross might have a different way we should be maybe considering it, but that’s the name of the bug. It’s just inconsistency between the metadata about packages and the actual contents of the tarball. So that’s the core issue. We can dive into that a bit more, but I would love to lay some context down around what exactly is the registry and the client itself.

Yeah. And to give some meat to the problem, Feross, can you tell us why this is bad?

Feross Aboukhadijeh

Sure. I mean, I’m sure Darcy will have plenty of thoughts too, but if I had to sum it up, I’d say it basically allows an attacker to hide install scripts or extra dependencies inside of a package. A lot of tools won’t show those hidden install scripts or dependencies, even though they’re going to get installed and they’re going to get run… So it gives an attacker really like a pretty powerful tool to hide some of the stuff they might be up to. I don’t know if I did it justice, Darcy, but that’s kind of the –

That’s perfect. Yeah.

Yeah. So yeah, so if you’re anything like me, listeners, and you’re thinking, “Wow, this sounds like a really bad problem, and I can’t believe it’s still live, and it’s a thing”, we’re gonna get into why this – we’re gonna get into the timelines, we’re gonna get into the impact, and all of the things that led to this… But we’re gonna first set some context for you all, so that you understand the architecture of a package, and also the npm ecosystem as a whole. There’s a client, and a registry; there’s multiple clients, and multiple registries in theory. So Darcy, can you just give us the overview of this ecosystem, which maybe is very abstracted for most folks outside of the tooling world?

So the npm ecosystem is actually based off – the architecture is based off of a sort of decentralized model… Although the npm registry is where we look to for the majority of the packages in our ecosystem. And what we find is that there’s metadata components, as well as the actual artifact host components of the tarball that’s being hosted in the npm registry.

And so there’s sort of two pieces to the puzzle here. Usually, there’s the registry, which is the server that’s hosting the packages, and there’s also usually clients which interact with that. You can imagine these are just like SDKs, or in our case package managers that interface with those API endpoints… But there’s also a whole bunch of proxy registries that have been spun up in the last decade. There’s private registry hosts like the Artifactories, or Nexus [unintelligible 00:06:06.24] products… There’s also even an open source proxy registry project called Verdaccio, which some people have used to help with testing, or sort of mounting workspace or monorepo projects so that they can test sort of staged packages as if they were in the registry.

[06:26] So there’s a few different options for people out there if they want to host their packages somewhere other than the npm registry, but most of these third party tools actually do upstream and connect to the canonical npm registry. So they copy a lot of the metadata, they copy a lot of the artifacts and store them themselves, almost like they’ve cached the requests from the canonical version. This is no different than if you were to download and install and save a local copy of those packages. Ideally, you’d save both the metadata, as well as the tarball… So that’s sort of a high-level overview of sort of what the architecture is, and sort of the players, I guess, in this space.

Okay. So a developer has some code in a repo, and they have a package.json, and they hit npm publish. What happens next?

Quite a few things, and this is where things go wrong; or things can go wrong. So depending on your package manager, there might be a few steps that are taken prior to actually making that API request. And so this is really where I sort of dug in and started to see some problems with how we were doing this sort of handshake with the registry. In the case of npm and publish, in the way that the client interacts with the registry, we actually run a few steps: a pre-pack, a pre-publish, scripts can be run, and we will basically package up and create a few different values. An integrity value of the packaged tarball, and we’ll also walk through – there’s a couple of steps in between this, but we will walk through the file contents and your configuration for what exactly you do want packaged, and then we’ll also run some scripts if you’ve defined them for the different sort of hooks that we have within that process. And just before we actually push to the public registry, we’ll extract the package.json, normalize it, and then finally publish that alongside the tarball. So this is where you can start to see there might be an issue here, in the fact that there’s actually a difference between the metadata that’s being published separately from the actual tarball.
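To make the publish step described above concrete, here is a minimal TypeScript sketch of a publish request that carries the manifest and the tarball as two separate pieces. The type and function names (`PublishRequest`, `buildPublishRequest`, `buildTamperedRequest`) and the simplified tarball stand-in are assumptions made for illustration; they are not the real npm registry API or the npm CLI's code.

```typescript
// Hypothetical shapes for illustration only; not the real npm registry API.
type PackageJson = {
  name: string;
  version: string;
  scripts?: Record<string, string>;
  dependencies?: Record<string, string>;
};

type PublishRequest = {
  manifest: PackageJson;                 // metadata sent alongside the tarball
  tarball: { packageJson: PackageJson }; // stand-in for the packed archive
};

// An honest client derives the manifest from the same package.json it packs.
// The "normalize" step is simplified here to a deep copy.
function buildPublishRequest(pkg: PackageJson): PublishRequest {
  const manifest: PackageJson = JSON.parse(JSON.stringify(pkg));
  return { manifest, tarball: { packageJson: pkg } };
}

// Nothing in the request shape forces the two copies to agree: a modified
// client can send a manifest that omits scripts and dependencies.
function buildTamperedRequest(pkg: PackageJson): PublishRequest {
  const request = buildPublishRequest(pkg);
  request.manifest = { name: pkg.name, version: pkg.version };
  return request;
}
```

An honest client produces two matching copies, but a client that edits `manifest` before sending still produces a request of exactly the same shape.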

So here’s our first point of inflection…

Yeah, yeah. This is where you start to – when I saw this and I thought about this a bit more, it really came to a head. I was really concerned with what was and wasn’t happening on the other side of this dance. So the client, for sure, was where I was living and breathing for the last four years. I wasn’t spending a ton of time on the registry side of this equation… And so I got to get very familiar very quickly with what we were and weren’t doing on the validation side, and in terms of making sure that the metadata that had been presented to us was being validated against what actually was in the tarball.

Feross Aboukhadijeh

Darcy, I’ve got a question though… So why do you think this decision was made? I mean, what would be the reason that the registry would want to have a copy of a file that’s already in the package, and duplicate that information? Just, maybe it would be helpful to go through kind of why that might have been done.

Totally. So actually, in my research I went way back. You talk about code spelunking sometimes, trying to look at the history of how something was created… And I went back basically over a decade to the very beginning, when Isaac Schlueter was working furiously on both the CLI and the registry, as well as being a very active champion for Node itself. So he had a lot going on at that time. And I looked at the first few iterations of the registry side of this, or the registry client, and I think we’ll have in the show notes a reference to that actual registry client. And I think the reason why this was done was actually for performance.

[10:29] Also, at that time, over a decade ago, the ecosystem was really small. There were only a few trusted people publishing anything to the registry… So I don’t think that the idea that maybe there could be inconsistency from the client made sense at all. We were at a much smaller place in our ecosystem; we were just in its infancy. I don’t think that this decision that was made to upload these two things independently of one another, or have them be two independent pieces of data - I don’t think that was thought of as bad architecture probably back then… Because you couldn’t even imagine interfacing with the registry not with one of these clients; not with the npm client. And I think what has changed over the last decade is that there’s more clients, there’s more use cases to hit the registry itself without using sort of this privileged way of interfacing with the npm backend.

Feross Aboukhadijeh

It also feels like maybe part of it is that certain information is useful to just have without having to go into the package itself and pull it out. If you’re trying to build a website, like npmjs.com maybe, having the high-level info, like the readme and the list of dependencies just there, in an easy to consume format might have been part of it… Instead of having to unzip or untar a whole bunch of packages to generate those pages. I don’t know, maybe that came later. I’m not sure if that’s historically correct, but that was the reason I always assumed that it was done that way.

That is a super-valuable reason, to have that kind of information live alongside the package itself, and to be able to share some metadata. Like you said, you could have done it on the server side, and extracted it… So that’s why I say it’s probably a performance reason that they did this; it was easy and available to the client that had access to that metadata, so why try to extract it again on the server side; why go through that extra step. But when you are crossing that boundary, ensuring that there’s consistency with the data is kind of crucial. So it seems like it was a fundamental flaw in how that was architected.
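The server-side consistency check discussed here could look roughly like the following sketch: take the `package.json` extracted from the uploaded tarball and compare a handful of fields against the manifest the client submitted. The field list, the `Manifest` type, and the `findMismatches` helper are all hypothetical; this is one possible shape for such a check, not a description of what the npm registry does.

```typescript
// Hypothetical field set; a real check would have to decide which fields matter.
type Manifest = {
  name: string;
  version: string;
  scripts?: Record<string, string>;
  dependencies?: Record<string, string>;
};

const CHECKED_FIELDS = ["name", "version", "scripts", "dependencies"] as const;

// Order-insensitive serialization, so {a: 1, b: 2} and {b: 2, a: 1} compare equal.
function canonical(value: unknown): string {
  if (value === undefined) return "undefined";
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => canonical(v)).join(",") + "]";
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonical(obj[k])).join(",") + "}";
}

// Returns the names of fields where the submitted manifest disagrees with
// the package.json found inside the tarball. An empty result means they agree.
function findMismatches(submitted: Manifest, fromTarball: Manifest): string[] {
  return CHECKED_FIELDS.filter(
    (field) => canonical(submitted[field]) !== canonical(fromTarball[field])
  );
}
```

A registry following this approach might reject a publish whenever `findMismatches` returns a non-empty list, which moves trust away from the client.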

Yeah, I mean, just relying on clients to do validation - it’s like one of the first things you learn when you’re a developer… It’s like “Don’t trust your client”, you know… So it’s interesting to see that we’re so trusting of our client to handle all this complex data validation. So just in terms of terminology - this is a manifest, right? This document that gets created. So there’s a manifest, there’s a tarball… So in terms of just the ecosystem, just to give you all examples - there’s things like Yarn, or pnpm, clients that connect in with whatever registry you point to… And there’s tools like Verdaccio… I was very close to saying Versace. Didn’t say it. I don’t even know if I said it correctly. Is it Verfaccio? Versaccio?

As far as I know, it’s Verdaccio. Yeah, you said it right.

Okay, thank you. Not Versace. [laughs] Yeah, could you tell us what tools like Versaccio are, Darcy? [laughter]

Feross Aboukhadijeh

Versaccio…

Did I say Versaccio? Sorry…

[13:55] That’s great brand association, I’m sure. So a tool like Verdaccio actually connects upstream, and caches. It acts as a registry proxy. And it helps also teams that are doing things with workspaces, or they just want to have their own instance of a private npm registry. It sort of has backwards-engineered the APIs of the npm registry, because that’s one key thing here as well with this discovery, is just that there’s just so much undocumented behavior and undocumented APIs with the npm registry, which unfortunately has been the case for quite a while, and we’ve as a community really had to leverage the clients themselves, and folks that have paved paths to playing with the registry to actually figure out how to interface with it… And that’s the one unfortunate thing, is I think this would have been caught a lot earlier if those APIs were better documented.

So I tip my hat to the team behind Verdaccio… They did a lot of work to sort of backwards-engineer and build a proxy registry implementation. A lot of teams use that to self-host and serve packages. But again, it’s very similar to JFrog’s Artifactory products, and there’s some new players in the space like Cloudsmith that are also helping teams, enterprises have private registry instances.

And so the interesting piece here is that you can configure your client to connect to these proxies, and then they will essentially hydrate their states and store any packages from the public registry you would be requesting. So they actually copy the metadata, or the manifest that we’re talking about, as well as the tarball, so that you don’t make two round trips every single time; they help you cache and store… They also provide insights, so similar to, I’m sure, what Socket is doing; there’s some insights that they add to these products that help you understand a bit more about your consumption. A lot of these products also help you write policies and enforce policies for the consumption of your upstream packages.

Yeah. All the things that, in theory, in a perfect world, should be built into the npm registry. [laughs] That’s the thing - when I worked at npm, that was so painful. It’s like “Man, there’s such a backlog of things that this thing should do, or could do”, and there’s a whole ecosystem of tools, very profitable companies that are built around all the gaps in the ecosystem. Even just some things like unpkg, like what unpkg did. That’s something npm should have very easily been able to just do, right? But it’s just, again, through a myriad of reasons, which we’ll get into very soon. Yeah, it’s like the little engine that could and should, but never did, you know.

So maybe this is a good time to segue into that… I really wanted to talk about mirroring, and all this… How do companies actually set up that infrastructure, large companies like Google, or Stripe, or whatever; people have mirrored registries… And so I’m going to put that nerdy quest on the side. So let’s get into why is this so broken? Why is this so bad? There’s just been a myriad of issues with the npm registry, and just whether things are not documented… There’s just been a whole series of things. And so what are your thoughts on this, Darcy? Because I have my own, so…

Yeah, I have many thoughts. I’m not sure – I must be careful with how I maybe bring up my thoughts.

Feross Aboukhadijeh

Phrase them.

Yeah, how I phrase them. But I care very deeply about the JavaScript ecosystem. I think everybody that’s come in and through npm really has had that same passion about building a great product. So it is unfortunate that we weren’t able to capitalize on maybe some of these opportunities, but it’s awesome that companies like Socket, and Feross are doing amazing work to sort of pick up where there’s these gaps.

[17:58] In terms of the investments that are being made today, I can’t speak to the intentions behind the Microsoft Engine, or GitHub, broadly… But I think that they’ve seen that there’s a lot of work to be done in the developer space, and sort of code creation. GitHub, if you look at the entire platform - it’s amazing at building collaboration tools, and building source code management tools… But I don’t know if they really understand distribution that well, or that sort of other side of the coin. Creation and consumption are sort of two sides of the same problem. I always looked at npm as sort of being the platform for distribution. And so I think that maybe there’s some more investments that need to be made, and more sort of cross-collaboration that can be done with the products that now GitHub owns, that would really benefit us as consumers.

But yeah, I don’t know exactly why there’s been so many issues… I think we’ve had some exponential growth, which I know you want to speak to… The JavaScript community is by far the majority of the repositories that you see on GitHub; it’s the largest ecosystem, or software index and registry in the world. So you saw the exponential growth. So we are the first ones to, I think, experience problems with supply chains, and the problems with how we orchestrate massive dependency graphs, and how you really try to figure out how to manage this web of trust… And so our ecosystem, I think, is going to be the one at the forefront of figuring out what good tooling looks like… And a lot of folks have come to me in the last few years and said “npm is actually a great package manager compared to what we have in this ecosystem or that ecosystem.” And so they’ve actually been very excited to see you.

It’s funny, because we always complain about our tools in our ecosystem, and everybody’s ready to jump onto something new… But we are actually in a very privileged position, just with how much we care about solving these problems and how we’re always willing to keep pushing, even though we’re on the edge there. So I would say it’s a unique position we’re in, and I’m not sure if that speaks to why we’re this way; maybe Feross has other thoughts on that.

Feross Aboukhadijeh

I mean, I just want to second what you just said about how npm is actually pretty great, especially when compared to what came before it… I mean, it solved dependency hell, in the sense that you can install any set of dependencies, and the package manager will never tell you “Hey, you have two different versions of the same package. I’m sorry, your whole project is messed up now and you can’t proceed.” It also made publishing really easy and welcoming, and that’s why we have such a huge community. I think it was brilliant to make publish be such a core part of the package manager and to get rid of the gatekeeping process, and just let the creativity go wild, and let everybody just put – I mean, even putting non-JavaScript code on npm. I think for a while Substack was putting C code on npm, and people just used it for all kinds of really amazing, innovative things. I mean, it hosts frontend code now, which was not really the intention originally. And now WASM – I mean, it’s really incredible.

I think as we get into all the kind of problems with npm, we shouldn’t forget actually how amazing it was, and how many good ideas were in there… And also just seconding your point about how JavaScript is – we’re the biggest ecosystem, so we’re gonna face all the problems first. That doesn’t mean that – you know, there’s a lot of people that like to jump on that and say “Oh, look how doomed the JavaScript ecosystem is” or “Look how bad it is because of this reason, or that reason.” And certainly, with the supply chain attacks that happened - they tend to come in the JavaScript ecosystem, and PyPI now a little bit more… And it’s not necessarily because there’s anything wrong with those languages or those communities; it’s really just the size. It’s like, why does Windows get all the malware? I mean, it’s definitely a part of the story that the haters, I guess, like to leave out. So yeah…

[22:14] Yeah, for sure. Haters are always gonna hate. But thanks for setting that context. I think for me there’s just a ton of – just at the rate that we’re able to innovate, I think that it’s just cultural, it’s just part of the JavaScript community’s DNA, as innovation is just core, and rapid innovation at that… It’s like “I can do this. Let’s just do this.” “Okay, it’s done. It’s published. Go use my idea.” That’s just core. For me, it’s just that the growth of the Node community is just for me a huge contributor to what we might see as feature gaps within the npm ecosystem. Just such a small team managing a huge hockey stick ride, just in the course of from 0 to 10 years, going from zero to serving billions of packages a month… I mean, that’s something that every single person that’s worked on the CLI or registry should be incredibly proud of.

So I think what we see as gaps are really just – for me, I see them as a result of just poor management, leadership, resources, resource constraints, lack of maybe strong community engagement early on. Having something like the Node steering committee handling the npm ecosystem in a more neutral way… There’s so many things. But yeah, so I think, before we get into some of these problems, just some context there and some empathy for the team… But the problems are real.

First of all, for me, the biggest shocker is that things aren’t validated. Package.json content doesn’t go through a validation. That’s just wild. You could say that my repo lives here, but it doesn’t actually have to live there. You could say your license is this, but it could be something else. Very little is required, beyond the name, and I don’t even know what else. So just stuff like that, and then the lack of documentation and the APIs – not lack of, it’s just no documentation of the registry APIs, really. That’s just like wild. People having to kind of – yeah…

Yeah, what I’ll say as well, in terms of like the people aspect of this equation… Definitely in terms of GitHub going through significant changes in the last four or five years, post acquisition by Microsoft, many great people have left, unfortunately, GitHub, and moved on to start new businesses. You sort of get that usually post acquisition. Same thing happened with the npm acquisition; quite a few folks that were a part of that organization are no longer there… And in fact, part of the timeline that I have for this issue has a critical point there where we see layoffs at GitHub, and even this past week you see that there’s been some instability in their platform.

So I think as you have that churn, as you have great people come in, but then leave, you start to get concerned maybe about operational excellence, you get concerned about what it looks like for the future of certain products… Are they critical to the long-term roadmaps of product leadership… So those are all question marks, I think, in my mind, long-term, for npm being at home at GitHub, unfortunately… But I am excited about what’s next for myself, for folks like Feross as well… And yeah, we can speak a little bit to that timeline, but also willing to jump to –

[25:49] Yeah, yeah. No, I appreciate that. And I guess the last thing I’ll say on this is that… I was part of npm, and was part of the group that was part of the layoffs… Maybe you are familiar with this or not, the acquisition was a little bit bloody. A lot of people were put either on contract, or laid off… So it was like – what was it? “You don’t have a job, or you’re not going to have a job soon?” That was the kind of stance…

Start a new [unintelligible 00:26:09.25]

Yeah. For the majority of folks. Fortunately, Darcy being the face of the community, was one of the few people that didn’t have to go through that. Really happy, because –

Oh, I interviewed…

Oh, you did? Okay…

And I think I told you this in private, but also - I took a pay cut as well. So if anybody’s out there thinking that it was a great acquisition for everybody involved - it wasn’t necessarily.

Yeah, it was rough. It was a bloodbath. And thanks for sharing that on air… But for me, the silver lining of this was “Whoa, at least npm is now going to be part of a large company, with a big set of resources, and support, and the infra to take in and rearchitect and reinvest…” I was like “Okay, that’s the silver lining here.” So let’s focus on the big picture; big picture is hopefully this is good for the ecosystem. And to find out, just through the grapevine, that no, all the same problems still exist, and actually they’re kind of worse now, because we have even less subject matter experts on staff, and the rotation of people… Like, how many times have they changed teams now? Three?

Roughly three, yeah.

Yeah. And now there’s a skeleton crew managing the entire ecosystem… Seven people, who are just – anyways, it’s very disappointing. So let’s get into this bug, and the timeline behind the discovery of this issue. So can you walk us through?

Yeah, I can walk through the timeline. I would love to also compare notes. I know Feross and the Socket team also had, it sounds like, some independent research done in this space… But what I’ll say is, it seems like this bug could also be coined a feature. It’s been around since the beginning of time, at the very beginning of the registry… So I just want to be mindful of the way that I talk about the timeline of discovery is through the lens of my discovery and research of this… So there’s probably other folks, and in fact, I’ve talked to many other folks that have said they’ve independently seen this issue, and just didn’t realize maybe the scope and impact that it has on security tools, insights, and what it could mean if bad actors took advantage of this.

So I just want to preface that anything that I say next in terms of timelines, that this was my own independent understanding of how I came to this. But yeah, on July 28th we actually saw the npm CLI team had an issue opened against it, a public issue by a user complaining about binding.gyp errors, and essentially install script errors, and binding.gyp sort of inconsistencies… And basically saying that there was or wasn’t inconsistency when I think the node-gyp script was being run, when they saw or didn’t see a binding.gyp file in your package.

So this bug - I think I’ve linked it here - was initially triaged by somebody on my team, Michael Garvin, on October 22nd.

And to be clear, this is 2022. I just want to make sure that we set a preface for the –

Yeah, sorry. Hopefully people are listening back and we haven’t solved this problem in a decade from now, but…

Oh, God… [laughs]

So about a few months later, because unfortunately we had such a backlog of issues with npm and the npm CLI team, we really went through it… But one of my team members, Michael Garvin, actually triaged the issue on October 22nd, and initially thought the person was sort of referencing just the fact that they can ignore scripts, or sort of turn on and off the ability to actually run lifecycle scripts, which are typical within the process of installing a package. We run a bunch of scripts very similar to the publish process, where we run a bunch of scripts that users can define and do certain things pre and post-publishing.

[30:20] And so he brought that actually up to the team, and we also all looked at it independently… And what I realized pretty quickly on my own was that this likely was an issue, a more broadly scoped issue; that there seemed to be problems with how we were caching, or essentially rehydrating the state of the metadata, the manifest that we were holding around in memory… So the problem seemed to be if you had flushed your cache, you didn’t hold on to that context, the metadata that you had gotten from the registry, and instead, we were hydrating it from the local package.json. Now, we thought those two things should be the same. And we were making the same mistake that I think a lot of people make, and we were all under the assumption that those two things would be the same. So that’s where I saw some inconsistency.
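The cache behavior described here can be sketched in a few lines of TypeScript. This is a deliberately simplified model, assuming a hypothetical `InstallState` with an optional registry-derived manifest and the `package.json` extracted from the tarball; it is not the npm CLI's actual code, only an illustration of why a flushed cache can change which metadata a client sees.

```typescript
type Metadata = {
  name: string;
  version: string;
  scripts?: Record<string, string>;
};

// Hypothetical in-memory stand-ins for the two sources of metadata.
type InstallState = {
  cachedManifest?: Metadata;  // what the registry said (gone after a cache flush)
  localPackageJson: Metadata; // what was actually inside the tarball
};

// Prefer the cached registry manifest; fall back to the local package.json
// when the cache has been flushed. If the two sources disagree, the result
// depends on cache state rather than on the package itself.
function hydrateMetadata(state: InstallState): Metadata {
  return state.cachedManifest ?? state.localPackageJson;
}
```

With a warm cache, a package whose manifest omits an `install` script appears script-free; after a flush the same package suddenly reports the script from its local `package.json`, which matches the kind of "sometimes it runs, sometimes it doesn't" report described above.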

So on November 2nd, so about a week later, I actually wrote the first POC, or proof of concept, and published it to the registry. So there’s a package in the registry called darcyclarke-testing-malformed, and you actually can see that that was published on November 4th, and that’s actually the same day – or sorry, November 2nd; and then a few days later after publishing that I internally wrote up a post about how I thought this might be a significant issue, we should probably look into this, and let the rest of GitHub know. Unfortunately, I decided to quit GitHub shortly after filing that internally. I left GitHub December 2nd, so that was my last day, for a number of reasons…

It was just because of this bug, really. That’s why. That was the primary driver. It was like “You know - I’m out. I’m out.” Mic drop. Boom. [laughs] Like, “This is too much.”

One bug too many. This was the one that broke –

This was it, yeah. It was issue number 6666. That’s it. Yeah. Anyways…

Yeah. So I had decided to leave, and there was a number of reasons why there… So I left in December, I took some time off in December, and then in the new year I began to wonder what had happened to that issue. I started to do some independent research again… There was actually a number of issues that were public on the npm CLI repo, but also public across all the package managers. This inconsistency was creeping up in weird ways. It is actually the cause for many bugs that people don’t even realize it’s the cause for…

The mismatch between the local package.json and the manifest – what’s actually stored in the manifest…

Totally.

Okay.

Yeah, the assumption really that we all have had is that if you save back down a tarball, and you actually extract it, and you put the contents onto your system somewhere, the idea is that you have almost all the information that you need about that package to use it and consume it. Unfortunately, this API, the way that it was built, now says that you actually need two things. You need both this manifest and this metadata, and the tarball, and you have to carry those around all the time. And it also calls into question just which one is the source of truth. So that’s why we can tell there’s a code smell here in terms of the architecture… Because really, you should be able to hydrate the state of most of that metadata by just reading from the local package.json. Like, what dependencies, what license, what scripts, what’s the name, what’s the version… Those are two critical pieces of information which actually can be falsified inside of a tarball.
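To make the "two sources of truth" problem concrete, here is a minimal sketch of comparing the fields Darcy lists. It assumes you have already fetched a version's registry manifest and extracted the package.json from its tarball into plain objects; the `PackageMeta` shape and the `diffManifest` and `canonical` names are hypothetical, not part of npm or any registry API.

```typescript
// Hypothetical shape covering only the fields mentioned in the episode.
// Real manifests and package.json files carry many more fields.
interface PackageMeta {
  name?: string;
  version?: string;
  license?: string;
  dependencies?: { [name: string]: string };
  scripts?: { [name: string]: string };
}

// Serialize a string map as sorted [key, value] pairs so that key order
// alone never counts as a mismatch.
function canonical(map: { [key: string]: string } | undefined): string {
  const source = map || {};
  return JSON.stringify(
    Object.keys(source)
      .sort()
      .map((key) => [key, source[key]])
  );
}

// Return the names of the fields where the registry manifest and the
// package.json inside the tarball disagree.
function diffManifest(registry: PackageMeta, tarball: PackageMeta): string[] {
  const mismatches: string[] = [];
  if (registry.name !== tarball.name) mismatches.push("name");
  if (registry.version !== tarball.version) mismatches.push("version");
  if (registry.license !== tarball.license) mismatches.push("license");
  if (canonical(registry.dependencies) !== canonical(tarball.dependencies)) {
    mismatches.push("dependencies");
  }
  if (canonical(registry.scripts) !== canonical(tarball.scripts)) {
    mismatches.push("scripts");
  }
  return mismatches;
}
```

An empty result means the two sources agree on these fields; any entry means a tool reading only one of them sees a different package than a tool reading the other.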

So hitting on sort of where I was at, roughly around March I decided to look into the problem again, and actually realized that the scope was a lot broader than I even initially thought. I realized all clients are basically affected by this. The third party tools we were talking about before, the proxy registries, like the Artifactories, and Nexus, and Verdaccios, who are all copying, and as you said, sort of mirroring the registry, are copying and cloning this inaccurate information.

[34:31] And so there’s really a whole bunch of caches out in the world that are hosting inconsistent data. And so it’s a really serious and significant issue. And if somebody, a bad actor finds this, they can sort of find a way, as Feross so elegantly put it, hide malicious scripts; hide known malware, known malware dependencies in a tarball, and not get flagged by security tools, not get flagged by advisory tools. And so this is the serious issue that I realized at that point.

So the timeline goes on - March 8th, I actually uploaded a new POC, which started to play with some of those values in terms of the scripts and the dependencies, as well as the name and version, to showcase that actually there was a whole series of issues here, including downgrade attacks, cache poisoning, and a number of other issues. And as of March 9th, I submitted a new HackerOne report to GitHub, to let them know about the scope and all my research. I’m not sure, Feross… I saw you shaking your head. Do you want to jump in there at some point?

Feross Aboukhadijeh

No, no. That all sounds good.

Oh, okay.

Feross Aboukhadijeh

I mean, I can just say from the Socket side, we independently fixed this bug back on September 5th, when we were refactoring some code… It just sort of came up like “Hmm… There’s a different set of information in the registry than there is in the package. What should we use?” And then we decided to go with what the CLI tools use, which is what’s in the package… And so we kind of just made that fix as part of a broader refactor that we were doing to our install process. So once I heard about this issue from you, I was glad that we were using the right data as far as it goes from a security perspective. But even we didn’t handle everything perfectly; our website actually was using the metadata from the registry… And so the website was showing as if we were unaware of these dependencies that were hidden in packages, even though our security analysis was actually using the correct manifest file under the hood.

So that just shows even – you know, we take security super-seriously, and it’s the whole point of our product, and even we used two different data sources. Fortunately, the website is not as important as the actual security analysis, but it just goes to show you how insidious this inconsistency is, and how basically every tool in the ecosystem has to deal with the differences now.

And I also just wanted to add too how grateful I am that you’re raising awareness about this issue. For those who don’t know, Darcy has been talking about this – not this specific issue, but this sort of general issue of different tools treating dependencies differently, to the point where… I think I saw in your most recent talk at the Open Source Summit in Vancouver this amazing slide, which shows when you go to install a particular dependency, just the number of dependencies that actually get installed varies so significantly between different package managers… To the point where you wonder, “Are we even running the same package?” I mean, hundreds of packages just get installed in npm that don’t get installed in yarn, or yarn installs a different number than pnpm installs… And it’s like “What is going on here? How does software even work when the numbers are so different?” That’s the feeling that you get from it. And it’s something that I’m glad you’re raising awareness about… Because we need to know what we’re talking about when we’re running software, like - what are we actually running? And there’s just so little understanding of what should get installed. I mean, there’s no standards around this stuff.

[38:17] So I think you’re raising it to the surface in many ways, not just this manifest confusion attack, but also just how the tooling needs to get all on the same page in a lot of ways. And also just honestly how sloppy some of the security tooling out there really is. I mean, people aren’t even – I mean, I think you mentioned in your post all the different tools that were affected by this, and it’s really just something to see, I think. Anyway, really grateful to you for surfacing these issues.

Yeah. Plus one. It’s not easy… I think for me what was really insightful about your timeline is that you submitted this report to GitHub - like, you did the thing, and what happened next?

Yeah, so I privately disclosed, I wanted to do the right thing and make sure that I could collaborate. I was really concerned about the scope… And companies like Socket - I wanted to reach out; I believe I did actually reach out and disclose with Feross actually back in December. So it was shortly after I had left GitHub, I actually reached out privately and also disclosed with Feross. Obviously, it sounds like they were already protected.

So I disclosed, it was March 9th, they left me hanging for a couple of weeks… And by March 21st they got back and closed the ticket, actually saying that they were going to handle it internally, which I was kind of disappointed about, because I was really hoping to collaborate on reaching out to, I think, the affected parties, to privately disclose, if possible… It just seemed like there might be some key players in the space that want to know about this… Or sort of like GitHub and npm be the first to sort of announce this.

So yeah, there was opportunity there for them to do that, and they didn’t; they decided to close it. And so about a week later after that happened, surprisingly enough - I’m not sure if it was a coincidence - but GitHub actually laid off their entire engineering team in India. And I don’t know if it’s well known or not, but the majority of the registry team had actually been moved to India at that point. So when you’re talking about the small team that exists now that’s supporting that infrastructure - there was a major layoff there for those folks.

So I got a bit concerned when I saw those layoffs; I got concerned about the timeline that I could expect them to really work on… As they said, they were going to work internally, and they weren’t going to provide updates to myself, through the HackerOne report, or anything. They obviously weren’t willing to collaborate. And especially given how much time I had invested in the research, and done significant work to figure out who I thought might be the key folks, it was super-disappointing.

So I waited roughly three months. I sat around for roughly three months on this before I finally decided to announce this past week, and write the blog post and share it with folks. The announcement that I was starting my new company, Vlt, happened on Monday. And then Tuesday, we published the article on our blog. Critically enough, it’s the first article on the blog; I thought it was that important that we get it out there, and I appreciate that folks like Feross helped to work with some media, and we got some buzz around this and trying to make this visible to the ecosystem so that they can protect themselves.

Yeah, thank you so much for that. And I know the blog post has made its rounds; it’s getting a lot of eyeballs. I’m curious to hear what feedback, what are most people – I’m sure they’re just shocked, like “Ha, really?” But what’s been the feedback on your blog post?

[41:57] So there were some follow-on blog posts; I know Feross and his team had a follow-up one as well to provide a bit more context and clarity about the work that they had done. The feedback from my side was just “Wow. Oh my gosh, another issue.” A lot of people, I’m sure, have a bit of fatigue at this point about the number of issues that unfortunately our community has faced and the registry has faced, and the CLI has faced. I’m sure some folks are tuning out, just the noise… And I hope they don’t do that. I think this is one of those critical, really, really critical issues that fundamentally is going to be tough to fix. I actually have a lot of empathy for npm and GitHub in this space. I hope that came across in the blog post. There’s a section in there where I say “This is not going to be an easy thing to fix, just because the validation has not been in place for over a decade.”

So we have a lot of packages to go through, and double check their name and version on every single one of them, because those two pieces of information can actually be, like I said, falsified or changed, and yet the rest of that information in that package.json inside the tarball should be probably the canonical source of truth. I say probably because I’m not the person to enforce it until I publish my new product, so…

So, of course there’s so many things that come to mind here, one being it’s so disappointing to see such a large company, with so many resources, just kind of really mishandle the reporting of this, and just even taking ownership and trying to resolve it and address it. And so any thoughts on this? I mean, I kind of want to shake GitHub and be like “Why?!” But obviously, I can’t do that. They have tons of money, they could prioritize this work, but they don’t. And if you don’t want to take care of the ecosystem, the socialist in me is like “It should be a foundation, or it should be centrally managed”, or whatever else. But this is just – I don’t know. For me, this is unacceptable.

Yeah. It’s definitely tough to think about a future without some canonical or sort of centralized registry, unfortunately… Because the costs are pretty significant, maintenance is pretty high… So again, I do have a lot of empathy for the folks that do maintain the infrastructure… But you’re right, in October GitHub announced they had hit 1 billion ARR. They are making 1 billion USD in revenue, and the investments that they make should be telling to the ecosystem about what they do and don’t prioritize… And it’s unfortunate that there’s a lot of these things that go unprioritized, or they’re slow to react to. Again, as far as I know, it’s more than six months they’ve known about this, so…

Break: [45:13]

Alright, well, we’re gonna have to move this into a positive direction, which is – how do we dig ourselves out of this hole, Darcy? What’s being done to resolve this? And Feross, obviously, you’re resident expert on supply chain and security, and all that jazz… So I’m very curious to hear from you two, how do we get out of this? Because for me one answer is specs; ECMAScript is the standard, lots of different engines developed to that same standard, so there’s room for all these tools in our ecosystem… Let’s just have a spec that we are all developing up against. That’s one. But I don’t think that’s going to solve everything, so… What do you all think?

Feross Aboukhadijeh

Well, I’m just curious what Darcy thinks about – I mean, what would break if GitHub just did the obvious thing, which is to validate that the metadata in the registry is the same as the package? I mean, do you know of anything that would actually break? Isn’t that just like the most straightforward solution here, to just make sure that that reflects the actual contents of the package, and then we can just move on from this? Why isn’t that the easy solution here? I’d love to understand what you think.

Sure. So as with most easy questions, there’s a complex answer, or a nuanced answer… [laughs] Unfortunately, I don’t have an easy answer for you. After spending four years, or almost four years managing the world’s largest package manager, really supporting the JavaScript community, you learn that even the smallest fix affects someone in a big way. Or it might affect somebody’s production system. We learned very quickly, my team, or myself when I onboarded to npm, that even making a bug fix might break someone, because they were relying on the bug. Right? So we learned very quickly that actually the ecosystem has learned to grow around the tooling, and they actually have started to rely on the bugs in the infrastructure and in the tooling.

So I’ve talked with some engineering folks - I won’t name names, but large engineering organizations that are actually very concerned about npm fixing this, because it might break them in nuanced, weird ways, because they might be relying on the inconsistent behavior. So that’s one concern.

So I think that to do the easy thing still requires a lot of communication, maybe a lot of leeway for organizations to get prepared for that… And it’s not necessarily that easy. Like you said, the path forward should be to ensure that maybe the tarball’s package.json is the canonical source of truth for that metadata that we actually rely on at the API level, and should be validated or extracted out of that tarball, and you should almost not even be required to pass a manifest, because there’s no point, really. I think that is the path forward that they’ll take eventually, that npm probably should take… But there’s a bunch of discovery work that needs to happen. And maybe your team could do this, to basically validate the existing 3 million plus packages to see whether or not the contents of that package, specifically the name and version, are not aligned with the name and version that that package was published under.

[50:05] So that’s one of the key issues that I would say is at play here, is that you could actually falsify the name and version in your package.json in a tarball today, and that does not get validated by the registry. And there’s some weird – the package managers themselves, the clients, as we were talking about, handle those use cases in very weird ways. So you can essentially co-opt or steal a package name, and do some interesting things in certain situations. So that’s one of those nuances of this problem that I think needs to be handled carefully… So I don’t think it’s a straightforward change that they’ll have to make.
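The publish-time validation discussed here can be sketched as follows. This is only an illustration of the idea (check the tarball's name and version against what the package is being published as, then derive the manifest from the tarball instead of trusting a separately supplied one); it is not how the npm registry is implemented, and every type and function name below is hypothetical.

```typescript
// Hypothetical: the package.json extracted from the uploaded tarball.
interface TarballPackageJson {
  name: string;
  version: string;
  license?: string;
  dependencies?: { [name: string]: string };
  scripts?: { [name: string]: string };
}

// Hypothetical: the name and version the client asked to publish under.
interface PublishTarget {
  name: string;
  version: string;
}

// Hypothetical: metadata the registry would serve, built from the tarball.
interface DerivedManifest {
  name: string;
  version: string;
  license: string | null;
  dependencies: { [name: string]: string };
  scripts: { [name: string]: string };
}

type PublishCheck =
  | { status: "accepted"; manifest: DerivedManifest }
  | { status: "rejected"; reason: string };

function checkPublish(target: PublishTarget, pkg: TarballPackageJson): PublishCheck {
  // Reject when the tarball claims to be a different package or version.
  if (pkg.name !== target.name) {
    return { status: "rejected", reason: "name mismatch: " + pkg.name + " vs " + target.name };
  }
  if (pkg.version !== target.version) {
    return { status: "rejected", reason: "version mismatch: " + pkg.version + " vs " + target.version };
  }
  // Derive the served metadata from the tarball so there is one source of truth.
  return {
    status: "accepted",
    manifest: {
      name: pkg.name,
      version: pkg.version,
      license: pkg.license === undefined ? null : pkg.license,
      dependencies: pkg.dependencies || {},
      scripts: pkg.scripts || {},
    },
  };
}
```

As Darcy notes, turning on a check like this for new publishes is the simpler part; the existing packages would still need to be audited, and some consumers may depend on the current behavior.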

And in terms of where we go going forward, I’m with you, I’m all for standards, Amal. To speak into this a bit, I care deeply about how we interpret your dependency graph, and I think that there’s an opportunity for us to get on the same page. I liken it to before there was standards in the DOM, and rendering your HTML markup, and browsers were all interpreting your HTML a bit differently… And before HTML5 there was no standard way of handling all the edge cases of broken markup… And I think that working with foundations, which I’m really excited to do with my new company - I work very closely with the OpenJS Foundation, the Linux Foundation, folks like the OpenSSF as well, that I know Feross is close with… Working with those organizations to standardize how we interpret dependency graphs I think is going to be very important. And that starts with standardizing some critical pieces of what a package is; like the semantic versioning spec I think needs some standard semantic versioning, or semver APIs would be great, a great first step in the runtimes… And then moving on to what exactly is a package specification? What’s that look like? It’d be great if Bun and Deno, the latest and greatest sort of quasi-package managers also participated in this, because as we grow the ecosystem, we start to see more and more implementations and nuances in your dependency graph, and it’s really hard to build security tools, and give everybody peace of mind that when they actually go to install a project that’s consistent, and you have consistency across your dependencies.

Yeah, absolutely. The HTML analogy was spot on; that’s exactly it. And I think for me, that is a result of just the lack of community herding early on, and also just all of these different projects springing up and gaining different levels of traction within the community… So it’s really hard, I mean, even just like yarn becoming a thing. I don’t know what the backstory is, but I’m curious before they had released yarn, publicly, outside of Facebook - it was Facebook then - what was the engagement like? Because I know that you really helped bring community engagement along, and I got to see that as part of your leadership, bringing all these different stakeholders to the table and having a conversation… But I’m curious just what was it like early days, just bringing these different folks to the table to talk about getting on the same page?

I think yarn was launched in 2016, I believe… So I don’t quite remember, and that was before my time at npm. I don’t quite remember what the engagement was like with the community. But I do know that actually npm’s RFC process mimicked yarn’s a bit. Yarn had an RFC process, and that was also I think sort of copied from I think the IETF… Right? Is that the right standards body? There’s another standards body that has a really good RFC program, and sort of – and so that way of engaging with the community, through open discourse and through live streams, and having opportunities for people to engage in different ways with your open source project I think was really critical to hearing what people actually wanted from us… And I was really proud of what we had done at npm over the last three and a half, four years with the RFC program… We had over 100 hours of livestreamed videos and livestreamed meetings on YouTube there.

[54:27] And so we, I think, did more in that last little while to try to fix and correct that relationship with the community, and try to mend some broken trust… But obviously, npm hasn’t gone far enough. So I’m really excited about what I can do with my new company, and where we can go with that, and the kind of relationships we’ll be building hopefully are strong ones, with partners like Feross maybe, or open source folks that are looking to make sure we build for the next 5 to 10, the next few decades, look to build a good foundation.

Yeah, absolutely. It’s a really good segue to say - so what’s next for you, Darcy, as we’re kind of wrapping up this conversation? I mean, this was a deep one… Obviously, it’s hard to walk away from this conversation not really even having a clear solution and like a clear resolution on “Next steps. What are action items?” That’s what I want to do. Let’s get some bullet points going, and delegate some tasks… [laughs]

Well, I will say, there are some folks in the community… I actually saw someone had written a package validator package… I think I saw it today, or just yesterday. So I’ll definitely share that link with folks that are interested, just to see if you’re affected or if you’re interested in checking and validating the contents of a package that you consume, and want to make sure you’re safe.

Obviously, my recommendation also is to go to Socket; their package pages now highlight the inconsistency… And they actually, I think, have introduced a net new type of issue in their platform that’s called Manifest Confusion. So my former link to my POC package page now has a bright red warning on it, which is great to see… So if you’re trying to secure yourself, you’re looking for next steps, I think Feross’ company’s doing a great job there, and was really on the ball when I announced this week. I’m not sure, Feross, do you know any other good next steps for folks if they’re trying to protect themselves?

Feross Aboukhadijeh

I mean, from this specific attack, I think the two things you mentioned are it. I mean, there’s that tool that somebody released, just to look for Manifest Confusion specifically… And then, you’re absolutely right, Socket can detect the Manifest Confusion issue now in any dependencies.

Other than that – I mean, we can also recommend probably that people should… If they’re using another type of tool for their dependency security, they should probably ask that vendor what they’re doing about this issue, and just make sure that the vendor isn’t using the registry metadata, and try putting Darcy’s package into that tool, and just see what comes out. Does it actually catch the hidden dependency? Does it actually catch the install script? I mean, you can test your tools out. So if you think your tooling is protecting you, you can just check to see if it handles Darcy’s test package. So I think we should link that in the show notes as well for people.
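The two questions Feross poses ("Does it actually catch the hidden dependency? Does it actually catch the install script?") can be sketched as a small check. It assumes you already hold a version's registry manifest and the package.json extracted from its tarball as plain objects; the `PackageLike` shape and `findHidden` name are hypothetical, and this is not how Socket or any other vendor implements detection.

```typescript
// Hypothetical shape: only the two fields this check looks at.
interface PackageLike {
  dependencies?: { [name: string]: string };
  scripts?: { [name: string]: string };
}

interface HiddenFindings {
  hiddenDependencies: string[];
  hiddenInstallScripts: string[];
}

// Lifecycle scripts that npm runs automatically when a package is installed.
const INSTALL_SCRIPTS: string[] = ["preinstall", "install", "postinstall"];

// Report dependencies and install scripts that exist in the tarball's
// package.json but are missing from, or different in, the registry manifest.
function findHidden(registryManifest: PackageLike, tarballPkg: PackageLike): HiddenFindings {
  const listedDeps = registryManifest.dependencies || {};
  const actualDeps = tarballPkg.dependencies || {};
  const listedScripts = registryManifest.scripts || {};
  const actualScripts = tarballPkg.scripts || {};

  const hiddenDependencies = Object.keys(actualDeps)
    .filter((name) => listedDeps[name] !== actualDeps[name])
    .sort();

  const hiddenInstallScripts = INSTALL_SCRIPTS.filter(
    (name) => actualScripts[name] !== undefined && listedScripts[name] !== actualScripts[name]
  );

  return { hiddenDependencies: hiddenDependencies, hiddenInstallScripts: hiddenInstallScripts };
}
```

A tool that only reads registry metadata would report nothing for a package where either list is non-empty, which is the gap the test package is meant to expose.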

Yeah, all this will be definitely linked in the notes. Yeah, and thanks for that great summary, Feross. Very helpful. And so Darcy, we’ve talked a little bit about Vlt, in passing, you’ve mentioned it… Can you just tell folks what is Vlt?

Vlt - yeah, I apologize for the confusion, preemptively. I know there’s other tools in the ecosystem; people have already started to give me a hard time about the name. But Vlt is a net new package manager, a JavaScript package manager, as well as a new registry. So we’re going to be focused on competing with npm, but also ensuring that we don’t bifurcate the ecosystem; we’re definitely going to have the same capabilities that some of these other registry proxies have, and be able to upstream and make sure we bring the existing ecosystem along for the ride.

[58:20] So I’m gonna hopefully provide a ton of extra value, and this gives us some greenfield space in terms of what we can do with the package manager… And the hope is to also help with the standards efforts here as we go forward. So I’m really excited to get started. You can go to vlt.sh, sign up to be one of the first folks in our beta, and I would love to come back on and share more when we’re fully launched.

I also will be speaking at a conference a little bit more about it next month. I think Feross may be there… It’s a conference here in Toronto called RefactorDX, so focused on developer experience and developer tools. I believe we both have keynotes there, so that’ll be fun; or we both are speaking there… So I’m excited to share more at that conference. And I know that our good friend, Achmed – I’m not sure how you folks say it…

Achmed – you folks? Oh, man… That’s burn, BURN! That’s okay, that’s fine. Yes.

Oh, I meant just both of you. I’m not sure how you pronounce his name.

Oh, okay. Yes, no, I’m just kidding. I’m giving you a hard time.

He’s a good friend, former CTO of npm and advisor to my company.

Yeah, the person who also pressured me to speak at – well, not pressured… But at Refactor Conf, and unfortunately could not make it this year… But who knows? Miracles happen, so…

Hopefully you’ll be there in some way, shape or form…

In some way, yes. But yeah, so thank you so much for sharing all that, Darcy. Thank you for all the hard work that you contribute to this community… Everything we do is built on community in JavaScript. It’s like the core, it’s like the bedrock foundation of how we collaborate, create, distribute etc. And so just thank you for being a lighthouse among us. And I really appreciate you taking the time to come on the show and talk about this.

For those of you listening, the show notes are going to be packed with links, and if you have not given Darcy’s blog post a read, please do. It’s solid. It’ll also help you better understand all the different things that happen in that little black box called node_modules.

So with that said, we’ll close this out. Thank you again, Darcy. Where can folks connect with you and find you on the internet?

Sure. You can follow me on Twitter, just @Darcy, my first name, which is great. I’m up there with Jason Calacanis and other folks who have first name user accounts, which is great…

And Feross.

And Feross, yeah.

Oh, yeah. You guys are –

Feross Aboukhadijeh

Darcy is much more impressive of a name to get on Twitter than Feross, I’d have to say; there’s not that many Feross’es.

Yeah, you should just change your Twitter handle to @Darcy, not to be confused with Mark Darcy… Like, the famous Mark Darcy… [laughs]

Oh, yeah… I get a lot of Mr. Darcy… So there’s a lot of Jane Austen fans that love to hit me up on Twitter. So… Very interesting.

Mr. Darcy… Oh, yeah. I bet.

On GitHub I’m DarcyClarke, my full name… And if you’re looking for a very old website, Darcyclarke.me is my personal site. And of course, vlt.sh is my new company’s website. So you can check me out there.

Hope to have you on the show once you’ve launched and you’re further in your product journey. It’s very exciting. Alright, kids, so with that said, we’ve wrapped another show. We will catch you next week. Thank you all for listening. Have an amazing day.
