On today’s show Nadia and Mikeal are joined by Andrew Nesbitt and Arfon Smith to talk about open source metrics, and how to interpret data around dependencies and usage. They talked about what we currently can, and can not measure in today’s open source ecosystem. They also talked about individual project metrics, how we can measure success, what maintainers should be paying attention to, and whether or not GitHub stars really matter.
Andrew Nesbitt is the creator of Libraries.io, and Arfon Smith heads up open source data at GitHub. Andrew’s project, Libraries.io, helps people discover and track open source libraries, which was informed by his work on GitHub Explore. Arfon works to make GitHub data more accessible to the public. Previously, he worked on science initiatives at GitHub and elsewhere, including a popular citizen science platform called Zooniverse.
I’m Nadia Eghbal…
And I’m Mikeal Rogers.
On today’s show, Mikeal and I talked with Andrew Nesbitt, creator of Libraries.io, and Arfon Smith, who heads up open source data at GitHub. Andrew’s project, Libraries.io, helps people discover and track open source libraries, which was informed by his work on GitHub Explore. Arfon works to make GitHub data more accessible to the public. Previously, he worked on science initiatives at GitHub and elsewhere, including a popular citizen science platform called Zooniverse.
Our focus on today’s episode with Andrew and Arfon was around open source metrics and how to interpret data around dependencies and usage. We talked about what we currently can and cannot measure in today’s open source ecosystem.
We also got into individual project metrics. We talked with Andrew and Arfon about how we can measure success, what maintainers should be paying attention to and whether stars really matter.
Andrew, I’ll start with you. What made you wanna build Libraries.io? How was that informed by your GitHub Explore experiences, if at all?
I got a little bit frustrated working at GitHub on the Explore stuff. It was kind of deprioritized whilst I was there, and my approach with Libraries, rather than just build the same thing again outside of GitHub, was to use a different data source, which started at the package management level, and it turns out that’s actually a really good source of metric data, especially when you start looking into dependencies. If I had taken the approach of, “Let me look at GitHub repositories”, I would have gone down a very different path, I think.
Right. So tell me a little bit about that. So you pull out the whole dependency graph data - do you go into the kind of deep dependencies, or do you sort of stay at more of a top layer of first-order dependency data?
So for each project, it only pulls out the direct dependency. But as it picks up every project, because every time it finds anything that depends on anything else, it will go investigate that as well. It ends up having the full dependency tree, but right now I don’t have it stored in a way that makes it very easy to query in a transitive way, if that makes sense. I’ve been looking into putting the whole dataset into Neo4j - a graph database - to be able to do that easy transitive query, and to be able to give you the whole picture of any one library’s dependencies and their transitive dependencies, but it’s not quite at that point. But I do have all the data to be able to do it.
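The idea of storing only direct dependencies and deriving the full tree from them can be sketched as a graph traversal. This is a minimal illustration, not how Libraries.io actually stores or queries its data; the package names and the `DIRECT_DEPS` mapping are hypothetical.

```python
from collections import deque

# Hypothetical data: package name -> packages it depends on directly.
DIRECT_DEPS = {
    "app": ["web-framework", "json-parser"],
    "web-framework": ["http-client", "json-parser"],
    "http-client": ["url-utils"],
    "json-parser": [],
    "url-utils": [],
}


def transitive_dependencies(package, direct_deps):
    """Return every package reachable from `package` via direct dependencies."""
    seen = set()
    queue = deque(direct_deps.get(package, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against dependency cycles
        seen.add(current)
        queue.extend(direct_deps.get(current, []))
    return seen


print(sorted(transitive_dependencies("app", DIRECT_DEPS)))
```

A graph database makes this kind of query a built-in operation; the breadth-first walk above is the same idea done by hand over an in-memory mapping.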
Interesting. Okay. So you said that this is a much more interesting way to go about this than the GitHub data. What’s something that you found when you started working with the dependency data that you never had in GitHub Explore, or just in the GitHub data?
GitHub stars don’t really give you a good indication of actual usage, and GitHub download data is only really accessible if you are a maintainer of a project, rather than just someone who’s looking at the project from a regular browser’s perspective. If you actually look at the dependency data and not just other libraries that depend on that particular library, but if you look at the whole ecosystem and how many, say, GitHub projects depend upon this particular package, it gives you a fairly good idea of how many people are still using that, still need that thing to be around so that code continues to work. And if there was a security vulnerability, you can see exactly how many projects may be affected. So you actually end up connecting the dots between… And I’ve only looked at GitHub data so far; I haven’t got around to doing Bitbucket or arbitrary Git repositories.
[00:04:21.05] But you can actually use package management data to connect the dots between GitHub repositories as well. You can say, “Oh, well given this GitHub repository, how many other GitHub repositories depend on it through NPM or through RubyGems?”
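Connecting repositories through package data amounts to joining two mappings: what each repository declares, and which repository each published package comes from. The sketch below assumes both mappings are already available; every repository and package name in it is made up for illustration.

```python
from collections import defaultdict

# Hypothetical data: (package manager, package) pairs declared by each repository.
REPO_DEPENDENCIES = {
    "alice/blog": [("npm", "left-trim"), ("rubygems", "tiny-http")],
    "bob/api": [("npm", "left-trim")],
    "carol/cli": [("rubygems", "tiny-http")],
}

# Hypothetical data: published package -> repository that holds its source.
PACKAGE_SOURCE_REPO = {
    ("npm", "left-trim"): "dave/left-trim",
    ("rubygems", "tiny-http"): "erin/tiny-http",
}


def dependent_repositories(repo_dependencies, package_source_repo):
    """Map each source repository to the set of repositories that depend on it."""
    dependents = defaultdict(set)
    for repo, packages in repo_dependencies.items():
        for package in packages:
            source = package_source_repo.get(package)
            if source is not None:  # skip packages with no known source repo
                dependents[source].add(repo)
    return dependents


for source, repos in sorted(dependent_repositories(REPO_DEPENDENCIES, PACKAGE_SOURCE_REPO).items()):
    print(source, len(repos))
```

The same index answers the security question raised above: given a vulnerable package, the set of dependent repositories is the list of projects that may be affected.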
It’s good to hear that stars are useless, because I’ve also thought that. [laughs] That’s been my assessment, as well.
Yeah, I’ve [unintelligible 00:04:46.03] over how you shouldn’t judge a project by its GitHub stars. There’s one particular project that’s a great example of that, it’s called Volkswagen. It is essentially a monkey patch for your CI to make sure it always passes. I think it’s got something like 5,000 GitHub stars, and it’s maybe downloaded 50 times on NPM; it has zero usage.
Yeah, that’s by Thomas Watson. It was a joke when VW had that scandal where they were just passing all their tests, so he wrote a module called Volkswagen that just made all your tests pass, no matter what. [laughs] It’s brilliant… But yeah, utterly useless in terms of actual usage.
Yeah, and if you actually look at the stars… Of course, people have contributed to it, but even looking at contributor data doesn’t give you a good indication of actually is this a useful thing, a real thing, and should I care about it? I always look at GitHub stars as a way of… It’s kind of like a Hacker News upvote or a Reddit upvote, or a Facebook like. It just means like, “Oh, that’s neat!”, rather than “I’m actually using this” or “I used to use this five years ago.” No one ever un-stars anything either, whereas if people stop using a dependency, you actually see the amount of people that depend on a thing go down.
I think stars are an indication of attention at some point in time, and that is all we can say about them. So if you look at stars versus pageviews on a given repo, they correlate very well. So in defense of stars, we shouldn’t use them as “This is what people are using”, but they’re a good measure of some popularity, some metric. And I think that’s exactly what you just said, Andrew. Consider it like a Facebook like, or something like that. It’s got very little to do with how many people are actually using something at any point in time.
Yeah. I saw someone actually build a package manager; I think it was only a prototype, but I really hope it never actually became a thing, where it would pick the right GitHub repository if you just gave it the name rather than the owner and the name, by the thing that had the most stars, which sounded like a terrible idea at the time and completely gameable.
Yeah, that doesn’t sound like a good idea. You mentioned something interesting, which was you can understand how people use it in terms of just it being depended on. Recently GitHub did this new BigQuery thing, and one of the results is that you can do RegExes on the actual file content of a lot of this stuff, so you can start to look at which methods of a module people might use or how they might use it. Could you get into that a little bit?
Yeah, so just to refresh the data that we put into BigQuery, it’s basically not only the event data that comes out of the GitHub API, which is just “Something happened on this public repo” - and that’s what the GitHub archive has been collecting for a long time - this is actually in addition to that, the contents of the files and all the paths of the files for about 2.8 million repos, so anything with an open source license on GitHub basically that’s in a public repo.
[00:08:15.07] So that allows you to do things like if there’s a particularly - maybe a method call in your public API that you wanna try and measure the use of, then you can now actually go and look for people using that explicitly. So currently really complex kind of RegEx stuff on GitHub searches is pretty hard; in fact, I’m not sure you can do a RegEx query on GitHub search, so that’s one of the strengths of BigQuery, that you can actually construct these really complex, expensive queries, but then of course that gets distributed across the BigQuery framework, so it comes back in a reasonable amount of time.
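As a rough local stand-in for that kind of query, the sketch below runs a regular expression over a small in-memory set of file contents and reports which repositories appear to call a given method. It is not BigQuery code; the `FILES` data, the repository names, and the `connect_legacy` method are all hypothetical.

```python
import re

# Hypothetical data: (repository, path) -> file contents.
FILES = {
    ("alice/blog", "src/db.py"): "conn = client.connect_legacy(host)\n",
    ("bob/api", "app.py"): "conn = client.connect(host)\n",
    ("carol/cli", "main.py"): "# old API, see docs\nclient.connect_legacy (host, port)\n",
}

# Match the method name followed by an opening parenthesis, allowing whitespace.
CALL_PATTERN = re.compile(r"\bconnect_legacy\s*\(")


def repos_calling(pattern, files):
    """Return the sorted repositories with at least one file matching `pattern`."""
    return sorted({repo for (repo, _path), content in files.items() if pattern.search(content)})


print(repos_calling(CALL_PATTERN, FILES))
```

A pattern like this is only an approximation of real usage: it will also match calls inside comments or strings, and it will miss calls made through aliases, which is part of why the parse-graph approach discussed in this episode is so much more work.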
Yeah, for languages like C, that’s pretty much the only way to do it. There’s just no convention there, other than the language itself. And then for some other package managers, you actually have to execute a file to be able to work out the things that it depends upon, which I avoid doing because I don’t really wanna run other people’s code just arbitrarily.
Well, in the NodeJS project we’ve been trying forever to really figure out how are people using some of these methods, because if we wanna, say, deprecate something, we’d really like to know how many people are using that in the wild and to which level is it depended on. But we’ve had several projects where we tried to pull all of the actual sources out of NPM and create some kind of parse graph and then figure out how that gets used… It’s just such a big undertaking that it hasn’t really happened. When this BigQuery stuff got released we were like, “Oh my god, how far can we get with the RegEx to figure out some of the stuff that’s used?” because that’d be really useful.
Yeah, it kind of makes me sad that we’ve made everyone write crazy RegExes, but sorry about that. Hopefully, that will be useful. [laughs] Hopefully a bunch of good stuff can be done; people are gonna have to level up their RegEx skills, I think.
Just for people who are newer to metrics world, why should they care, to be blunt, about this dataset being open and being on BigQuery? What are some things that you expect the world to be able to do with this data? Even outside of people like Mikeal with Node, but policy makers or researchers or anyone else.
One of the things I think is incredibly difficult right now for some people is to measure how much people are using their stuff. For a maintainer of an open source project maybe that’s not a huge problem, because you can go and look at things like Libraries and see how many people are including your library as a dependency, or maybe you can just see how many forks and stars you’ve got of your project on GitHub, but I think there are some producers of software where actually reporting these numbers is incredibly important, and Nadia, you mentioned researchers. If I get money as an academic researcher from a federal agency like the National Science Foundation or the National Institutes of Health, one of the really important things about getting money from these funders is you need to be able to report the impact of your work.
[00:12:26.16] It’s currently kind of hard to do that if you have your software only on GitHub and you don’t have any other way of measuring when people use the library. You don’t have any direct ways of doing that, other than just looking at the graphs that you have as the owner of the software on GitHub. So I’m excited about the possibility of people being able to just construct queries to go and look… Of course, only open source, public stuff is in this BigQuery dataset, but I think it offers at least a place where people can go and try and get some further insight into usage.
I think it’s actually a hard problem to solve, but I know there are some environments - I’m trying to think of some large institutional compute facilities, big HPC centers… People have done some work, doing some reporting on when something’s being installed or run, and actually Homebrew I think have started doing that recently as well, starting to capture these metrics. Because it’s really tough to know; not everything that people produce is open source, so it’s not even clear that everything’s out there and measurable and available. It’s really tough if you need good numbers to actually say, “Who’s using my stuff? Where are they?”, and there’s lots of very legitimate privacy concerns for collecting all of that data. So yeah, it’s a hard problem.
So for you coming from the academia world, have you gotten requests from people from the scientific community around using this type of data? Did those experiences help inform the genesis of this project at all?
Yeah, a little bit. Very early on when I joined GitHub I got some enquiries from people saying, “We’d love to get really, really rich metrics on how much stuff is being downloaded, where people are downloading from…” - all this stuff that you needed if you had to report and you wanted really rich metrics. Some of those data we just can’t serve in a responsible fashion. There’s no way we can tell you the username of every GitHub user of your software, that would be a gross violation of users’ privacy on our part. So there are things that we just can’t do.
The other thing is - and I think this is a kind of a pretty sane standpoint for us to take - we take very seriously user support, so if somebody comes to me with a data request, it may be ethically possible for me to service that, and it might be technically possible for me to service that. But if it takes two weeks of my time to pull that data, then we’re not gonna help them with that problem, and that’s because we kind of believe that everybody… We should be able to service a thousand requests that are coming like that; we should be able to give uniformly the same level of quality support service to people, so we generally try and avoid doing special favors, if that makes sense, in terms of pulling data. So this is why making it a self-service thing, getting more data out in the community, making it possible for people to answer their own questions is a much more scalable approach to this problem.
[00:15:58.08] I think the next step for me personally with this data being published is to start to kind of show some examples of how it can be used to answer common support questions that we see. I think that’s kind of the obvious next step from my standpoint.
And Andrew, you’re in a position where you’re actually taking a bunch of public data that’s out there in all these different public ecosystems and then kind of mashing it together, so you’re like your own customer for this data. What are some of the interesting things that you’ve been looking at? What are some of the most interesting questions that you’ve been able to answer?
Unfortunately I didn’t have access to the BigQuery earlier, so I’ve been collecting it manually via the GitHub API for the past year and a bit, which takes a lot longer, but it also picks up all of the repositories that don’t have a license, which I guess often it’s probably best not to pull people’s code out if they have not given permission to do that.
Some of the things that I’ve been able to pull out and have been quite interesting is looking at not only the usage of package managers across different repositories, but the amount of repositories that use more than one package manager, or that use Bower and NPM, or RubyGems and NPM, and then looking at the total counts of those usages, as well as the number of lockfiles, which I found really interesting.
Coming from a time working with Rails before Bundler, it was incredibly painful sharing projects or coming back to projects and trying to reinstall the set of dependencies that all worked, given the transitive dependencies that move around all the time with new versions. And it looks like the Ruby community is pretty much… For every Gemfile there was a Gemfile.lock, whereas for the Node community, there’s maybe kind of five, ten thousand shrinkwrap files that I’ve found on GitHub on public projects, compared to the nine hundred thousand package.jsons, which in the short term won’t be a problem, but could potentially cause Node projects to be very hard to bring back to life if they’ve not been used in over a year. Because trying to rebuild that transitive dependency graph may be impossible - or it may be really easy, it’s hard to know. But it’s quite interesting to look at how different communities approach the question of “How reproducible can I make my software?”
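A comparison like that can be approximated by counting, for each manifest file found in a repository, whether its lockfile sits in the same directory. The sketch below assumes a listing of file paths per repository is already available; the repositories are invented, and pairing `package.json` with `npm-shrinkwrap.json` is a simplification made for this example.

```python
import posixpath

# Manifest file name -> lockfile expected in the same directory.
MANIFEST_TO_LOCKFILE = {
    "Gemfile": "Gemfile.lock",
    "package.json": "npm-shrinkwrap.json",
}

# Hypothetical data: repository -> file paths it contains.
REPO_PATHS = {
    "alice/blog": ["Gemfile", "Gemfile.lock", "package.json"],
    "bob/api": ["package.json", "npm-shrinkwrap.json"],
    "carol/cli": ["web/package.json", "Gemfile", "Gemfile.lock"],
}


def lockfile_coverage(repo_paths):
    """Return {manifest name: (manifests found, manifests with a lockfile beside them)}."""
    counts = {manifest: [0, 0] for manifest in MANIFEST_TO_LOCKFILE}
    for paths in repo_paths.values():
        path_set = set(paths)
        for path in paths:
            directory, name = posixpath.split(path)
            lockfile = MANIFEST_TO_LOCKFILE.get(name)
            if lockfile is None:
                continue  # not a manifest we track
            counts[name][0] += 1
            if posixpath.join(directory, lockfile) in path_set:
                counts[name][1] += 1
    return {manifest: tuple(pair) for manifest, pair in counts.items()}


for manifest, (total, locked) in lockfile_coverage(REPO_PATHS).items():
    print(f"{manifest}: {locked} of {total} have a lockfile")
```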
I think we’re heading into the break right now… When we come back we’ll talk about the open source ecosystem.
[00:19:38.23] We’re back with Andrew from Libraries.io and Arfon from GitHub. In this segment I wanna talk about the broader open source ecosystem and the types of metrics that are and aren’t available to people, because I’ve heard a lot of confusion about “Well, why can’t we measure what is being measured right now?” and I think both of you together probably have a good handle on that. I want to start with talking about GitHub data, since that was mentioned earlier, around download data and stars and things like that. Are there any sort of myths that you wanna address around the types of things that GitHub actually does measure or doesn’t measure?
I don’t think so. I mean, I don’t know what myths there might be. I would love to hear things that you’ve heard that you would love to know if they’re true. I don’t know of any kind of whisperings of what GitHub might be doing, so I’m happy to respond to questions.
I hear a lot around just download data, and whether GitHub actually has the data and isn’t sharing enough of it, why not use download data in addition to stars as something that people can see…
Sure… Yeah, okay. So there is a difference between what you as a project owner can see about a GitHub project and you as a potential user of that software. So there are graphs with things like number of clones of the software, which is I think a good metric, there are graphs for showing how many pageviews your project got actually, like a mini Google Analytics. So anybody who owns a GitHub repository can see those graphs. They’re not exposed to the general public, and I would like them to be; I think they’re useful. I think we were kind of cautious initially when rolling those out, thinking that was the kind of information that is something maybe that’s only relevant or appropriate for the repository owner to see… I don’t know, I think that data is generally useful for people to be able to see if… Andrew, you’ve mentioned before just the idea there’s a package manager that tries to suggest the correct GitHub repository based on just a name, and it does that based on stars - that’s not great, but at the same time when you are looking for a piece of software to use, if it has a bunch of forks and a bunch of stars and a bunch of contributors, then that helps you inform your decision about what to use, even if you haven’t even looked at the code yet, right? Personally, I use that information to help inform my decision.
I seem to remember the metrics weren’t exposed because of some of the referrer data potentially leaking people’s internal CI systems.
Yeah, that might be possible. I’m not hugely familiar with exactly why the data isn’t exposed right now. I think it’s important to remember that we take user privacy very seriously, so the thing here is you wanna be on the right side of people’s expectations of privacy. There are things that GitHub could do that would surprise people - and not in a good way - and we don’t want that to happen. So you’re always gonna see us on the side of reducing the scope of who could see a particular thing. That said, I think consumption metrics, fork events - we used to expose downloads. I think one reason we don’t expose downloads anymore is we actually just changed the way that we capture that metric, and it’s not captured in a way that is designed to be served through like a production service. It’s in our analytics pipeline, but it’s not in a place where we could build an API around it, it’s just not performant enough to build those kind of endpoints.
[00:23:47.15] So yeah, we capture more information than we expose, but that’s just a routine part of running a web application and having a good engineering culture around measuring lots of things. The decision about what to further expose to the broad open source community or the public at large is largely one based on making sure that we’re in line with people’s expectations of privacy, but also just based on user feedback. So if there’s stuff that you would like to see presented more clearly, you should definitely get in touch with us about that, because we are responsive to things that come up as common feature requests. That’s a good way of giving us feedback.
I think also any metric has to be qualified, right? A lot of this talk about stars is that stars is not an indication of quality, it’s an indication of popularity at a point in time, like you said, but people take it as that because it’s the only data that they have.
An example is in NodeJS we have metrics for which operating system people are using, so we always put out two data points. One is the operating systems that have pulled downloads of Node, either the tarballs or the installers of some kind, and then we also have the actual market share for the NodeJS website, visitors to the website. And those are two ends of a very large spectrum in terms of machines that are running Node and people that are using Node.
Windows is huge on the people end and incredibly small on the actual computer end. But we do a lot to qualify those metrics before we put them out, to set people’s expectations about them.
Yeah, and there’s another thing… I think the Python package index has a similar - like a badge you can put on your profile. And you see this, people will put it, the number of downloads last month from the Python package index, and it’s exactly the same problem. For a fast-moving project where they’re doing lots of CI builds it might be 50,000 downloads last month, or something, and you’re like, “Whoa, that’s crazy!” and then actually there’s not that many users, it’s actually the CI tools that are responsible for most of those.
Yeah, the problem with download metrics on packages too is that you also get into the dependency graph stuff, right? Downloads are really good at looking at the difference in popularity between something like Lodash and Request. They’re both very popular, but the difference in downloads gives you some kind of indication of the difference. But there’s also a dependency of Request that’s only depended on by three other packages, that has amazing download numbers because it’s depended on by Request, right?
Yeah, I have one of those, base62. I don’t think there are many projects that use it, but it gets like one and a half million downloads a month because React transitively depends upon it, so it’s downloaded by everyone all the time. But it never changes, it’s never really used. Lots of people reimplement it themselves.
That’s funny. There’s a lot of packages like that. The whole Leftpad debacle was people did not know that this was used by a thing that used a thing that used a thing. It wasn’t that popular of like a first-order dependency, it just happened to be in the graph of a couple really popular things.
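The gap between first-order and transitive usage can be made concrete by reversing the dependency graph and counting both kinds of dependents. This is a minimal sketch over made-up package names; it does not use real NPM data.

```python
# Hypothetical data: package name -> packages it depends on directly.
DIRECT_DEPS = {
    "app-a": ["ui-toolkit"],
    "app-b": ["ui-toolkit"],
    "app-c": ["ui-toolkit"],
    "ui-toolkit": ["tiny-encoder"],
    "tiny-encoder": [],
}


def direct_and_transitive_dependents(package, direct_deps):
    """Return (direct dependents, all dependents) of `package`."""
    # Reverse the graph: dependency -> packages that list it directly.
    reverse = {}
    for name, deps in direct_deps.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(name)

    direct = set(reverse.get(package, set()))
    seen = set()
    stack = list(direct)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(reverse.get(current, ()))
    return direct, seen


direct, everyone = direct_and_transitive_dependents("tiny-encoder", DIRECT_DEPS)
print(len(direct), len(everyone))
```

In this toy graph `tiny-encoder` has a single direct dependent but is pulled in by every application that uses `ui-toolkit`, which is the pattern that inflates download counts for packages few people choose deliberately.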
That’s one reason why I haven’t started pulling download stats for Libraries, because you can’t compare across different package managers either, because the client may cache really aggressively. RubyGems really aggressively caches every package, whereas if people are kind of blasting away their Node modules folder whenever they want to reinstall things, then the numbers - you can’t even try to compare them across different package managers. If you’re looking for “I wanna find the best library to work with Redis”, then download counts just muddy the waters, really.
[00:28:01.09] I think a lot of the metrics fall into that, though. When you start looking at them across ecosystems, they really don’t match up. The one that I think of comparing a lot is Go and NPM. GoDoc is actually like a documentation resource, it’s not really a package manager, but people essentially use the index of it as an indication of the count of total packages. But that’s really like about four times what the actual unique packages are, which is an interesting way to go, and it’s one thing that just doesn’t match up with the way that NPM or PIP do it. Not that it’s invalid, it’s just measuring something different.
Yeah, the Go package manager is slightly strange because it’s so distributed. It’s just, give it a URL and that is the package that it will install, so basically every nested file inside that package could be considered to be a separate thing, because it’s just a URL that points to a file on the internet, as opposed to something that has been explicitly published as a package to a repository somewhere.
I’d like to get into the human side of this, too. You’ve mentioned this a little bit earlier when you were talking about the difference between NPM and Ruby in terms of locking down your dependencies. That’s not enforced by the package manager, it’s just now a cultural norm with Bundler and not with NPM. Are there some other people differences that you see between Go and NPM because of those huge differences? Or any other package manager, for that matter.
I’ve tried not to look too much into the people yet, partly because I didn’t wanna end up pulling a lot of data that could be used by recruiters, and make libraries a source of kind of horrible data that would abuse people’s privacy.
I didn’t mean like individuals, I meant like culturally. I didn’t mean like, “Be creepy.” [laughs]
[inaudible 00:29:55.07] all kinds of horrible things. Nothing springs to mind… I guess you can look at the average number of packages that a developer in a particular language or package manager would potentially publish more, or the size of different packages. Node obviously tends towards smaller things, or a lot more smaller things. There are still some big projects as well, but it’s a bit more spread around, whereas something like Java tends to have really large packages that would do a lot of things.
I haven’t done too much in comparing the different package managers from that perspective, because it felt like… As you said, you don’t get much mileage from going like “What is this thing compared to this thing?” It’s much better to look at what packages can we highlight as interesting or important within a particular package manager and see if we can do something to support those and the people behind them; so looking at who are the key people inside the community, and then “Are they well supported? What can we do to encourage or to help them out more?” as opposed to trying to compare people across different languages.
You definitely see a certain amount of people who live in more than one language as well. It’s not often that there’s people that are just only doing one particular language.
I’m curious whether there’s - I don’t know a whole lot about this, but if there’s any way to standardize how package managers work across languages, or just standardize behavior somehow. Because I just sort of think for people that are coming at this from outside of open source, but are really curious of, for example, what are the most depended on libraries that we should be looking at and trying to support those people. It seems like it’s just really hard to count… Every language is different, every package manager is different.
[00:32:14.20] Yeah. I’ve standardized as much as possible with Libraries. The only way I could possibly collect so many things is to kind of go, “Let’s treat every package manager as basically the same, and if they don’t have a particular feature then that’s just ‘no’ for that particular package manager.” If you ignore the clients and the way the clients install things and just look at the central repositories that are storing essentially names of tarballs and versions, then it’s fairly easy to compare across them when there is a central repository. Things like Bower and Go are a little bit more tricky because they don’t have that… You end up going like “Well, we’ll assume the GitHub repo is the central repository for this package manager”, which for Bower it is, but for Go it’s kind of spread all over the internet; it’s mostly GitHub, but there are things all over the place.
But you can then kind of go, “Okay, within a given package manager, show me the things that are highly depended on but only have one contributor, or have no license”, which is easy to pull out in Go, but then “Order by the number of people that depend on it or the number of releases that it’s had” to try and find the potential problems or the superstars inside of that particular community.
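That kind of query is a filter followed by a sort. The sketch below assumes each package has already been normalized into the same record shape regardless of package manager; the field names and all of the records are hypothetical, not the Libraries.io schema.

```python
# Hypothetical normalized records, one per package.
PACKAGES = [
    {"name": "tiny-encoder", "dependents": 4200, "contributors": 1, "license": "MIT", "releases": 3},
    {"name": "ui-toolkit", "dependents": 9100, "contributors": 85, "license": "MIT", "releases": 140},
    {"name": "url-utils", "dependents": 1500, "contributors": 6, "license": None, "releases": 12},
    {"name": "hobby-lib", "dependents": 2, "contributors": 1, "license": None, "releases": 1},
]


def potential_problems(packages, min_dependents=100):
    """Return heavily used packages with a single contributor or no license, most depended-on first."""
    flagged = [
        package
        for package in packages
        if package["dependents"] >= min_dependents
        and (package["contributors"] == 1 or package["license"] is None)
    ]
    return sorted(flagged, key=lambda package: package["dependents"], reverse=True)


for package in potential_problems(PACKAGES):
    print(package["name"], package["dependents"])
```

The `min_dependents` threshold is an arbitrary choice for this example; sorting by release count instead of dependents only requires changing the sort key.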
Right. I can see you kind of standardizing the data and some of the people work, but the actual technology - or even the encapsulation - you eventually hit the barrier of the actual module system itself, right? One of the reasons why Node is really good at this is because NPM was built and the Node module system was essentially rewritten in order to work better for NPM and better for packaging. So a lot of the enablement of the small modules is that two modules can depend on two conflicting versions of the same module, which you can’t do if you have a global namespace around the module system, which is the problem in Python, for instance.
So there’s a general trend I think towards everything getting smaller and packages are getting smaller, but some module systems actually don’t support that very well, and you’re hitting kind of a bottleneck there.
Yeah, I don’t think there are many other package managers other than NPM that allow you to run multiple versions of a package at the same time, and partly because of the danger of doing that, that you introduce potentially really subtle bugs in the process. But most of the package managers in the languages that at least I have an experience with will load the thing into a global namespace, or the resolver will make sure that it either resolves correctly to only have one particular version of a thing, or it will just throw its hands up and go “I can’t resolve this dependency tree.”
Yeah, it’s important to note that’s not part of NPM, it’s part of Node. Node’s resolution semantics enable you to do that; it’s not actually in NPM. NPM is just the vehicle by which these things get published and put together.
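The difference between a single global namespace and per-package resolution can be sketched with two toy resolvers. This is a deliberate simplification that uses exact versions only and does not model Node’s actual lookup algorithm or any real package manager’s resolver; the package names are invented.

```python
# Hypothetical data: package -> {dependency: exact version it requires}.
REQUIREMENTS = {
    "http-client": {"url-utils": "1.0.0"},
    "template-engine": {"url-utils": "2.0.0"},
}


class ResolutionConflict(Exception):
    """Raised when one namespace would need two versions of the same package."""


def resolve_flat(requirements):
    """Global namespace: each dependency name may resolve to only one version."""
    chosen = {}
    for requirer, deps in requirements.items():
        for dep, version in deps.items():
            if dep in chosen and chosen[dep] != version:
                raise ResolutionConflict(
                    f"{requirer} needs {dep} {version}, but {chosen[dep]} is already selected"
                )
            chosen[dep] = version
    return chosen


def resolve_nested(requirements):
    """Per-package namespace: every requirer gets its own copy of each dependency."""
    return {requirer: dict(deps) for requirer, deps in requirements.items()}


print(resolve_nested(REQUIREMENTS))
try:
    resolve_flat(REQUIREMENTS)
except ResolutionConflict as error:
    print("flat resolution failed:", error)
```

The nested resolver never fails, but it can leave two versions of `url-utils` loaded at once, which is the source of the subtle bugs mentioned above.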
I think there’s been valiant efforts to make an installer and an NPM-like thing in Python, and they eventually hit this problem where you actually need to change the module system a bit.
Yeah, I made a shim for RubyGems once that essentially did that and it made a module of the name and the version, and then kind of hijacked the require in Ruby. It was a fun little experiment, but ends up being… You’re just fighting against everything else that already exists in the community. So you kind of wanna get in early before the community really gets going and starts building things, because once all that code is there it’s really hard to change.
[00:36:00.08] In that vein, have you seen any changes across these module systems as they’ve gone along? Have any really spiked in popularity or fallen? Are there changes that actually happen in these ecosystems once they get established?
Not so much. Elixir is making a few small changes, but it’s more around how they lock down their dependencies. Usually once there’s a few hundred packages - and often it’s because I guess there’s just not many maintainers that are actually working directly on the package managers; often they’re completely overwhelmed anyway to be able to keep up and be forward-thinking with a lot of this stuff. And I get the feeling that a lot of people are building their package manager for the first time and kind of don’t really learn the lessons of previous package managers. CPAN and Perl solved almost every problem a long time ago…
…and these package managers go round and eventually run into the same problems and solve the same things over again.
Related to that - I’m curious for both Andrew and Arfon - when we talked about looking at stars versus looking at downloads, and looking at projects that are trending or popular versus ones that are actually being used, for someone who’s trying to look through available projects and ferret out which ones they should be using, how should they balance those two ideas? Because it sounds like once an ecosystem gets established then nothing really changes a whole lot, so you could make the argument that just because a lot of people are using a certain project doesn’t mean that you should also be using it. It could also encourage a different kind of behavior, whereas if you’re telling people only to look at the popular ones, then that encourages a behavior of doing, “I don’t know, maybe it’s not the best project.” So how do you balance - should we be looking at which one is trending or new or flashy, versus something that is older but everybody is using?
Yes, tricky one. I’ve been kind of intentionally avoiding highlighting the new, shiny things in package managers for the moment, and kind of not doing any newsletters of “Here are the latest and greatest things that have been published.” I think this mirrors my approach to software at the moment, which is to focus on actually shipping useful things to solve a problem, as opposed to following whatever the latest technology is.
But that’s just my point of view. There are lots of people who are looking for employment and want to be able to keep on top of whatever is currently the most likely to get them a job, which is a very different view of “What should I look at? What should I use?”
Something I really struggle with in software in general is that you often hear people saying, “Oh, this project should just die, because it’s not following modern development practices, or it’s just kind of hopeless and we should just focus on whatever is new.” I think it’s because it’s comparatively easier to do that with software infrastructure than it is with physical infrastructure; they can kind of just throw something away. But there’s a part of me that’s also like, “Well, maybe we should reinvest in things that are older but that everybody is still using.”
Yeah, and sometimes it’s a case of people very loudly saying, “I’m not gonna use this anymore”, whereas there are a number of people that are just using it and not telling anyone, just getting on with what they’re doing. They still require that stuff. Often you see companies will have their own private fork, or they’ll just keep their internal changes and improvements and never contribute them back, because they’re just solving their own particular problem.
I relatively recently started doing some Node stuff and I wanted to find a testing framework; I just wanted to write some tests, and I ended up going through about six in about five hours and it seemed by my assessment of what’s going on, the community was moving so quickly - three of the frameworks are all written by the same person. They clearly changed their opinion and had a preference about the way that they were going to now work, but I literally couldn’t get… It wasn’t a very satisfactory experience because things were moving so fast.
I consider myself reasonably technical and pretty good at using GitHub hopefully, and I found it hard to find a good set of defaults. I don’t know, I think finding the right thing, it’s…
It’s very similar in the browser at the moment. It’s hard to know - is this library the right thing anymore? I find myself going to caniuse.com to work out, like “Is this mirroring an API that now is a standard, or has it moved on?” because the browser being evergreen makes everything really hard to… And you can’t freeze anything in time anymore with anything that’s delivered to a browser, because Chrome is updating every day almost.
Yeah, I don’t know… The other thing is if you actually went out, stick your neck out and say “You should use these things” then somebody’s obviously gonna shout at you on the internet and say “You’re an idiot. You should use this thing.” I think it’s hard for the individual to have a strong preference and be public about that. It’s an unsolved problem, I think.
The scary thing to me is that there is no correlation that I can find between the health of a project and the popularity of a project.
It’s totally fine if it’s not the coolest thing, but people are still working on it and it’s still maintained. But things actually die off and the maintainer leaves and it’s still popular and still out there, and still being heavily used because it’s that thing that people find. But as you said, that maintainer already moved on to a new project, didn’t hand it over to anybody, has a new testing framework that they’re working on and doesn’t really care about this thing. So we don’t have a great way to surface that data or to just embed into the culture, like when you’re looking for something, look for health, and what does health mean to a project?
And making that argument to someone that… They might not care about the health, because they’re like, “Well, it’s popular and everyone’s using it.” I struggle with sort of like what is a good argument for saying “You should care about this” to a user.
Yeah, it’s a very long-term thing as well, because if you get an instant result and you can ship it and be done, you’re like “Oh, that’s fine, I don’t need to come back and look at this again”, whereas in six months, a year’s time you might come back to it and be like “Oh, I wish I didn’t do this.” But you have to be quite forward-thinking; especially as a beginner, that can be something that you just don’t consider, the long-term implications of bit-rot on your software.
Yeah, I feel like there was a thing relatively recently on Hacker News, like “Commiserations, you’ve now got a popular open source project”, or something like that. It was this really well-articulated overview of, so you publish something on GitHub; now a bunch of people are using it, and now you’ve got the overhead of maintaining it for all of these people that maybe you don’t really wanna help.
[00:44:06.17] For me that’s just a good demonstration of, you know, lots of people publish open source code, and they’re doing that because that’s just normal, or maybe they’re doing that because that’s the free side of GitHub, or whatever the reason is they’re doing that; or they’re solving probably their own problems - they were working on something because they were trying to solve a problem for themselves. If that then happens to become incredibly popular, because that’s a useful thing and lots of people wanna use it, there’s no contract of “It’s my job now to help you.” There’s just conventions and social norms around what it looks like to be a good maintainer, but there’s no…
I think a lot of people who publish something that then becomes popular maybe don’t want to maintain it, or maybe don’t have the time to maintain it. Money helps, I think, but I think funding open source is hard; for lots of people it isn’t their day job to work on these things, and I think there’s not a good way yet - apart from the very large open source projects - of handing something off to a different bunch of people. I think that’s actually not very well solved for. You see Twitter do it with some of their large open source projects, they put them in the Apache Software Foundation, but that’s a whole different kind of scale of what it looks like to look after an open source project.
Nadia, you’ve written a bunch about this, I’m sure you’ve got a bunch of opinions on this as well.
I think that you’ve really highlighted the basis for the shift in open source, which is that we’ve gone to a more traditional peer production model. If you read anything from Clay Shirky about peer production, it’s like you publish first and then you filter, and the culture around how you filter and how you figure that out is actually the culture that defines what that peer production system looks like.
And in older open source, in order to get involved at all it was so hard, that you basically internalized all of that culture and then basically became a maintainer waiting in the wings, and that’s just not the world anymore.
People publish that have no interest in maintaining things at all, because everybody just publishes, that’s the culture now. I think we’re actually gonna come into a break now, but when we get back we’re gonna dive into what are those metrics of success, what are those metrics of health and how can we better define this.
[00:48:49.29] And we’re back. Alright, so let’s dive right into this. What are the metrics that we can use for success? How can we use this data to show what the health of an open source project might be and expose that to people? Let’s start with Arfon, since we have so many new metrics coming out of this new GitHub data.
Yeah, so I’ll start by not answering your question directly, if you don’t mind. One thing I would love to see is… There are things that I can do, and anybody who’s looked at enough open source software… If you give somebody ten minutes, “Tell me if this project is doing well”, you can answer that question as a human, right? You can go and look at the repo, maybe you find out they have a Slack channel or discussion board, you go and see how active that is, you maybe go and look at how many releases there were, how many open issues there are, how many pull requests end up being responded to in the last three or four months… You can kind of take a look at a project and get a reasonable feeling for whether it’s doing well or not, and that I think is the project’s health. I think that’s what we can do as an experienced eye.
What that actually means in terms of heuristics, the ways in which we could codify that in terms of hard metrics, I think that’s a reasonably tough problem. I don’t think it’s impossible by any stretch, but it’s things like - we could make some up right now. Like, are there commits coming and landing in master? Are pull requests being merged? Are issues being responded to and closed? Another one I’m particularly interested in, because I think this is pretty important for the story we tell ourselves about open source, the idea that anyone can contribute: “Are all the contributions coming from the core team, or are they coming from outside the core team?”
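A minimal sketch of how the made-up heuristics above might be codified. The record types, field names and the `health_summary` function are all hypothetical, and the data is assumed to have already been fetched into memory; this is one possible reading of those questions, not an established health score.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    author: str   # hypothetical field: login of the person who opened it
    merged: bool  # hypothetical field: whether it was merged


@dataclass
class Issue:
    closed: bool  # hypothetical field: whether the issue has been closed


def health_summary(pull_requests, issues, core_team):
    """Turn three of the questions into simple ratios between 0 and 1."""
    merged = [pr for pr in pull_requests if pr.merged]
    outside = [pr for pr in merged if pr.author not in core_team]
    closed = [issue for issue in issues if issue.closed]
    return {
        # Are pull requests being merged?
        "pr_merge_rate": len(merged) / len(pull_requests) if pull_requests else 0.0,
        # Are issues being responded to and closed?
        "issue_close_rate": len(closed) / len(issues) if issues else 0.0,
        # Are merged contributions coming from outside the core team?
        "outside_share_of_merged": len(outside) / len(merged) if merged else 0.0,
    }
```

A low merge rate is not necessarily bad on its own; a project that prioritizes stability can score low here and still be working as intended, so the ratios are better read together than one at a time.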
There’s one quote that calls this the ‘democracy of the project’. Is it actually - ‘meritocracy’ is a dirty word these days, but is it the community that’s contributing to this thing, or is it just three people who are actually rejecting the community’s contributions and are just working on their own stuff?
Is it participatory, right? Can people participate? That’s the question.
Yeah. How open is this collaboration, is the way I like to think of it. Because I think that’s the thing we tell ourselves, and that’s one of the reasons that I think open source is both a collaboration model and a set of licenses and ways to think about IP. For me, the most exciting thing about open source - and actually about GitHub - is that I think the way in which collaboration can happen is very exciting. You have permission to take a copy, do some work and propose a change, and then have that conversation happen in the open.
A lot of people do that, but they’re actually working in a very small team, or working together. Actually, a while ago I tried to measure some of this stuff on a few projects that I use, and you can see quite clearly that some projects are terrible at merging community contributions. They’re absolutely appalling at it. I can’t name names; some of them are incredibly popular languages.
You can name names.
I totally won’t, I’ll absolutely not. Some of them are very poor. But then actually, just to counter that, okay, so what does it mean if you are very bad at merging contributions? Maybe that means your API is really robust and your software is really stable, right? It’s not clear that being very conservative about merging pull requests is wrong, but it does mean that the community feels different. It does mean that the collaboration experience is [unintelligible 00:52:44.16]
That’s exactly what I wanted to tease apart a little bit. I just had a talk recently where I was looking at Rust versus Clojure and how both of those communities function, and they’re really different. Rust is super participatory and Clojure is more BDFL, but one can make the argument that both are still working, and Clojure really prioritizes stability over anything else, so that’s why they’re really careful about what they actually accept as contributions.
[00:53:10.28] So we talked about popularity of projects and then we’re talking now about health of projects, and it feels like two parts of it. One is around “Is this project active? Is it being actively worked on and being kept up to date?”, and you can look at contribution activity there. The other part is “Is it participatory or is it collaborative? Does the community itself look positive, healthy, welcoming?” But those are two pretty separate areas in my opinion.
yum, which is an even smaller number of people for the project that could actually publish whatever changes were merged in, unless everyone is literally pulling from GitHub directly, which I don’t think most published software happens that way yet.
My prediction here is that the people and the organizations that are gonna solve this are gonna be the ones that are paying most attention to business users of open source. Because if you are a CIO and you’re thinking about starting to use open source more extensively in your organization, then assessing the risk of that in terms of maintenance and service agreements and understanding of whether a project is - if it does have a security vulnerability that’s likely to be patched… It’s useful to know in open source generally. “Should I use this library because it’s likely to see updates when Rails 5 is released?” or “When something happens, can I use my favorite framework with this, or my favorite tool? Is that likely to happen?” That’s useful to know, but it’s not business-critical. I think the people who really want a hard answer to this are more likely to be business consumers. That’s my prediction. I think there’s actually a lot of opportunity to do good stuff in this space.
[00:57:12.00] The Linux Foundation are a little bit around that with the Core Infrastructure Initiative, where they’re trying to see, “Has this project had a security review? When was the last time it was checked for the people that are behind the project?”, which I think is a harder thing to do automatically. You end up having to have a set of humans that go and contact other humans, which if those people are anything like me on email, it may take ages to get a response.
There’s a fair number of metrics that we can pull in automatically to give you a light indication of if the project is healthy. I guess you have to split it in half again and go like, “Well, what do I care about the project? Is this thing that I’m doing a throw-away fun experiment or a learning exercise, or is it something I’m gonna be putting into production?” Then you have to look at things with two very different sets of metrics.
I think the methodology that they used is somewhat applicable here though. I know a lot about the CII thing because I’m at the Linux Foundation. The NodeJS project was one of the first to get a security badge. Essentially what they did was they came up with “How do we do a really good survey on projects that are problematic? Do they have a security problem?” They asked some of the similar questions that we did, like “What makes a project healthy? How do we define that?” Then they went out and did this huge survey to identify all the projects that are having a problem. Later what they did was they turned all of those things into basically a badging program. There was a set of recommendations that you can do, and if you do all of these things, then you get the security badge.
The Node project was one of the launch partners of this. It’s really simple stuff, like have a private security list, have a documented disclosure policy, have that on a website somewhere. It sounds really basic, but the number of projects that are heavily depended on that don’t do that is surprisingly big. And just having a really basic set of things that people can go do that make people feel better about their software and are actually good for the health of the projects is like a really good set of recommendations that we can come up with, that would actually be based on metrics and some really good methodology.
I’m curious to kind of move this a little bit to thinking about analytics from a maintainer’s point of view. So if you’re a maintainer and you have a project, the project gets popular, what should they be measuring for their projects? What do you think they should be paying attention to at a high level?
Someone asked me a question the other day on Twitter… They were wondering for a given library that they were maintaining what were the versions of that library that people depended on. They wanted to see for the 500 other projects that depended on it what versions were they using, because they wanted to get an idea of which things could they deprecate. As Mikeal said earlier, we wanna know the actual pain points here and if people are stuck on an old version, and how can we move them forward, so that we can drop some old code or we can kind of clean up something that we don’t like anymore. That data is very easy to get, although trying to lump that in together with SemVer ranges ends up going like, “Oh, they depend on something around this version”, as opposed to something very specific.
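A rough sketch of the kind of tally described above, assuming the maintainer already has a mapping from each dependent project to the requirement string it declares. The project names, the `major_of` and `tally_majors` helpers and the input format are hypothetical. It deliberately sidesteps the SemVer-range problem by reading only the first major version number in each requirement, so it answers “something around this version” and nothing more precise.

```python
import re
from collections import Counter


def major_of(requirement):
    """Return the first major version number in a requirement string.

    Simplification: the range itself is not evaluated, so ">=1.0.0"
    counts as major 1 even though it also allows 2.x and later.
    """
    match = re.search(r"(\d+)(?:\.\d+)*", requirement)
    return int(match.group(1)) if match else None


def tally_majors(dependents):
    """Count dependents per major version; unparseable ranges go to 'unknown'."""
    counts = Counter()
    for requirement in dependents.values():
        major = major_of(requirement)
        counts[major if major is not None else "unknown"] += 1
    return counts


# Hypothetical dependents and the requirement each one declares.
dependents = {
    "app-a": "^1.2.0",
    "app-b": "~1.4",
    "app-c": ">=2.0.0",
    "app-d": "*",
}
for major, count in tally_majors(dependents).most_common():
    print(major, count)
```

If almost nobody declares a requirement on the oldest major version, that is a hint the old code paths may be candidates for deprecation, though a real decision would need the ranges resolved properly.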
[01:00:59.03] But having that actual usage data around the versions, which some package managers really give you the data of a particular download for a version as well, so you can see, “Oh, this thing looks completely dead. No one has downloaded this anymore”, as opposed to the last two releases that are really heavily downloaded. And you can get that data from RubyGems. I don’t think NPM has download data on a per-version basis, at least publicly available. For other smaller package managers it’s kind of all over the place, whereas at least on GitHub you can assume everyone is looking at the default branch.
Then also looking into the forks is something that maintainers might wanna do to be able to kind of go, “Oh, people are forking this off and changing things manually. They haven’t wanted to contribute back? Why didn’t they contribute back?” It definitely seems to me to come down to very human questions, as opposed to kind of like “What versions of Node are people running when they’re using my library?” It’s more kind of like, “How can I help these people either move forward onto a newer version, or what are the exceptions that they’re having that I never see?”
I was talking to the guy at Bugsnag, who do exception tracking, and they collect a lot of exception data that actually is thrown up by an open source library and they see it in the stack trace, like “Oh, this error has come from Rack”, for example, and they were investigating if they could use or at least ask for permission for users to report that error, exception tracking data, like “This line of your source code is causing lots of people lots of exceptions, for whatever reason”, which I thought was quite interesting. I don’t think they’ve actually got around to doing that yet, though.
Yeah, I’m also interested in the types of roles of people on your project, as well. One of the projects I maintain for GitHub is called Linguist, which is actually one of our more popular open source projects, and it does the language detection on GitHub; it’s kind of a somewhat self-serviced project, like if a new language springs up in the community and you want GitHub to recognize it and maybe syntax-highlight it, then you need to come along and add that to Linguist. For the longest time it’s been myself and one other GitHubber merging pull requests, and we just realized that the rate at which the project was able to move and be responsive was actually really severely limited by our attention. So I went and looked at who had made the best pull requests and been most responsive on the project in the past 6-12 months and I actually just gave a couple of those people commit rights to master.
We’ve got a little bit of policy around who gets to do releases still, just because it’s kind of coupled to our production environment, but doing that has just breathed new life into the project, and I think one of the things that was not straightforward, but you can get it from the Pulse page, is to see who’s got the most commits to master in the last year or two… Paying attention to who’s active on your project and then thinking about their role - it’s not the kind of hard metric, but thinking about who’s around and who actually really understands and cares about the project, has been contributing… I don’t know, I’m just reflecting on that; it’s only a few weeks that we’ve been doing it, but it’s been really successful so far, and has really put a shot in the arm in terms of energy of the project.
[01:05:02.19] My approach with open source projects I maintain like that is based off Felix Geisendörfer’s blog post, which was I guess a couple years ago. He basically just goes, “If someone sends me a pull request, I’m just gonna add them as a contributor. Because what’s the worst that could happen? If they merge something I don’t like, then I can just back it out.” And later on maybe give them release rights when they’ve kind of proved themselves a little bit that they’re not gonna go crazy… Which seems to work really well, so you get a lot more initial contributions, and those people might not stay around very long, but you see a spike in activity.
And that really developed in the Node community, too. Eventually, that turned into open-open source and more liberal contribution agreements. It’s really the basis now for Node’s core policies as well. There’s been a lot of iteration there on how you liberalize access and commit rights and stuff like that.
It’d be quite interesting to have GitHub actually go like, “Oh, this is the third pull request you’ve received from this person. You should consider adding them as a collaborator so they can do this themselves.”
Yeah, that’d be awesome.
In the Node project we do a roll-up every month just to show, “Okay, these are the people that merge a lot of stuff”, and then there’s a note next to them if they’re a committer or not, so that they can get onboarded if they’re not. That’s how we base the nominations.
If that was automatically integrated into GitHub it would save me so much time… Not having to run those scripts and post those issues, it would be fantastic.
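The actual Node scripts are not shown here, so the following is only a guess at the shape of such a roll-up: given the authors of the pull requests merged in a month and the current set of committers, list the most active people and mark who is not yet a committer. The `monthly_rollup` function, its `threshold` parameter and the sample names are hypothetical; the threshold of three echoes the “third pull request” idea mentioned above.

```python
from collections import Counter


def monthly_rollup(merged_pr_authors, committers, threshold=3):
    """Return (author, merged_count, is_committer) rows, most active first.

    merged_pr_authors: one entry per merged pull request in the period.
    committers: set of people who already have commit rights.
    """
    counts = Counter(merged_pr_authors)
    return [
        (author, merged, author in committers)
        for author, merged in counts.most_common()
        if merged >= threshold
    ]


# Hypothetical month of merged pull requests.
authors = ["ann", "ann", "ann", "bo", "cy", "cy", "cy", "cy"]
for author, merged, is_committer in monthly_rollup(authors, committers={"ann"}):
    note = "committer" if is_committer else "not a committer yet"
    print(f"{author}: {merged} merged ({note})")
```

Anyone listed as not a committer yet would be a candidate for nomination, which is the human step a report like this cannot replace.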
I think Ruby on Rails runs a leaderboard as well of the total number of commits into any of the Rails projects, and you can kind of see a little star next to the ones who are currently Rails core. It kind of gamifies it a little bit, which I don’t know if that’s a good thing or not. I guess as long as it’s people actually doing stuff for the contributions rather than just to get up the leaderboard…
I think it’d be cool to see that for other types of contributions too, like people that are really active in issues or people that are doing a lot of triaging work, or whatever. I hear that from people, of “Well, I also wanna recognize all these other people that are falling through the cracks or that we don’t always see.”
Right, yeah. We did this blog post recently called “The Shape Of Open Source” that kind of just shows really clearly the difference between the types of activities around a project as the contributor pool grows. You can see that the lion’s share of the activity goes from commits if it’s just a solo project to actually comments on code, and pull requests to actual code review, but then just comments on pull requests and issues, and replies to those issues. It just demonstrates the project’s kind of transitioned to… A lot of it becomes user support, and that’s a ton of work and it’s something that I think what that contributor role is. There’s been some nice thinking going around that, but I don’t think it’s yet kind of baked itself into changes in the way products like GitHub actually work.
Well, to wind this down a little bit and look more towards the future - are there any trends like that that you see actually growing over time? I’ll ask this to both of you… We’ve talked a lot about what the data looks like right now. If you look at the data now, compared to last year or compared to the year before, what are the biggest growth areas in terms of what this data looks like?
[01:08:47.15] Well, for me there’s an accelerating number of packages everywhere, across every package manager that is in a language that is still very active. Perl has slowed down a little bit, but most package managers seem to continue to gain more and more code. There’s just more choice and more software to keep track of and to choose which things you should use. There’s never just like, “Oh, there’s the one obvious choice for this thing.” It feels like it’s reaching the point the internet reached 10-15 years ago, where the Yahoo! curated homepage was no longer useful because they couldn’t keep up with the amount of things that they were putting in. We have the equivalent in awesome lists where people are manually adding stuff. It’s kind of like the Yahoo! Directory of the internet, whereas you need something like Google to come along and go, “Actually, here’s the things that are gonna solve your problems.”
The dependency graph does give you something like a PageRank to be able to go, “If we used a combination of links to that…”, either the GitHub page or the NPM page, and dependencies from actual software projects, you would then have a good picture of the things that are the most considered to be useful. Which is something that I’ve tried to put in, but there’s a huge amount of work to keep on top of and to build essentially Google again, but for software.
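To make the PageRank idea concrete, here is a small sketch that runs a plain power-iteration PageRank over a dependency graph, treating “A depends on B” as a link from A to B. This is not how Libraries.io computes anything; the `pagerank` function, the damping value, the iteration count and the package names are illustrative assumptions, and links from web pages are left out entirely.

```python
def pagerank(dependencies, damping=0.85, iterations=50):
    """Rank packages by how much 'dependency weight' flows into them.

    dependencies maps each project to the list of packages it depends on.
    """
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes.update(deps)
    if not nodes:
        return {}

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the same small baseline.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = dependencies.get(node, [])
            if targets:
                # Split this node's rank across the packages it depends on.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A package with no dependencies spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: two applications, one HTTP library, one utility package.
graph = {
    "app-one": ["http-lib", "core-utils"],
    "app-two": ["http-lib"],
    "http-lib": ["core-utils"],
}
for name, score in sorted(pagerank(graph).items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

In this toy graph the utility package ends up ranked above the HTTP library even though fewer projects depend on it directly, because a heavily used library depends on it; that transitive effect is what a raw count of dependents misses.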
Right. Clay Shirky has been mentioned once already on this today, but let’s mention him again - he’s like, “The problem is filter failure, not information overload.” I think currently a lot of what we’ve talked about today, it’s like it’s hard to find the right thing, because the volume of open source software is growing exponentially.
I think it’s almost becoming standard to hear some of these conversations happen. Now people are like, “Yeah, but how can we measure health? How can we know whether a project is doing well?” How is the data changing? I don’t know that the data is changing necessarily that much; I think Homebrew’s adding those metrics to capture usage, I think that’s a really good step in the right direction.
Some of this is that there’s data missing that we don’t necessarily have, and it would be better to have a more explicit measure of consumption and use of open source.
I think the other part of it, the biggest change that I’m seeing is that the conversation is moving pretty fast, and that to me speaks of a demand and a better understanding of the problem generally in the community, and I think that means that we’re likely to see product changes and improvements that help solve some of the really common issues for people.
That’s great, I’m excited!
There’s a lot of people working on that kind of area as well. Did you see the Software Heritage project that was released yesterday?
So far they’re just collecting stuff, but building those kinds of tools on top of all of that, like the Internet Archive of software, could be a really powerful way for collecting those metrics and making them distributed out and allowing people to do interesting things on top of them.
I think we’ll leave it there. Thank you all for coming on, this was amazing.
Thanks for the conversation.
Thanks very much.
Our transcripts are open source on GitHub. Improvements are welcome. 💚