GitHub and Google on Public Datasets & Google BigQuery with Arfon Smith from GitHub & Felipe Hoffa and Will Curran from Google (Changelog Interviews #209)

All Episodes

Arfon Smith from GitHub, and Felipe Hoffa & Will Curran from Google joined the show to talk about BigQuery — the big picture behind Google Cloud’s push to host public datasets, the collaboration between the two companies to expand GitHub’s public dataset, adding query capabilities that have never been possible before, example queries, and more!

Changelog++ members support our work, get closer to the metal, and make the ads disappear. Join!

84 minutes
Recorded Jun 29, 2016
Published Jun 29, 2016
Download (81MB)
Transcript
🎧 39,382

Featuring

Arfon Smith – Website, GitHub, X
Felipe Hoffa – GitHub, X
Will Curran – Website
Adam Stacoviak – Website, GitHub, LinkedIn, Mastodon, X
Jerod Santo – GitHub, LinkedIn, Mastodon, X

Sponsors

Toptal – Take control of your career and join the best at Toptal. Email Adam at adam@changelog.com for a personal introduction to our friends at Toptal.

Linode – Our cloud server of choice! This is what we built our new CMS on. Use the code changelog20 to get 2 months free!

Full Stack Fest 2016 – Early Bird tickets available until July 15. Use the code THECHANGELOG after July 15 to save 75 EUR (before taxes).

Notes & Links

📝 Edit Notes

This show was produced in collaboration with GitHub and Google to announce the big expansion to GitHub’s public dataset on BigQuery.

The Changelog #144: GitHub Archive and Changelog Nightly with Ilya Grigorik
GitHub announcement
Google Cloud Blog announcement
Google Open Source Blog announcement
Felipe Hoffa - GitHub on BigQuery: Analyze all the code
GitHub public dataset — This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
NOAA Global Surface Summary of the Day Weather Data
USA Name Data
Google BigQuery
Gist: BigQuery Examples from Arfon Smith
Shawn Pearce (Google) - the unsung hero at Google who did all the hard work getting the data pipeline working for this new dataset
Email bq-public-data@google.com to talk with Will and BigQuery’s public dataset team

Transcript

📝 Edit Transcript

Changelog

Play the audio to listen along while you enjoy the transcript. 🎧

Welcome back everyone. This is the Changelog and I am your host Adam Stacoviak. This is episode 209 and today Jerod and I have an awesome show for you. We talked to GitHub and Google about this new collaboration they have. We talked to Arfon Smith from GitHub, Felipe Hoffa from Google and Will Curran from Google. We talked about Google BigQuery, the big picture behind Google Cloud’s push to host public data sets for BigQuery as the usable front end. We talked about the collaboration between Google and GitHub to host GitHub’s public dataset, adding querying abilities to GitHub’s data that’s never been possible before. We have three sponsors today. Toptal, Linode, our cloud server of choice and Full Stack Fest.

Alright, we are back. We’ve got a fun show here… I mean Jerod, we’ve got some back story to tell, a little bit to kind of tee this up. Back in episode 144 we talked to Ilya Grigorik, a huge friend of the show; we’ve had Ilya on the show I think three times now, is that right?

I think that’s right. In fact, we are gonna have him on this show as well. We have three awesome guests and we figured we’d let them take the spotlight, since they have been highly involved in the project as well as Ilya.

Right. So we’ve got GitHub and Google coming together, Google cloud specifically, along with Google BigQuery. Fun announcement around data sets that run GitHub, opening those up, BigQuery… We use BigQuery actually as sort of a byproduct of previous work from Ilya with was GitHub Archive, and we worked with him to take over the email that was coming from that, and now we call that Changelog Nightly. So that’s kind of interesting…

Yeah. in fact we had a brief hiccup in the transition, but one that we are happy to work around. What they have been doing behind the scenes is making GitHub Archive and the Google BigQuery access to GitHub lots more interesting. We are gonna hear all about that.

Absolutely. So without further ado, we’ve got Felipe Hoffa, Arfon Smith and Will Curran. Felipe and Will are from Google and Arfon, as you may know, is from GitHub. Fellas, welcome to the show.

Hi. Thanks for having me.

Hello there.

Nice to be here.

So I guess maybe just for voices sake, and for the listeners sake, since we have three additional people in this show and it’s always difficult to sort of navigate voices, let’s take turns and intro you guys. I got you from top to bottom, Felipe, Arfon, Will. So we’ll go in that order. So Felipe give us a brief rundown of who you are and what you do at Google.

Hello there. I am Felipe Hoffa and I am a developer advocate, specifically for Google Cloud and I do a lot of big data and a lot with BigQuery.

And Arfon, how about you bud?

Yeah. So my name is Arfon Smith and I am GitHub’s program manager for open source data, so it’s my job to think about ways in which we can be sort of more proactive about releasing data products to the world and this is what we are gonna talk about today, it’s a perfect example of that.

Awesome. And Will, how about you?

[04:00] Hi there, this is Will Curran. I am a program manager for Google Cloud platform and I am specifically working on the cloud partner engineering team. So my role is in the big data space and storage space, to help us do product integrations with different partners and organizations.

The main point of this show here in particular is obviously touching back on how we are using GitHub Archive, but then also how you two are coming together to make public datasets around GitHub available, collecting these datasets, showing them off. I am assuming a lot of new API changes around BigQuery. Who wants to help us share the story of what BigQuery is, catch us up on the idea of it, hosting data sets, what’s happening here? What’s this announcement about?

So we can start with what are we doing with GitHub or what is BigQuery?

Let’s start with the big picture, BigQuery. Public data sets, Will… And this is a big initiative of yours at Google. GitHub wanted those public datasets, but give us the big context of what y’all are up to with the public data sets?

It started with Felipe. He has been working for a while now with the community and different organizations to publish a variety of public datasets and we’ve got a lot of great feedback from both users and data providers. And one of the things they have said is that they want more support for public data sets in terms of resourcing and attention so that they can get more support for not just for hosting those datasets, but for maintaining them, which is our biggest challenge right now. So we developed a formal program at Google Cloud platform to launch a set of datasets that Felipe had been working on for a while, and we launched those at GCP Next, earlier this year. So the program basically provides funds or data providers to host their data on Google Cloud as well as the resources to maintain those datasets over time so that there’s current data. So the program allows us to host a much larger number of datasets and bigger data sets, and currently we are focused on growing the available structured data sets for BigQuery, but then we’ll start adding more binary data sets to Google Cloud storage. As an example, Landsat data would be a binary data set that we are looking to onboard. And then that brings us to this week’s announcement around our GitHub collaboration.

I would love to highlight this about BigQuery. We can find open data all over the internet - that is awesome. But what’s special about data shared on BigQuery is that anyone can go and immediately analyze it. Everywhere else you have to start by downloading this data or by using certain APIs that restrict what you can do. When people share data on BigQuery, like for example the GitHub archive that Ilya has been sharing for all this time, they say it is available for immediate querying by anyone and you can query it in any way you want. You can basically run a full table scan that runs in seconds without you having to wait hours or days to download and analyze the data at your home.

It kind of reminds me of The Martian. The guy’s like: “Hey, I need to do a big analysis on the trajectory of the orbits” and stuff like that; if anybody’s seen The Martian, he’s like, “I need supercomputer access.” It seems kind of like supercomputer access to any dataset if that’s what you want.

Exactly. Once we have the data set in BigQuery, anyone… Like, you just need to login. Everyone has a free terabyte every month to query, has access to basically a supercomputer that is able to analyze terabytes of data in seconds, just for you.

I know one of the things that - and Jerod, you can back me up on this, with piggybacking off of Ilya’s work with GitHub archive and now Changelog nightly is that email - that wouldn’t be possible without BigQuery, because those queries happen so fast, it takes so much effort on the computer’s part and effort to get those queries on that big data set. [08:07] I mean, that’s pretty interesting. I like that.

Yes, so Ilya was the one that started sharing data on BigQuery. As he told you episode 144, he was collecting all these files, he was extracting from GitHub all the logs and BigQuery was opening up as a product at the time. He chose BigQuery to share this data set, and since then we have shared a lot more data sets in BigQuery. All the New York City taxi trips, Reddit comments, Hacker News etc. and you are able to analyze it. And now what we are doing with Will, is a roll this into a formal program, to get more data, to share more data, to make it more awesome for everyone.

So those are interesting data sets. Will, maybe give us a few more interesting ones, specifically that would be cool for developers and hackers to look at and perhaps build things with. Either ones that you guys have currently opened up since our last show, which was February 2015 - quite a bit ago - or things are you hoping to open up, that would be interesting for developers.

One of the ones I like using myself is the NOAA GSOD data. I have a lot of interest around climate change themes and topics, and what I found interesting with that dataset, and Felipe did some great documentation on how to actually leverage that data, is you can go right in there and instantly get, in a matter of seconds, the coldest temperatures recorded over time, and they have been tracking it back since like the 1920s, and the hottest ones, and immediately you can see the trends that everybody’s talking about, where in the past decade or so we have hit a lot of record temperatures that have not been seen in previous decades. It’s kind of exciting just to be able to ike pick up a data set like that and validate a lot of the science in news that you are reading, right?

That is interesting. I was gonna say, how do you go ahead and get started with that, but maybe we’ll save that for the end of the conversation once everybody’s appetites are sufficiently whetted. Let’s talk about the subject at hand, which is this new GitHub data. We have had since Ilya set up the GitHub Archive back in the day, we have had some GitHub data which was specifically around the events history and issue comments and what not, but y’all been working hard behind the scenes, both Google and GitHub together, to make it a lot more useful. So maybe Arfon, let’s hear from you, the big news that you guys are happy to announce today.

Yeah, as you kind of will be well aware with the existing GitHub archive, you know the GitHub API, spews out all these events, like hundreds and hundreds per second of public kind of record of things happening on GitHub. Things like when people push code, when people start a repo, when orgs are created - all these kinds of things that already happened, and these are just JSON kind of blobs that come out of the API. So the GitHub archive have been collecting those for about five years now. But what we are adding to that is the actual content that these events describe. If you had a push event in the past, so somebody pushing code up to GitHub, you had to go back to the GitHub API to go and get the file, for example, a copy of what was actually changed. What we are actually adding to BigQuery is a few more tables, but these tables are really, really big. So we have got a table full of commits. Every commit now has the full message, from the author, the files modified, the files removed, all the information about the commit and the source repo. [12:01] That’s about 145 million rows. It’s probably more now, it’s probably upwards of over 150 million. We’ve got another table which has all the file contents. All of these projects on GitHub that have an open source license, the license allows the third parties to take a copy of this code and go off and do things with it; that’s kind of one of the great things about open source. So there is now a copy of these files in BigQuery tables. This is the big one, this is about 3 terabytes of raw data that has the full file contents of the object that was touched in the repository on GitHub, and I’m sure we’ll dive into the possibilities of what you can do with that.

And in addition there is another table which basically has a full mapping of all of the files at Git HEAD in the repository, a mapping of all the files and all their paths, and joining them to their file content. There are about 2 billion of those file paths. So basically we got his kind of vast network of files, commits and now also the contents of those files sitting ready to query in BigQuery. I think we are upwards of about 3 terabytes data set here, and it’s the biggest data release that we have ever made.

That’s awesome… It sounds like a lot of work. I’m just sitting here thinking, “Man, it’s a lot of work even describing it.” I’m sure both sides have put a lot of effort in it. Can you describe the partnership, the way you worked together, the two companies and from your perspective what all went into making this happen?

Sure. So I’ll start, but I’m sure there’s more detail to come from Felipe as well on this. So the unsung hero of today’s call is… Well, two really - Ilya of course, but a guy called Shawn Pearce, who works in the open source office at Google. So, you know, the desire for data from GitHub is kind of like a general request we get from large companies who are doing a lot of open source. We get that from Google, regularly pulling data to analyze their own open source projects on GitHub; so Shawn had actually done some early work, exploring pulling these commits into BigQuery. He started to kind of build out a pipeline to help monitor their open source projects. But we have pretty good regular conversations with him and the team he is in, and so I think it just came up in one conversation back in February. He was like, “Hey, by the way, I have been working on this thing… We have this public data set program that is growing and this would make a great data set to have available in BigQuery. What do you think?” And we jumped at the chance to get involved.

We spent a few months in development to make sure that the pipelines all working, but the lion’s share of the work has been done by Shawn on the data pipeline, which I think runs every week to update this, but Felipe, could you remind us if that’s the case?

Yes, at least today it is set up to run every week. So this snapshot will be updated every week with the latest files, details in GitHub.

I have a quick story about the partnership. When I was first approached with this, it was Shawn, and I got introduced to Arfon, and one of the first questions I asked when I talked to a data provider about if this is going to be useful or whatever given the backlog that we have, I asked, “Can you send me a sample query that shows how this’ll be useful to users?” and one of the first queries that Arfon sent was “the number this should never happen” [laughter]. [16:07] I knew it was going to be fun just working with this data. I have just run the query actually, after our last load here, and we are not quite at a million times yet, but we are getting close.

What do you mean by that “shouldn’t have happened”?

That’s the number of times in this dataset that someone has committed a comment that says “This should never happen.”

Oh, gotcha.

So it says that in the commit message or is it actually in the code comments?

In the code.

In the code?

Yeah. It’s like rescuing every error you could possibly imagine. This will never happen. This should never happen.

We are almost at a million.

Right. And so you are like “Yeah, okay…”, but it’s in there. There was a thing on Hacker News a few months with this kind of thing. I think somebody demonstrated that; I think they did a search on the GitHub site, on our standard search, to say, “Let’s see how many times something should never happen.” Now you can do this with kind of looking at particular language types as well, and do much more powerful searches. That’s one of the things that is kind of fun about the data.

That’s a great use case. What I am excited about this is especially getting it out to our audience and to the whole developer community. There’s all these new opportunities and use cases, and things that we collectively couldn’t know previously, and we can start to know, by people asking different questions that I wouldn’t have thought of, or you wouldn’t have thought of. We are going to take quick break, but on the other side, what we wanna know is what all does this open up? Obviously there’s things that we haven’t thought of yet, but what’s the low hanging fruit that’s cool, that you can do now? You can ask these questions and you can get answers that you couldn’t previously get. So I’ll just tee that up, and when we get on the other side of the break we’ll talk about it.

Alright, we’re back with quite a crew here, talking about big data, Google BigQuery, GitHub… Fun stuff. In the wings when take these breaks, we often have side conversations and it had just occurred to us that everyone on this call is in a unique place. For example Felipe, you’re up in the YouTube studios in New York because you are at a conference up there. Arfon, you’re in a truck outside of a Starbucks in Canada, while you are digital nomading with your family in your travel trailer, and you’ve got a super fast internet connection. And Will, you’re where you should be, in Seattle, in your home office there, in Google studio there… So it’s kind of interesting. So Arfon, what’s unique about where you’re at right now, I guess?

Well, the speed of the internet is remarkable. As I say, outside Starbucks with about a hundred megabit connection, so that’s pretty great.

That’s unheard of.

Yeah, so I can report that the Canadians have better Starbucks Wi-Fi internet than the Chicagoans, which is where I have lived for the last four years. What else is unique… It’s lovely and sunny, but I have only been in Canada for three days, so I have no idea if it’s regularly sunny here. But yeah, it’s really nice.

And the good thing for us with this scenario for you is that we get to capitalize on a great recording because you sound great, it’s going great. We don’t have any glitches whatsoever, so thanks Starbucks for superfast internet connections in Canada. We appreciate that.

[20:02] Yeah, it’s sponsored by Starbucks. I probably can’t say that, right?

We’ll have to reach out to their PR department or their marketing department to send them a bill for this show or something like that, but on to the more fun stuff, though. So Jerod teed this up before we went into the break, but big story here. Google BigQuery has been out there, we are aware of it, but now we are able to do more things than we have ever been able to do before. So let’s dive into some of these things… What are some of the things you can do now with this partnership, with this new dataset being available there, the four terabytes or three terabytes of public data being available there - what can you now do that you couldn’t do before?

The beauty is that anyone can do it… So it’s not just me, but anyone; it’s open data. But just having access, being able to see two billion files, to be able to analyze them at the same time is really, really awesome. For example, let’s say you are the author of a popular open source library. You can go and find every project that uses it, and not only that they are using it, but how they are using it. So you can go and see exactly what patterns are people using, what are the doing wrong, where they are getting stuck, and you can base your decisions on the actual code that people are writing.

Yeah, I think the kind of insight into how software that maybe you maintain is being used is one of the most powerful ones that I think of here. Because for example, say you are wanting to make a breaking change to your API - actually one of the project I maintain on behalf of GitHub, a project called Linguist, we wanna change one of the core methods, actually one that detects the language of the file. We wanna change its name and we wanna re-architect some of the library, and we know it’s a breaking change to the API and we have had deprecation warnings out for 12 months, but honestly being able to run a query that sees how many times people are actually using that API method still, helps me as a maintainer understand the downstream impact of my changes. And currently, that’s just not been possible before. And of course, you can’t see what’s going on in private source code, but a lot of this stuff is in open source repos as well. Being able to drill down into source code, all of the open source code that’s on GitHub… And for me the other kind of killer feature is like, to be able to do this you wanna write a regular expression of some kind right, and being able to run regex across four terabytes of data or three terabytes of data - we should actually figure out what the exact number is, it increases daily of course, but being able to run a regex against all that data is incredibly powerful and something that has just not been possible before.

A while back we had Daniel Stenberg on the show, he is the author of curl and libcurl, of course, and we asked him at that time, “How do you know who your users are, how do you speak to your users and ask them things?” and really he said “I have no idea.” First of all curl is so popular, it’s kind of like SQLite- the world is his users… But he didn’t really know how people were using his library, but with something like this… Like you said, it’s only the public repos, of course; we wouldn’t wanna expose the private repos to big data, but he can actually just go to BigQuery and look for how many people are including libcurl, linking to it in their open source… And not just that - he can also, like you said, look at very specific method signatures or how they are using it. He can inside. Now, it’s not 100% the truth because he’s got way more users than just open source, but it’s at least a proxy for reality. Is that fair to say?

[23:59] Yeah. And there are fun things you can do as well. We are sharing some example queries that we have authored as a group, but of course, you know there’s unlimited possibilities here, but you can also look at most common emojis used in commit messages, and silly stuff like that. So there’s less serious things you could do as well that would also currently be pretty difficult. But yeah, being able to drill down and understand how people are using stuff is extremely important to many people.

Actually, one use case that is very near and dear to my heart… I mean, everyone’s interested if people are using their stuff, but some people actually have to report that, because maybe… One particular use case that I am very familiar with is people who have received funding to develop software - maybe academic researchers who develop code - they’ll have funding from maybe the National Science Foundation, and the only thing that matters really to the NSF is what was the impact of that software, and it’s really hard to answer that question, like how many people are using your stuff. You can maybe say “Oh, well it’s got 400 forks.” Now, I would say anything that has been forked 400 times is pretty popular, but it doesn’t actually mean it’s being used; it’s kind of a weak kind of signal of usage, whereas an actual “I can show you, I can give you the URL of every downstream consumer of my software, and it’s being used by 50 different universities” or whatever… But being able to give people the opportunity to actually report usage is interesting and fun for a lot of people, but actually mission-critical for many people as well.

We get a lot of requests at GitHub from specifically researchers who are trying to demonstrate how much their stuff have been used. It’s really been hard to sort out those requests in the past, but I think we are going to be in a much better position to do that now.

Another interesting use case, Felipe maybe you can speak to this one, probably exciting both for white hats and black hats alike, is an easy way of finding who and what exactly is using code that’s vulnerable to attack. Can you speak to that?

Yes, so I’m super excited about that. Security wise, if you are able to find and fix the problem in your source code, that’s cool, but if you are able to find the same pattern, the same buggy code or potential vulnerabilities, with BigQuery you will be able to find it all around GitHub’s open source code, and just send patches, contact the project owners, open an issue… But now you are able to do this, and things get really, really crazy with the kind of things you can do. With BigQuery you can write SQL. SQL is powerful, but you can only do some limited amount of operations, you can write regular expressions. But with BigQuery we also open the space up to user-defined functions written in JavaScript. For example, there is this JavaScript static code analyzer called JSHint, and I’m running it now inside BigQuery just to analyze other JavaScript code and find, for example, all of the unused variables. Like, you cannot do that with a regular expression; if you try to run this in your computer it would take days, but with BigQuery you are able to just actually analyze the flow of the code, and all of the unused variables over the libraries being used. So it’s getting really crazy. I am now getting to maybe the boundaries of what we can do with BigQuery, but I’m really looking forward to what people will build up on this.

[28:08] Let’s focus on security aspect once again, with regards to the black hats. A naysayer of this type of available data is that now you have a zero day come out or… Well, let’s just call it zero days released, and now this enables - whether it’s a script kiddie or somebody who is more capable can go out and just fuzz the entire internet for vulnerable things, but they can actually know exactly what line of code in a particular project is taking this input. So people can go out and pull requests, people can also go out and hack each other. Do you have concerns about that?

Well, I believe in humanity on one side. [laughter] I think there are more good people than bad people, and usually people when they attacking, they are more focused on particular projects. On the defense side, here we’re giving the ability to people that want to make the project stronger, we are giving them the ability to identify everywhere where the potential problems are and harden these open source projects. That’s one of the beauties of open source. Yes, it makes problems more visible, but by making them visible, you have more eyes looking at them. Now, with having all this source code visible in BigQuery, we are just making people that want to look for problems - we are giving the tools to find them easily, in an easier way, and fix them.

Yeah. I look at it very much like it’s a tool. You can use a tool for bad, you can use it for good, and if anything what this does is it ups the ante or it speeds up the game, so to speak. So both sides can use it. I would imagine, if you think about believing in humanity, the good people, it just takes one to person to go out there and write a program that can use this dataset, query BigQuery for a specific string of code, and automatically find that across all of the repos on GitHub and open a pull request, just notifying them of the vulnerabilities. In a sec or moments, without any user interaction, I think we’ll see stuff like that start to pop up, which is pretty exciting.

Exactly. We always tell people within open source that more eyes means more secure code, and that benefits a lot of open source projects. But if you have a very obscure open source project, maybe no one will look at it. Maybe no one will be looking out to harden your code. But this gives a lot more people the ability to look into your obscure project, because it will be eyes looking everywhere.

Well, just think about it now. Right now we have not so much no eyeballs, but very little eyeballs, because the process to have such knowledge is difficult… Whereas with this partnership, this data set available on BigQuery and the good stuff, now people have a much easier way to find these insights, and then obviously knowledge is power, so in this case I’m on Felipe’s side, Jerod. I am kind of - not a naysayer, so to speak… I’m like, “Do it!” because I think about… In this show that’s gonna come out after this we talk to Peter Hedenskog of Wikipedia about site speed, and we talked a lot about automating reporting of site performance, and this is similar to your point, Jerod, where you said “Could we automate some things where pull request is automatically opened up?” I think about the automation tools that may be able to take place on the security side to say, “Okay, here is a vulnerability.” [32:02] It also opens up another topic I wanna bring up, which is not just the GitHub data store but other data stores, or code stores like Bitbucket or GitLab, having similar data sets on BigQuery and how that might open up insights to all stores, into all the major stores. But long story short, automating those kinds of things to the open source out there, that’s an interesting topic to me.

I was gonna say, a fun experiment is - actually don’t do this, I’m not recommending this… [Laughter] But if you commit a public access token from your GitHub profile into your public repo, you’ll get an email from us within about a second, saying we disabled that for you because you probably didn’t wanna do that. So I think there’s actually… Scanning and making open source more secure is something that we care a lot about. We think it’s in everybody’s interest, we think about software is best when it’s open and so… We have all committed stuff accidentally and had to rewrite history; you know, humans are humans, so I think the things that we can do to improve tools to help people stay safe and help their applications stay safe, I think that’s really, really important. We do that currently for GitHub tokens, but you can imagine… I should probably want the same level of service if I commit, you know, an Amazon token or a Google Cloud token or whatever it is. Something that exposes me. That’s a kind of generically interesting area to work on.

So I think more eyes on open source is showing how data can be used to make people more secure. I think this just helps sort of accelerate improvements to things like GitHub, by making data more open.

One facet of this that we should definitely mention is that the data set that is provided is not real time, so when we talk about zero days or code that is currently vulnerable, you do have a lag time between when that snapshot is created. Now, previously you had told us it was two weeks and now Felipe’s telling us it’s one week, so apparently you all have gotten better at this since we even talked last.

50%!

Yeah, so that’s nice. I’m curious if there’s ever a goal to make that a nightly thing or if a week is good enough. What are your thoughts on that?

I mean, I would love to see… I think an obvious thing to do with big archives of data is to improve the frequency at which they are being refreshed. I would love to see these things get more and more close to live. Yes, so it’s how often the job runs. I think the job takes about 20 hours to run currently. We are going to hit a limit of how quickly the pipeline can run, but maybe it can be paralyzed further. I don’t know, Felipe do you recall how long it takes to do this right now?

What I can say is that things can only get better. [laughter] It’s amazing how things just improve while I’m not looking.

It’s our current bottleneck in data warehousing and analytics, and so you can expect that all cloud providers are gonna be optimizing for that, and getting as close to real time as possible.

What does it take… Can someone walk us through the process of capturing the data set, whether it dumps down to a file? What’s the process? Maybe even Arfon, on your side, what inside of GitHub had to change to support this? What new software had to be built? Walk us through the process of the data becoming available and then actually moving into BigQuery. What’s that process like? Walk us through all the steps.

[36:11] From GitHub side actually very little changed. I’m probably not the best person to talk to about the process of actually doing the data capture. I mean, we do regularly increase API limits for large API customers, so I think we did that… But Felipe, do you have more detail on this?

Yeah, let me make a parallel with the story Ilya told you when he was back here earlier last year. First he started looking at the GitHub’s public API, he started logging all of these log messages, and once he had these files he had to find a place where to store them, analyze them and to share them, and the answer was BigQuery. Now in 2016 we have a similar problem, just bigger. It starts by taking a mirror of GitHub, using their public API, looking at GitHub’s change history. Once you start mirroring this, you have a lot of files. Then the question becomes where do I store them? Where can I analyze them? Where can I share them with other people? That’s where Shawn Pearce is a superstar that writes these pipelines today - one mirror GitHub - and then putting it inside BigQuery as relational tables. That’s basically the Google magic in summary. But it takes a lot of mapreduces and doing things at Google scale to be able to just: “Oh yes, I downloaded, I made a mirror of all of GitHub”.

Right. I guess the thing I’m trying to figure out is what makes it take a week? What’s the latency in terms of capturing to querying inside of BigQuery? That’s what I’m trying to figure out like. What’s the process to get it there? It’s a good story there, but why does it take a week?

No, I think it may closer to a day, but it’s all about how many machines you have to do this. You want passive results, you just keep adding machines to it and then it becomes a question of how much quota do you have inside Google versus other projects.

And I hate to further compressing the time, like we are making changes now, but I think we are down to six hours in terms of the pipeline

Really? So we had a conversation a week ago, basically to tee at this conversation; it was two weeks then, then we thought it was a week, and now it’s six hours.

By the time this show ends, it’s gonna be real time. [laughter]

Yeah. Good job, Will. [laughter]

Felipe is actually coding right now as we talk, so…

Shawn is a start, but it’s all about getting more machine resources for the project and the more people use this dataset, the more important it becomes, and we start putting more resources in it. I’m really, really looking forward to what the community will do with this data and the toolset we developed over BigQuery to be able to just analyze the data in place.

So I have a good example of a question that is currently pretty much impossible to answer without this data set, if you’re interested. S

Absolutely.

So I was talking to a researcher about six months ago, and he was trying to answer the question - if you read a 101, getting started in open source, like how do you create a successful open source project, people will tell you it’s very important that you have good documentation. You wanna have your API documented, you wanna have a good README. And he was like: “You know what, I have used software where the documentation is really poor, but it’s still really popular and over time I have seen the documentation improve.” So his question was, “Is documentation a crucial component of a project becoming successful, becoming widely used?” [40:09] And to answer that question, you kind of need a timeline of every commit on the project. You probably wanna know the file path, what was in the file… Let’s say documentation in GitHub’s world is markdown, AsciiDoc, restructuredtext - even just those three extensions would probably represent about 95% of all documentation. So you can look at what code are most docs, but you can’t do that queries today.

As an individual, you would have to go and pull down, you’d have to Git clone thousands, hundreds of thousands maybe of repos from GitHub, store them locally and then write something that would allow you to programmatically go through all these Git repos, building out all these histories. These histories are now in BigQuery. I am not saying that I know exactly how to write that query, but the data is there, it’s possible now to answer this question.

And I think one of the most exciting things about this dataset is I think there is still a huge amount to be learnt about how people build software best together, and I think that’s not something that necessarily… The really hard questions, I think, are often best answered by people like computational social scientists, people who study how people collaborate, and they need really, really big data sets to do these studies. And today it’s just not really realistic for GitHub to… GitHub’s API is just not designed to serve those kinds of requests. It’s designed for building applications. I think we are gonna see a huge kind of uptick in the amount of really data-intensive research about collaboration or about open source software or about people best work together, powered by this data.

Yeah, that’s very exciting. And as people who are very much invested in watching the open source community do their thing and tracking it over time, I am excited about all the possibilities that are going to be opened up. I even think of just when GitHub Archive came out and all of a sudden we started having cool visualizations and charts and graphs, and people putting answers together that we didn’t know we could ask questions about, and now we have so much more. That’s super awesome.

I think what we are gonna tee up for the next section is BigQuery itself, because it does seem like a little bit of a black box from the outside. Like, how do you use it? How do you get started? How long do the queries take? There’s a free tier, there’s a paid tier. I would like to unpack that so that everybody who is excited about this new dataset can, at the end of this show, go start using it and check it out. We’ll talk about it when we get back.

Alright we are back, talking about BigQuery, GitHub, public data sets, all that fun stuff. Felipe, tell us about BigQuery. How do you use it?

BigQuery is hosted too by Google Cloud, so you just go to BigQuery.cloud.google.com. Basically it’s there, open, ready for you to use, to analyze any of the open data sets, or to put your own data. Just in case you are wondering if it’s only for open data, nope, you can also load your private data and it’s absolutely secure, private etc. But with open data you can just slide there and start query. Now, you will probably need to have a Google Cloud account. So if you don’t have one, you will need to follow the process there to create and open your Google Cloud account, but then you will be able to use BigQuery to analyze data and everyone can analyze up to a terabyte every month, without needing a credit card or anything.

You can choose which dataset to start with. I wrote a guide about how to query Wikipedia’s logs, those are pretty fun. But in the case if we want to analyze GitHub, we can go to the GitHub tables to find some interesting queries, where we have the announcement on the GitHub blog, on the Google Cloud big data blog… I’m writing a Medium post where I’m collecting all of the other articles I’m finding around. You will want some queries to start with.

Then the question is what questions do you want to ask. You have these tables that Arfon described at the beginning. One of the most interesting tables is the one with all of the contents of GitHub. So this has all of the open source GitHub files, a list of one megabyte, and that table has around 1.7 terabytes of data. And that’s a lot, especially if you are using your free quota. If you query that table directly, your free quota will be out in immediately.

So thinking of that, we created at first a sample table that’s much smaller. Let me check the size right now, I have it with me. I’ll tell you the exact size in a minute.

The things is, you can go to this table and you can run the same queries that you would run on the full table, but your quota, your allowance, your monthly terabyte will last way more. You can choose to run all your analysis there on the sample, and then bring it back it to the mega table, but it all depends what questions you’re asking.

I also created - this is outside the main project, but in my private space that I’m sharing, I created an extract of all of the JavaScript files, all of the PHP files, Python, Ruby, Java, Go. [48:07] So if you are interested in analyzing Java code, you might be better off starting from my table. And then you can start asking the questions you might have, or at least start with one of these sample queries.

A couple of things, let me interject here. So all of these things that Felipe is referencing, we’ll have them linked up in the show notes; so if you are listening along and have the show notes there, pop them open; we’ll have example queries and all the posts, both from Google and GitHub published around this. So that’s probably a good place to go. You mentioned your monthly allotment, or your threshold, I can’t remember the exact word, but your quota.

Yes.

Let’s talk about that. So BigQuery is free up to a certain point, and then you start paying. The reason for this example data set, which is smaller, is because if you’re just gonna run test queries against the whole GitHub repos dataset, you are gonna hit up against that pretty soon. Can you talk about that? Even as a user - we have Changelog Nightly going, and have for a couple of years now - we’ve never gotten charged, so I guess we are inside of our quota, but I don’t have much of an insight into what we are doing. How does the payment work and the quota? Is it based on how data you have processed?

Exactly. So BigQuery is always on. At least compared to any other tool, you don’t need to think about how many CPUs or how much RAM or how many hours you are running it, it’s just on always. Then the way it charges you is by how much data you’re querying. It looks at the tables you are querying, specifically at the columns, and the size of those columns. And that’s basically the price of a query.

So if a column is one gig or something like that, or half a terabyte, then you are essentially being charged to query half a terabyte?

Exactly. So today the price of a query is five dollars per terabyte query; so if a column is one gigabyte, divide by $5 by 1000 and that’s the price of your query, the cost of your query.

So assume I got my question asked. I used the GitHub examples data set or the subset for my development, and I have a query here; in fact, from some of your guys’ examples, here’s one. Let’s say it’s the “How many times shouldn’t it happen” one that Will talked about earlier. It appears that this thing pulls from GitHub repos dot sample files, and GitHub repo adjoins GitHub repos dot sample contents. So every time I actually run that in production, it’s going to add up the size of those two particular things and then charge me once per time I hit the BigQuery. Is that right?

Exactly. When you write a query, before running it, you can see how much data that query will process.

That’s handy.

Yeah, because basically it’s a static analysis. You have the columns from the tables we’ve mentioned, and then BigQuery knows basically the exact price.

I’m just thinking outside the box because you all have Adsense, and the way people buy ads, that you may actually have a bidding war at some point - or not so much a bidding war, but you might be able to have something where I wanna query these things several times a month, but I have a budget, and I’ll query them if it’s under this budget, and you might be able to do those queries if said budget is not met or is exceeded. That seems like something in the near future, especially as we talk about automation around this.

[52:00] Yes. The idea here is to make pricing very, very simple. If you are able to know the price of your query before running it, then you can choose to run it or not. It’s about, essentially, instead of querying the whole data set, instead of querying the full contents table (1.7 terabytes), let’s just query all of the Java files. If someone has not created the extract you need, maybe the best step on your analysis is extracting the data that you want to analyze.

Do you feel like you have any pushback at all for a higher free threshold for open data sets? Because there is always this sort of push, or this angst, I guess, where if you are doing something for the good of open source or something that is free for the world or just analysis, someone is always like, “Hey, can you make this thing free for open source?” And since this show is specifically about this partnership and the GitHub public data set being available, what are your thoughts on the pushback you might get from the listeners who are like, “This is awesome, I wanna do it. Can I have a higher limit?”

At least what makes me pretty happy is that we are able to offer this monthly quota to everyone. It doesn’t stop, it’s not for the first few days. You have access to this at least until, I don’t know… Every month you will be having this terabyte back to run analysis, and that’s pretty cool on one side. Then, if you want to consume a lot of resources, at least you are able to, instead of having to wait one month, at least you’ll have the freedom to have even more resources available.

And just to context that, because I agree… In cloud we’re continually getting feedback, and then just based on competition to reduce pricing and make things more optimized and more efficient and cost-effective. So where we were just a moment ago really was without BigQuery, in order to do analysis on any data set, you would have to go find that data, you would have to download that data and possibly play some sort of egress; you would have to upload it into your own storage on whatever cloud provider you are using - and there is a cost there - and then you’d have the consumption for doing any query on it. It’s a valid question, but right now we’ve already reduced the cost for public users, and I fully expect that yeah, people will be asking for more higher limits on querying the data, and I just expect we will continue moving and making things cheaper and more efficient for users.

I think that the steps you just mentioned there… For one, telling people “This is what it actually takes to do this without BigQuery”, and now that BigQuery is here we have taken so many steps out of the equation. You have obviously got Google Cloud behind it, the supercomputer that we have talked about in part one of the show basically, having access to that. And I think just helping the general public who is gonna have a clear interest in this, especially listeners of this show… Everyone who listens to this show is either a developer or is an aspiring developer, so they are listening to this show with a developer’s mind, so to speak. So they are thinking, “If I wanna use this, how can I use it?” But knowing the steps and knowing the pieces behind the scenes to make this all possible, it definitely helps connect the dots for us.

And what’s great about working with Google is it’s really in our core mission. Google’s core mission is to organize the world’s information to make it universally available. [56:02] So for the public data program, this is a natural extension of that mission within the cloud organization. I see these public data sets plus tools like BigQuery as - and I know this word gets overused, but its democratizing information even further. We’ve all been these unknowing, or knowing, or involuntary collaborators in providing public data, and so I like the idea that we all have equal access in these public data programs, and we are now getting meaningful access to that data.

Today we are doing a better job at making the data available for download. See data.gov for example. Public data is pretty accessible now, and so I think the next step though, and going back to that comment I made about meaningful, is to provide the tools that lower that ramp even further, and gives all these collaborators meaningful access.

We are starting with SQL, which for most developers and marketers is a pretty good level of entry for querying enormous sets of data, but I think we are gonna end with machine language powered speech queries. Felipe, Arfon and I aren’t talking about these queries that you have to construct, and managing your limits on the data; we are actually telling you just to ask the machine, the dataset, a question.

Let’s continue a little bit on the practical side of how you get that done. You mentioned the console, which is where you can write your queries and test your queries and run them. There’s other ways that you can use BigQuery as well, once have those queries written, for instance with Changelog Nightly… We’re not going into the console and running that query every night and shipping off an email - it’s all programmatic. So can you tell us what it looks like from the API side? How do you use BigQuery, not using the console?

Yeah, so BigQuery has a very simple to use REST API for the people that want to write code around it. Now we have a lot of tools that connect to BigQuery; Tableau is one of the big ones. In specifically open data we have a partnership with Looker, some of our public data sets that we are hosting with Will. We have a specifically Looker Dashboard built over them. I love Re:dash for writing dashboards, and that’s a dashboard software that was not created for BigQuery at all, but it was open source. People loved it, people started sending patches so it connects to BigQuery, so now you can use Re:dash to analyze BigQuery data. I just love using that one.

The new Google Data Tools is also a pretty easy way to just create dashboards. I’m sharing one of these dashboards specifically for GitHub, this GitHub data set tool. So yeah, you don’t need to know SQL. I just love SQL, but you can connect it to all kinds of tools, and also to other platforms like Pandas or R etc. Once you have a REST API, you can just connect to anything.

One last question on this line of conversation - we talked about how long it takes to process, to get the data into BigQuery. It was two weeks, then it was a week, then it was 20 hours and now it’s six hours. How about querying it? What do we expect if we are gonna do the GitHub, the full Monty, like this query for emoji used in commit messages, for instance? However many terabytes it covers… Are we talking like three seconds, 30 seconds, minutes? What do we expect?

[01:00:08.15] Depends a lot on what you are doing. Here we are really testing the boundaries of BigQuery. You can go way beyond doing just a grep; you can look at every word in every piece of code, split it, count it, group it, or a regular expression.

Some queries will take seconds, I love those. I love being able to just go on stage and just start with any crazy idea, code it and have the results while I’m standing out there. But sometimes there are queries that are more complex, that involve joining two huge tables together. BigQuery can do these joints, but when reaching the boundaries, it’s good to limit how much data you query for something.

I have this pretty interesting query that might take two minutes. What about if, just to get very quick results, we sample only ten percent of the data or one percent, and things start running a lot faster. But it’s really cool… On one hand you feel that, “Oh, I’m reaching one of the boundaries”, but at the same time you feel that, “Wow, I’m really doing a lot here.” Let me see if I can run a query now, while we talk. I’ll come back when I get my query.

Felipe, maybe you can multitask on that, sure, but let’s test you out. Earlier in the show, we were actually in the break, we talked about some things you have some affinities for, for what the possibilities of BigQuery and all these data sets being available might offer, and one of them you mentioned was being able to cross-examine in data sets. So for example you had said how weather may affect, I think it might have been pushes to GitHub or pushes to open source or something like that, but basically how you’re able to capture various large public data sets that maybe like traffic patterns weather, and the ability to deploy code or push code to GitHub… But what other ideas do you have around and what are some of your dreams for cross- examining data sets?

Just to answer the question, because I told you I was gonna come back to this, I copy pasted one of the sample queries. In this case we are looking at the sample tables with the sample contents. This basically has 30 gigabytes of code. I’m looking only at the Go files in this case, and I’m looking at the most popular imports for GO. Basically this query over 30 gigabytes ran in 5 secs.

Not too shabby.

That’s fast.

Yeah, that’s how cool things get. Yes, so going to back to dreams. Just seeing data in BigQuery, seeing people share data here whets my appetite for how can I join different data sets. For example, something I ran last year when I got all of Hacker News inside BigQuery, the whole history of comments and posts, was to see how being mentioned on Hacker News affected the number of stars you got on GitHub. I can send you that link too.

Or you can also have the public data set of the Changelog, and when we release new shows how popular that project might get.

That would be cool.

Yes, so we can see all these things moving around the world, the pulse of it and how each one affects the other - Reddit comments, Hacker News comments, the Wikipedia page views, and you can see the real effect on code, on what will be happening on GitHub code, on the stars, on how things start spreading around, and the ability to link these data sets… To add weather, like “Do people code more under good or bad weather?”

[01:04:10.05] Right. Let’s extend that a bit then. Another question we have for you, and this is more for all of you, this is not just to you Felipe, but keying off of this topic here, what would you like the community to do as a result of this? You have some pure love for cross-examining data sets, things like that, and as you can hear there’s a crazy storm here in Houston… You heard that lighting there. The hatches are being battened down now… My wife, she’s out there taking care of it. I gotta go join her soon, so maybe this show will end eventually, but in between now and then, what would you like the community to do? You got the listening ear of the open source world, hearing you guys talk about this stuff now, all these data sets being available… Well, maybe at some other point you could talk about some other data sets that might become to play here as well to fuel this fire, but what are your dreams for this? What do you want the community to do with it?

I’ll go. One of my favorite projects that uses GitHub data, you know open source data from GitHub, is libraries.io and I know you had Andrew on a few episodes ago. So I think there’s still a huge opportunity to lower the barrier to entry to people in open source. I think part of that is maybe product changes and improvements and changes to GitHub. You know, there’s like really interesting projects out there, like first pull request up for grabs, low-hanging fruits that are easy for the community to work on.

I’m convinced that there is in this data set the answers to questions like what makes a welcoming project for people to come and work together. We’ve got everything that everybody has ever said to each other, and all of the code that has been written, you can run static analysis tools on like code to look at the quality of that code, maybe how approachable it is.

There’s just a missing piece right now that if I am a twenty-something CS graduate and I can program like crazy but I have never participated in open source - and there are lots of these people - or maybe I’m just somebody who’s just got my first computer and I’ve heard about open source and I wanna get stuck in, I think there is a missing piece right now, that we are not connecting always this sort of supply in terms of the talent that’s out in the word with the opportunity of projects. Everyone wants more contributors, everybody wants people helping to build software with them together, so I’m really excited to see what the community are gonna do around those topics.

If you think about what Andrew has done with libraries, I think that is a really good example of stepping in that direction; but this makes really kind of richer, more intelligent kind of uses of that data for strengthening the open source ecosystem. That’s where I think the big opportunities are. And I think that ideas are free. There is money to be made doing that. If somebody wants to go and build a companies that solves that problem… I think that’s a generally interesting problem to solve.

Yeah. Lots of ideas come to mind for me on that. But on the note of Andrew, I think with libraries Andrew is actually querying GitHub’s API directly, so in this case he can actually go to BigQuery and get the same data maybe faster. He may have to pay a little bit for it, but he may not have to hit rate limits or things like that, or just actually have a much richer ability to ask questions of GitHub vs the API.

[01:07:59.06] Exactly, yeah.

Cool. Felipe, what about on your side? Any dreams?

For me I like comparing this with the story of Google. Google for me is the biggest company built on data. Basically you need data tools ideas. Data for Google was collecting the whole world wide web at that moment. Collecting it was not easy, but you needed the tools to store it, analyze it, and then you needed ideas. A lot of companies at that time, there were a lot of companies that had all this data, had a copy of the web, a mirror of the web inside their servers, but the ideas that Google had, like “Let’s do pagerank, let’s look at the links between pages to rank our searches”, that was huge.

I’m looking at the same right now with this and other data sets. We have the tooling. Tooling might be BigQuery. BigQuery gives you the ability to analyze all of this, but you can create tools above this and looking for other ways to see more static code analysis that will run inside BigQuery. You need ideas - that’s what I’m looking for the world to bring new ideas, new ways to look at this data that we are making available. And I’m looking out for data. We are making a lot of data available on BigQuery, and I would love people to share more. That’s why we have Will here also… If you have an open data set, if you want to share data, instead of just leaving a file there for people to download and take hours to download and then analyze on their computers etc, if you share it on BigQuery, then you make it immediately available for anyone to analyze, and then to join with other data sets. So for for me that’s…

Well since you mentioned Will… Will, there’s definitely one subject that I wanted save closer to the end, which is talking to you about the data sets that you’re… I mean, this is mostly around the partnership with GitHub and this data set, but what other data sets, as Felipe mentioned, what do you have your eyes on, what hopes do you have there?

Yeah. Well, what I’m focused on right now is trying to get data sets that address that accessibility issue I was telling you about earlier. A lot of the data.gov stuff like Medicare data, census data, some of the climate data… And what I find interesting about it that this data has been collected for decades, and so the schemas around this data were designed well before we even thought about big data challenges, much less just early even SQL… It’s like pre-noSQL challenges, right? We are talking prior to the ‘70s. So the challenge here is like taking a lot of this data that is coded, it’s truncated because at the time there were limitations on characters and everything else, and so is giving all that coded data which is technically available for download by the public but not usable, we are planning on onboarding some of the data from the government catalogs, like the census data, health data, Medicare data, patent data from both the US and Europe, and then some of the more weather-related data.

It’s a big challenge, because a lot of this data is decades old and was designed at a time before there was even SQL or big data, and so it’s heavily coded. The challenge here is to decode that data, which requires resources, and then structuring it in a way that it fits well into BigQuery, and then Felipe can take it from there to the community and construct all sorts of interesting queries and address that accessibility challenge that I was talking about earlier.

[01:11:59.02] Yeah, something that… I just told you guys I was gonna close with, but I actually wanna throw one tiny curve ball in here. It just occured to me during this show that as we were talking about the code insight, so to speak, the insights that comes from being able to have such deep querying into not just the events, but also the actual code and different files that are gonna be stored as a part of this… There’s obviously some motivation on GitHub’s side to do this, so Arfon, feel free to throw a mention here on this, but I’m kind of curious to all three of you, just whoever wants to share something about this, but I’m curious about how this opens the door for other code stores, stores from back in the day that are still kicking around, I’m not sure what their status is… You’ve got BitBucket, you got GitLab… Obviously, having this kind of insight is interesting, so does this open up the door for other stores? Is this something that’s a motivation for everyone to do that kind of thing?

Yeah, I’ll take a stab at that. I actually think that the open source software, wherever it is, is hugely valuable, and I would love to see more open source software available in a similar way to the way we are releasing this data today with Google. I think the more the better, as far as I’m concerned. You know, if we were ten years ago when a lot of open source activity was happening on SourceForge, and there is still stuff out there that is used and is still incredibly important, and of course people on Bitbucket and GitLab, and other hosts as well. So I would love to see more vendors participating in archiving efforts like this.

I think there is more to be done than simply just depositing data. I think there is also this sort of… We have the way that our API works. Bitbucket has its API, GitLab has its API. There’s differences between all these different platforms, even if maybe many of them are using Git or Mercurial as kind of a base level for the code. So I think there are actually really big opportunities to standardize some of the ways in which we kind of describe the sort of data structures that represent not only code, but all of the kind of pieces around it - the community interactions, the comments, the pull requests, all of these things.

I’m aware of a few community efforts. There’s one called Software Heritage, there’s one called FLOSSmole, where they have got for example all of RubyGems stuff in there and a whole bunch of SourceForge data. I’ve talked today about some of the things about empowering the research communities around these datasets. I think one of the issues with doing that right now is, you know, I spend most of my time thinking about GitHub, the data that GitHub hosts, but of course that isn’t all of open source, and I think making sure that it’s possible for all of software to be studied I think is gonna be really important going forward. There are a bunch of opportunities there about improving platform interoperability, but I don’t think many people are talking about it right now, and I would love to see some advancement in that, because I think it’s good for the ecosystem at large.

Yeah. I would like to highlight also the technical side. There is a big technical problem, and the question here is are we able to host all of GitHub open source code in one place and then analyze it in seconds? Well, we just proved that we can, so let’s just keep bringing data in, let’s keep furthering the limit. But yes, technically we can solve this problem today.

[01:15:54.23] That’s a good thing. I mean obviously Will, with your help and Felipe, your abilities to lead this effort and Arfon, your efforts on the GitHub side of things to be open to this… I think part of this part of this show is one, sharing this announcement, but two, opening up an invitation to the developers out there to the people out there that are doing all this awesome open source and dreaming about all this awesome open source, having this invitation to bring their company’s datasets, if there is open data out there, to BigQuery.

Will, what’s the first step for something like that? You said that that’s an open door. Obviously, if ten thousand people walk through the door at once it’s not a good thing, because you may not be able to handle it all, but what’s the process for someone to reach out? What’s the process to share this open data?

They can contact us, and I’m trying to pull up just so I can get the… It’s on the cloud.google.com site, under our data set page. They can contact us. Where is that email? I will give that email to you so you can put it in your accompanying doc, but I would also encourage them to reach out to Felipe on Reddit on the Medium post and just get a hold of either of us that way.

We’ll have that Medium post in the show notes, so if you’ve got your app up…

I just got it. It’s bq-public-data@google.com.

Yes. I would like to add that on the technical, if tomorrow ten thousand people want to open data sets on BigQuery, that’s completely possible. Anyone can just go and load it on BigQuery and then make it public. What we are offering with this program is support to have your data set publicized, shown, taking care of paying for the hosting price, but you can just go and do it yourself. Working with us is cool, but you don’t need to go through a manual process, you can go and do it.

That’s an excellent point, and to be clear, you can upload your data and then put ackles on it to make it public, and then anybody that queries that data, you are not gonna be charged for their queries.

Gotcha. That’s good then. So you can mainly do it if you have a big data set and you want some extra handholding, so to speak. So email the email you’ve mentioned. We’ll also copy that down and put it in the show notes, but it’s possible to do that on your own, as you mentioned, through the BigQuery interface and making it public and not being charged. That’s a good thing.

Let’s wrap up because I know I had a storm. We had a quick break there because of the storm, and my internet outage for about five minutes, so thanks for bearing with that and listening on. You probably didn’t even hear it because we do a decent job of editing the show and making things seamless when it comes to breaks like that. This is time for some closing thoughts, so I’ll open it up to everyone, whomever wants to take it, just some closing thoughts on some general things we’ve talked about here today. Anything else you wanna mention to the listening audience about what’s happening here?

Alright, I’ll go. I’m incredibly excited to see this data out in the public. I think we talked a lot today about public data, sort of open data, but also useful data, useable data and I think this is the first time that we have been able to query all of GitHub, and I think that’s an incredible opportunity for studying how people build software, understanding what it means for projects to be successful. Honestly, I think, the most exciting thing for me about this is that data is now available. It’s out there and I think the possibilities are near limitless. I can’t wait out see what the community does with this dataset.

[01:20:02.27] Well Felipe, anything to add to close?

I would love to add for anyone analyzing data - it doesn’t need to be open data; I love open data, but anyone that’s analyzing data today that is suffering, waiting for hours to get results, having a team managing a cluster, maybe sit in a cluster overnight, try BigQuery. Things can be really fast, really simple, and that will open up your time to do way more awesome things.

Awesome. I can definitely say that we have enjoying BigQuery… But go ahead, Will, you had something to add?

I just wanted to add to what both Arfon and Felipe were saying around communities - what I am really looking forward to is seeing the community participate in developing interesting queries, and I’m sure there are datasets out there that are interesting that I’m not aware of, and I would love to hear about those and try to get those more accessible.

One more curveball here, at the end of the show. It occurred to me too during this show, over the years of the Changelog we’ve had a blog, we’ve had this podcast, we’ve got an email, and we have talked several times about open data, public data, being open sourced on GitHub, and it now occurs to me that all of that effort can now be imported, either by way of GitHub, but also just directly into BigQuery.

So if you are out there and you have got a data set you’ve open sourced on GitHub, go ahead and go to BigQuery. Put it there, make it public there. That way people can actually leverage it, because I can’t even count on my hands how many times we’ve covered open data in all of the ways we’ve talked about on the show today, but that seems like, you know… Putting it on GitHub is great, but then making it useful, not that GitHub isn’t useful, but making it useful, is putting it on BigQuery and opening it for everybody. That to me seems like the cherry on top.

Obviously we’ve got a couple of links we’re going to add to the show notes. We’ve got this announcement, obviously, between this partnership and the GitHub data set being available in this new way. The blog post being out there, we’ll link those up, so check the show’s notes for that.

I just wanna say thanks to the three of you for one, your efforts in this mission and caring so much, but then two, working with us to do this podcast and sharing the details behind this announcement, because we’re definitely timing this, the release of this show, for all the listeners, right around, if not the same day, the same timeframe or maybe the day after. There’s been a couple of posts already shared out there, so I’m not sure exactly on perfect timing, but we are aiming for this to be right around the same time, so the announcement at CodeCon for GitHub. We’re trying to work together to go deep on this announcement, share the deeper story here and obviously get people excited about this. I wanna thank you for working with us on that. It’s an honor to work with you guys, like this. That’s really all we wanted to cover today.

Listeners, thank you so much for tuning in, check the show notes for all the details we talked about in this show. Fellas that’s it. Let’s say goodbye.

Alright. Thanks very much. It’s been really fun to talk in depth about the project, so thanks for having me on.

Thank you very much. I loved being here, I loved being able to connect to everyone here at the Changelog.

Thanks for having me here as well. It’s been a good conversation.

With that, thanks listeners. Bye!

Changelog

Our transcripts are open source on GitHub. Improvements are welcome. 💚

View all episodes

Player art