IPFS (InterPlanetary File System) with Juan Benet (Changelog Interviews #204)


Juan Benet joined the show to talk about IPFS (InterPlanetary File System), a peer-to-peer hypermedia protocol to make the web faster, safer, and more open — addressed by content and identities. We talked about what it is, how it works, how it can be used, and how it just might save the future of the web.


74 minutes
Recorded May 21, 2016
Published May 21, 2016

Featuring

Juan Benet
Adam Stacoviak
Jerod Santo



Transcript


Alright, a fun show today, we have got Juan Benet on the show. Interplanetary, Jerod. We almost wanted to open this show with a fun song. This is a topic you brought up, IPFS. Why was this on your radar?

Well, I think first of all it stands for the Interplanetary File System.

Right.

Great name.

Right.

It catches you right there. You know, a permanent web, just kind of an audacious goal. It seemed cool, it seemed kind of tantalizing and yet I didn’t get it exactly. So, just very interesting.

I think, Juan, you may have missed just slightly on the name because we would have gone with Intergalactic File System. Then you could have hopped on to Beastie Boys chain and have intergalactic file system - file system intergalactic… [laughter] But interplanetary just doesn’t quite fit right…

Yeah.

Do you feel like that was a missed opportunity?

Yeah, definitely. You know it’s funny that you mention it, because Intergalactic actually is technically a better name for the original purpose of the name. The name comes from… It’s an homage to J. C. R. Licklider who came up with the concept of the internet. The internet, believe it or not, actually stands for the intergalactic network. That’s what the internet stands for. The IPFS is meant to be the file system for the intergalactic network, and yeah, Intergalactic File System might have been a better name. The original name was GFS (Galactic File System) but then that clashed with a whole bunch of file systems called GFS.

You guys have a pretty good name out there, people are interested, but you know, it might not be too late if you wanna hop on that. I don’t know if IGFS.io is available, but worth checking into. I guess enough about that. Let’s get to know you a little bit. We like to hear about the origin stories of not just the projects that come across our radar and come on the show, but the people that are bringing us those projects and why it is that you are, you know, somebody who is involved with IPFS, and where do you come from to get to where you are here today. Can you give us your origin story and tell us where you are coming from?

Yeah. The origin story! I don’t even know where to begin. I think, perhaps the most relevant thing to mention is that I pretty much grew up in the internet, so most of my thought has been learning things through Wikipedia and learning things through books online and all that kind of stuff. Certainly, of course, I went to school and all that kind of stuff, but I very much am a product of the internet generation. I tend to think about the world of bits, often more than the world of atoms. I, for a long time, have been very interested in how information moves around the network, how distributed systems work, how to make information more reliable and usable to humans, and really I have come to look at programming as the ability to create superpowers. Not just to have a superpower as a programmer, but to also be able to create superpowers and gift them to other people. Like when you write an application, you are really creating something that becomes this powerful thing, kind of like a magical item that you then give out to other people. You can give it not only to individuals, but you can give it to billions of people on the planet all at once, and that’s huge, right?

That’s huge, yeah.

If you think about the people making Wikipedia and how much of a valuable contribution they made to humanity… And that’s, you know, a superpower that you can give out to everyone. I tend to think about that kind of stuff, how knowledge grows, how we can build better, and how we can make these superpowers more resilient. How can we make sure that when we give out the superpower, you are not accidentally making people depend on something that may go away?

More concretely and more grounded, I studied distributed systems, I studied computer science, a lot of both theoretical and applied work - not just building applications, but also thinking about them more deeply, but also not just lost in abstractions. Having to build something that is useable to regular people helps you translate really good ideas from research all the way down to something that is valuable and usable to average people on the internet that may not even care about the underlying things. At the end of the day most people, when they use the internet or the web, they are not thinking about how information moves, they are just manipulating – they’re pressing buttons on their computers and clicking on things on the web and learning how to use those interfaces, and so giving people good metaphors for manipulating digital objects is a big part of the whole thing. How can you make contributions that are good theoretically and good from where distributed systems theory is going, but also expose the way to manipulate and create value directly to the user in an understandable way.

This shows something in interesting little bits and pieces of the interfaces, for example, how mail clients will operate, when they refresh, when they download new mail, when you know that a mail is being sent, when you have confirmation that somebody has to write something, read receipts are a very interesting little thing that… It’s actually a very nice distributed systems problem that can help change how people communicate.

You say you grew up in the age of the internet. To me, I get that, but I don’t get that because I am 37 and I didn’t grow up in the age of the internet. Having the thought process that you just shared, you had to get that from somewhere, so I am kind of curious… When we have people on the show we are always interested to find out what it was that got them into programming. What hooked them? Sometimes it’s games, sometimes it’s cheating at a game, sometimes it’s doing better at math… Who knows what – but something got you into software. What was that for you?

Definitely games. I was born and grew up in Mexico and I moved to the US when I was 15, and I was playing video games from an early age. Lots of RPGs, for example, and I got really interested in making games. Also, I think the direct reason I learned how to program was that I was part of like an online guild, I think it was Starcraft and Warcraft. We needed somebody to create a website, so I was like “Fine, I guess. How hard could this be? I will figure it out.” That just exposed me to making websites and programming, and that was like the opening of the rabbit hole. I think I must have been 12 or 13 at the time, I don’t know. I was pretty young. Didn’t start as early as some of the other people out there did, and for a long time I was just kind of looking up things, copy pasting, not really understanding what I was doing; a lot of trial and error. Kind of like the early version of Stack Overflow programming, but over time I started… It wasn’t, I guess, until I went through college that I got a really good grounding from a theory perspective of how computation actually works, what’s really valuable and useful and good ways of thinking about it and so on. I think it was hugely valuable to have formal training and understanding. I think you can definitely self-teach a lot of programming and how to make applications and all that kind of stuff, but to really understand the deep ways in which these applications behave or how large systems scale and all that kind of stuff, it is very useful to have a formal grounding. That doesn’t mean go to school or anything, it just means that you can read a textbook, you can read… The point is to study and I think most people don’t get - at least when I was learning - that wasn’t as accessible on the internet. I think it’s changed, I think there are a lot of really nice tutorials now and things like edX or Coursera that do give the experience of a more theoretical class.

The distribution mechanism of how we educate around software is changing or is fluid, but the education itself is still just as important as it ever has been, especially if we don’t want to be doomed to repeat the failures of the past, which tends to happen when you don’t know about the past. So you got the education, you were interested in computer science and you learned the underpinnings, so to speak, and now here you are leading a group of people, coding this new hypermedia distribution protocol. Can you tell us about IPFS - where the idea came from, how it started, the genesis story of this project?

Yeah, so the genesis story is a bit long… Well, not necessarily long, but there are a lot of things that came together. On the one hand, I was always interested in distributed systems, that was my focus when I was in school. I was very interested in peer-to-peer systems. I was always very interested in multiplayer games and things like BitTorrent and how you could build very nicely scalable systems by sharing the resources and bandwidth of different peers in the network. And an annoying thing about studying networking in a university was that they did mention things like BitTorrent and Skype and so on - that definitely came up, but it came up in a very cursory level; we kind of just discussed it a bit. We didn’t really take up all of the improvements that were brought through those technologies into consideration as much, and it took me a while to understand why. The reality is that a lot of these systems are kind of special purpose. The contributions are pretty specific, and to get something working really well for that one use case, but it doesn’t translate to nice libraries that people can use for a bunch of other stuff. You actually have to work a lot harder to get that working. You make nice interfaces and nice libraries for a much more general set of use cases, which is what people like teaching, or it makes it relevant to teaching and relevant to apply in a broader context. You would have to work a lot harder for that.

Anyways so that’s one avenue. Another avenue was that I wanted to… I was always pretty dissatisfied with how the web worked in terms of, you know, this notion that I have to host a web server somewhere, even to do something as basic as just transmit a set of files. I was like: “Why can’t I just publish this data, and as long as people are interested in resharing it, have it work on the browser just fine and not having to host my own web server?” That’s another thing.

I was interested in BitTorrent-like use cases for caching and distribution of content. I was actually pretty excited - we were mentioning the Warcraft and Starcraft earlier - Blizzard was actually one of the only companies to use BitTorrent in a meaningful way, in their distribution, at least publicly. There might have been others that did as well… But it would help solve big problems with their updates. I remember the days when they had all their distributions be centralized and, you know, downloading a patch for a game took forever. It was also partly the modems that people had, but also just their servers were pegged. So once they moved to BitTorrent it worked a lot better and faster and much nicer, and that’s sort of an example to prove that even when you are a large company and have a lot of money and so on, you can still gain a lot of value from peer-to-peer distribution systems. That was a nice example, right?

Skype was another one for me that really was sort of a fantastic, shining example of the value you get by helping interface and network people in the world, but you are not really an intermediary that they are piping all communications through. I think nowadays Skype does intermediate all of your communications, but that’s a whole separate story. I think it has more to do now with the difficulty in connecting people peer-to-peer. It actually is pretty hard to open a pipe from one computer to another without having intermediaries. There are a whole bunch of problems, like NAT traversal and so on.

That was actually another avenue of this - I was really frustrated with how hard it became to program distributed systems simply because the network was not as nice as IP gave us. So IP gave us this really, really nice network where everything was addressable, any computer should be able to talk to any other computer and then NATs and mobile phone networks and a whole bunch of other things came in to ruin the party. They made it pretty difficult to open a connection from one processor to another. Also browsers, right? Like you can’t open a socket from a tab. That’s, of course, a big important security feature, but there are many cases where nowadays applications on the web probably should be able to dial out to anything else. You know, I think the model changes. I think the computational platform of today is more about – the boundary between the browsers and the OS is always shrinking, and I think at some point we would want to be able to make that possible.

Anyway, all of these things were brewing in my mind. I guess another strong influence was I did a lot of studying of different kinds of distributed file systems. These are things like Plan 9, for example, which came out of Bell Labs and had a fantastic set of file systems. It had 9P, which is a really cool protocol for modeling resources in the network - different pieces in the file system, you use the same path name to do everything. Venti and Fossil were two examples of file systems. SFS was another file system that was a huge influence. There were a lot of them, and they were all pretty interesting. I was always a little bit annoyed with the divide between file systems and the web. To me it would be really, really nice to drop into the terminal and be able to just manipulate the web directly, so mounting. We tend to use Wget, and Chrome, and so on, but imagine that the web was just a directory in your file system and you could browse through it and read and write through it.

Yes. I think, zooming out a little bit, it’s easy to have maybe that perspective now, especially someone like you who grew up in the web, whereas Jerod and I are a bit more of a dinosaur compared to you. I would say that we didn’t grow up in the web. We are older of course but, you know, we grew up in the age where you joined the web… Like the nodes began to trickle in, so to speak, and the web grew and grew and grew, and now it’s this big thing, and so now it can be easy to look at it now and say, “Okay, the network is already there. Here’s how we make it better versus where it came from, which was small and it got big.” So I can see how you can look at it and say, “Here it is, let’s make it better now that it exists”, but you had to build to the point where – putting a file server onto the web and stuff like that. You sort of had to stake your claim or put your flag down, so to speak.

Right, I think it’s like a matter of perspective in our generation, which is probably just one up from yours. Not dinosaur level.

Okay, sorry. My bad. [Laughter]

It’s like you saw it come from nothing to what it has become, and so we have seen that change, but we are not “web native” in terms of growing up inside of what it already was. And so from your perspective, I don’t know… I don’t wanna say it’s always been there, but you natively understood the web and so you’re seeing how it could be so much better, whereas from our perspective it has already gotten so much better from nothing. So it seems like sometimes it takes the next generation to reinvent things.

To point out the problems.

Yeah. Anyways… So you decided to create IPFS. Can you give the quick high-level elevator pitch? We’re gonna dive deep into it after the break - how it works, the problems you’re trying to solve and all that good stuff, but if you had to do like a 30-second “This is what IPFS is”, what would you say?

The IPFS is a new way of moving around content on the network. It’s a protocol with the goal to upgrade the web and make digital information have more permanence, be able to work offline more, be decentralized and move around faster in general, so use as much of the power of the network as possible and change where the points of failure and points of control are. There is a lot wrapped into it. At the end of the day, it’s just software. It’s just a new protocol for how computer programs should exchange data, so it’s like HTTP in that way. But it’s a very different design that borrows a lot of great ideas from other distributed file systems and version-control systems like Git. It models all content as content that’s linked through content addressing and hashes, and uses that as a way of getting much better security properties and a much better distribution model. There is a lot wrapped into that. At the end of the day, it’s about making the web better, making the web faster, safer and more secure. That sounds really nice in the high level, but it’s how it’s done in the details where the IPFS really shines.

Let’s take a quick break and then we will dive into how it shines. You mentioned that it exists to make it faster, safer and more open. In the context of how IPFS works, I think we should keep those three things in mind and maybe as you tell us the different aspects of the protocol - and I guess that protocol is the right word for it - why this is faster, why this is safer, why this is more open than what we are currently using. But before we go to the break, just from the networking level, where does this fit in? Is it at the IP layer? Is it above IP, at the application layer? Where does it replace?

It is above IP, and it is below the application layer. It complements and potentially replaces HTTP. Think of it as a different protocol for web browsers and applications to use to communicate with each other. It doesn’t exactly fit in terms of the OSI nice network layering model; the actual layering is much more complicated than networking groups would let on, but it fits there, it’s replacing the HTTP layer.

I think that it helps just for all of us to be on the same kind of framework of where we see this fitting into how computers communicate, so at the end it’s very helpful. Alright let’s take that break and we’ll talk about how it works in just a minute.

* * *

Alright, we are back and we are talking about the Interplanetary File System, which by the way is still fun to say.

I love saying that, so awesome.

I’m gonna keep saying it. We all know Adam’s a big space fan, so I’m sure you’re all on board for this name.

Oh Yeah. Totally, dude.

But it is a mouthful. So IPFS - its goals are to change the way we communicate with our computers using peer-to-peer distribution protocol, aiming to make the web faster, safer and more open. Juan, you said that the way it shines is really in the details of how it works. Sounds like you have a lot of education with regard to past file systems, even current file systems, as well as networking protocols, and so you have put together this gem which people are getting quite excited about - we’ll talk about that real soon, but can you open it up for us and kind of give us a look inside IPFS? Give us an insider look of how it is all put together and why it is faster, safer and more open.

The core principle underlying IPFS is to model data and link data using causal linking. This is an idea that goes way back to people like Leslie Lamport and others in distributed systems that really had a good framing for how to move around data. But more recently, I think distributed version control systems like Git and Mercurial and so on proved to us how valuable it is to model data this way. They weren’t the first systems to do it, there were others before, but I think they were certainly the most widely used. The same fundamental property that underlies Git is the same property that underlies things like Bitcoin: the idea of linking objects using hashes. This is causal linking, meaning that one object is ordered after the other. You can say that when you link something by cryptographic hash, the object that is linking to another has to always come after the other; it orders them in time.

The other piece of it is that by using cryptographic hashes you can verify the content. So if I have a link to an object or a file and that link has a cryptographic hash, it means that I can find that file anywhere. I don’t have to go and ask any specific location or authority for the file; anyone can serve me that file. And I can check that it is the right file because I can hash it and I can verify the hashes match. That is an organizing principle for the entire file system that you can build on top of it.
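The fetch-and-verify idea described here can be sketched in a few lines of Python. This is only an illustration, not IPFS code: the hex SHA-256 address format, the `content_address` and `fetch_verified` helpers, and the modeling of peers as plain dictionaries are all assumptions made for the example.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return the hex SHA-256 digest, used here as the content's address."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(address: str, peers: list) -> bytes:
    """Ask each peer in turn; accept the first answer whose hash matches."""
    for peer in peers:
        data = peer.get(address)
        if data is not None and content_address(data) == address:
            return data
    raise LookupError("no peer returned content matching " + address)


original = b"hello interplanetary world"
addr = content_address(original)

# Peers are modeled as plain dicts (hypothetical); one serves tampered bytes.
dishonest_peer = {addr: b"tampered bytes"}
honest_peer = {addr: original}

result = fetch_verified(addr, [dishonest_peer, honest_peer])
```

Because the address is derived from the bytes themselves, it does not matter which peer answers: a wrong answer is simply rejected and the next peer is tried.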

The kernel on the inside of something like IPFS - and other systems not just IPFS - is that if you center on this as the main way to model your data and link the data, then you can make a lot of problems easier. You can easily reason about what content came before what other content; you can easily reason about making sure that the content is correct and valid, you can authenticate the content, you can verify that it is correct and you are free to now accept it from anyone in the network. You no longer have to go to specific web servers. You can really get it from any other computer. You can also not have to be connected to the internet, actually. You can be in a different network that is separate and using and manipulating the exact same set of links.

The underlying principle of linking something by hash - we call it Merkle linking, and it comes from Merkle trees. Merkle trees are a data structure that was invented by Ralph Merkle, a very eminent cryptographer. Ralph Merkle has done a lot of other amazing things. Perhaps his most famous contribution was called Merkle puzzles, that proved that you could establish secure communications with each other in the clear. This was before public/private key pairs and so it was a big, important contribution. This idea of Merkle linking through Merkle trees stayed buried in the cryptography community and the low-level systems community for a long time, probably because it was patented, I think people were more reluctant to use it. But I think the patent has expired since and then it began to be used all over the place, in systems like Git and so on.

This is what gives rise to the nice distributed systems properties. When you think about a Git object, you have a SHA-1 hash that you can use to address the commit or address the file or the directory and whatever, and you no longer have to trust the network to provide the correct content to you. You can reason about the history. You can even find out about the server and find out that it has been compromised because it is serving you some other completely different history. Or maybe it has not been compromised, but people did rewrite history and you can tell that that is happening and you can be selective about the changes that you take in.

So that fundamental property, which again, to restate it, you’re just linking objects with hashes. You are embedding into one object the hash of the other. This gives you a way to tie up content causally. So if one object gets updated, then all of the links to it have to change, and so on. And this gives you the ability to verify and validate the content and to also content-address it. There is another leap there, which is you have to also consider that these hashes may not just be a good way to verify the content, you can also use them as a way to address the content in the links themselves. You can put it in a file system or an address bar or something, and ask for something by hash. This is also an old idea. It has been used in many systems.
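A small Python sketch can make the "embed the hash of one object into another" idea concrete. The JSON object layout, the `put` helper, and the in-memory `store` are assumptions for illustration only; they are not the format Git or IPFS actually uses.

```python
import hashlib
import json


def put(store: dict, obj: dict) -> str:
    """Serialize obj deterministically, store it under its hash, return the hash."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(encoded).hexdigest()
    store[key] = encoded
    return key


store = {}

# A "file" object, and a "directory" object that links to it by hash.
file_v1 = put(store, {"data": "first draft"})
dir_v1 = put(store, {"links": {"notes.txt": file_v1}})

# Updating the file yields a new hash, so the directory that links to it
# has to be rewritten too, and its own hash changes as a result.
file_v2 = put(store, {"data": "second draft"})
dir_v2 = put(store, {"links": {"notes.txt": file_v2}})
```

The old objects remain in the store under their old hashes, which is why a link by hash also fixes an ordering: the directory could only be written after the file it points to existed.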

But by using these simple abstractions and piecing them well together, you can build a distributed system - a distributed information system if you will - that can move around content in a much safer way, because you can verify all of it. It’s faster because you can often times check caches that are local to you - it could be in the same machine, it could be in a machine close to you physically or it could be in the network that you are in, not even having to talk to the internet backbone and so on. It just makes information distribution faster, and allows you to reuse the bandwidth of all your peers. You no longer have to trust others, you can ask them for something and you can verify that they are giving you the right content. All of this falls out of the fundamental idea of Merkle linking.

You also say here on the website it combines the distributed hash table that you are talking about, with incentivized block exchange - which I would like you to kind of unpack for us - and a self-certifying namespace. So let’s start with incentivized block exchange. What does that mean?

Yeah. This is a concept that comes from BitTorrent. One of the improvements of BitTorrent over previous systems was that it modeled data distribution as an incentivized exchange. This means that if you have a bunch of people trying to download a torrent, then it is better for the swarm if people exchange pieces of content that each other needs. This is usually referred to as the tit-for-tat model. It’s not a perfectly modeled tit-for-tat, if you ask people, in theory; the incentive structure is a little bit different and there have been better proposals since then. But the basic idea is you say, “Hey, there’s a lot of peers in the network that have content, and anybody can provide the content to you.” Select between those peers that are likely to give the content to you, and that becomes more likely if there is an incentive structure there, meaning that if I have pieces of the file or I have pieces of other files that you are interested in, we can exchange those. That way you align the network so that you share the bandwidth resources. So instead of just supporting leeches that are only downloading and not contributing to the network, you get the distribution to serve. In a sense not only does it pay for itself, but you help load-balance the distribution.

This isn’t perfect, because there are a lot of models where you really just wanna transmit data out and you don’t really care about people helping share it or other cases where maybe it’s something really big and the people that are distributing it actually wanna charge money for it, or something. This is something that we took into account when we designed the protocol called Bitswap, which is a sub-protocol of IPFS, and this is what we call the block exchange. It models the idea of distribution as kind of like a data barter system where I give data to you, you give data to me. I take into account how much data you have given me in the past and it makes me more likely to want to give you stuff in the future if you have also given me stuff as well. If our data sharing relationship is profitable, then I’m more likely to give you stuff in the future. There is a whole bunch of other cases where, you know, maybe I am new to the network, people should give me content. Or maybe I don’t really have anything that people are interested in, but you still have to take that into account. Here the standard HTTP model of “I am just gonna distribute content” also works, where you can default back to that kind of thing. It’s meant more or less as an optimization of the network than a hard and fast rule that you force networks to always distribute stuff. There will always be leeches in the network that you have to take into account, so it’s like you are somewhere in between.
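The barter idea can be sketched as a per-peer ledger. To be clear, this is not the actual Bitswap strategy: the `Ledger` class, the `should_send` rule, and the `max_ratio` and `grace_bytes` thresholds are hypothetical values chosen only to show how "newcomers get served, persistent leeches get deprioritized" might be expressed.

```python
from dataclasses import dataclass


@dataclass
class Ledger:
    """Bytes exchanged with one peer, from our point of view."""
    bytes_sent: int = 0
    bytes_received: int = 0

    def debt_ratio(self) -> float:
        # How much more we have given this peer than it has given us.
        # The +1 avoids division by zero for a brand-new peer.
        return self.bytes_sent / (self.bytes_received + 1)


def should_send(ledger: Ledger, max_ratio: float = 4.0,
                grace_bytes: int = 1_000_000) -> bool:
    """Serve newcomers up to grace_bytes, then require some reciprocity."""
    if ledger.bytes_sent < grace_bytes:
        return True
    return ledger.debt_ratio() <= max_ratio


newcomer = Ledger()
good_partner = Ledger(bytes_sent=5_000_000, bytes_received=4_000_000)
leech = Ledger(bytes_sent=5_000_000, bytes_received=0)
```

Setting `max_ratio` very high would fall back to the plain "just distribute content" behavior mentioned above, which is one way to read the point that the incentive is an optimization and not a hard rule.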

Another concept in the IPFS is the self-certifying namespace. Can you tell us a little bit about what that is?

A self-certifying namespace comes from an older file system - not that old, it’s the early 2000s - called SFS. That was the Self-certifying File System. The basic idea is that when you think about naming on the network, this is the problem of assigning an identifier to some resource or content that may change over time. Something like foo.com points to an IP address, and if I change that pointer to point to something else, how do you know that it was me who did that and not somebody else? DNS employs some amount of security in terms of only allowing certain people to update records. There are also problems around security of how those records move and stuff, but there is a good amount of security there where it’s not like you can – if I own foo.com, then you can’t send records on that, right? That is the basic idea.

There are other naming systems that work in different ways where the way that registration happens and so on - maybe I have a public/private key pair and foo.com is bound to say a public key, and then any record signed by the private key corresponding to that public key can update that pointer.

Self-certifying records or self-certifying file system took it a step further and said, “Hey, wait a second. What if we relax the constraint and say that we don’t need these nice human readable names, and instead we can allow some ugly looking names? What if we just embed the hash of the public key directly into the name itself?” So you can imagine this unreadable name which is just a big long hash, but it’s just the hash of the public key. And that means that there is no need for a centralized authority validating or securing the namespace. It is in a sense a distributed namespace that cryptography assigns. This means that by just generating a public/private key pair I have a name now, and that name is the hash of my public key. It’s not a nice name, you can’t hear it or type it or anything like that.

We tend to think of names as a nice human readable thing, but the value here is that if you relax that constraint then what you get back is you don’t need a centralized namespace, you don’t need to talk to the internet to validate the name. As long as you have the records and they are signed correctly by the corresponding private key, then you are good to use the value. This means that you and I can be in an IPFS network that is separate from the entire internet and I can create a public/private key pair and I say, “Hey, I am gonna update content, and the link that I give you for that content is the hash of my public key”. Then I can continue to publish content there and you can find it, and you can be assured and certified that it was only me that updated that content, and nobody else.
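Here is a minimal sketch of a self-certifying name, using only Python's standard library. Since the standard library has no built-in public-key signatures, the example swaps in a Lamport one-time signature as a stand-in; that choice, and every function name below, is an assumption for illustration. A Lamport key should sign only one message, so a real system would need a reusable signature scheme, and nothing here reflects which key type IPFS itself uses.

```python
import hashlib
import secrets


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def generate_keypair():
    """Lamport one-time key: 256 pairs of secrets; the public key is their hashes."""
    private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    public = [(sha256(a), sha256(b)) for a, b in private]
    return private, public


def message_bits(message: bytes):
    """The 256 bits of the message's SHA-256 digest, most significant bit first."""
    digest = sha256(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]


def sign(private, message: bytes):
    # Reveal one secret per bit of the message digest.
    return [private[i][bit] for i, bit in enumerate(message_bits(message))]


def verify(public, message: bytes, signature) -> bool:
    if len(signature) != 256:
        return False
    return all(sha256(signature[i]) == public[i][bit]
               for i, bit in enumerate(message_bits(message)))


def name_of(public) -> str:
    """The self-certifying name: a hash of the public key itself."""
    flat = b"".join(a + b for a, b in public)
    return hashlib.sha256(flat).hexdigest()


def check_record(name: str, public, value: bytes, signature) -> bool:
    """Accept a record only if the key matches the name and signed the value."""
    return name_of(public) == name and verify(public, value, signature)


private, public = generate_keypair()
name = name_of(public)
value = b"latest content: <hash of the newest version>"
signature = sign(private, value)
```

Everything `check_record` needs arrives with the record itself: the name, the public key, the value, and the signature. No registry or certificate authority is consulted, which is the property being described.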

Another way to think about it is kind of like a Git branch. In Git you have immutable content, which are the objects that are all hash addressed and content addressed. On top of that you have these mutable pointers, the branches, so something like master. Master is a pointer that keeps pointing to the latest head that you want to consider as master, and whenever you commit - when you say ‘git commit’ - you are updating the master pointer to point to the new commit. It is the same idea, this is how we use self-certifying names in IPFS. There are pointers to the latest version of the content and this could be a version history, or it could be just one version of the file, or something - it doesn’t matter, you get to decide what that means. But it gives you mutability; it gives you the ability to change content in a completely decentralized way, where you don’t have to rely on any central authority whatsoever. This is a huge property, it’s a huge win.
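The branch analogy can be sketched as two dictionaries: one for immutable, hash-addressed objects and one for mutable pointers. The `commit` helper and the "parent hash, newline, content" object layout are assumptions made for this illustration; they are not Git's or IPFS's real formats.

```python
import hashlib

objects = {}   # immutable: hash -> bytes
pointers = {}  # mutable: name -> hash of the current head


def commit(name: str, content: bytes) -> str:
    """Store a new version that links to the previous head, then move the pointer."""
    parent = pointers.get(name, "")
    body = parent.encode("ascii") + b"\n" + content
    digest = hashlib.sha256(body).hexdigest()
    objects[digest] = body
    pointers[name] = digest
    return digest


first = commit("master", b"version one")
second = commit("master", b"version two")
```

Only the pointer changes between commits. Both versions stay in `objects` under their own hashes, and the newer one records the hash of the older one, so the history can be walked back from the head.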

You end up giving up on the nice human readable naming, but there are ways to add that back in later. You can add it on top, basically. You map human readability to these non-human-readable names that are self-certified. The reason it is called self-certified is that the name itself has the hash of the public key and that is all you need. If you have the name of the hash and you have the content, you can verify all of it. You do not have to ask any central authority whatsoever for validation. This means that you do not need CAs, you do not need a consensus based naming system like DNS, you don’t need any of that. You can just do naming on your own, peer-to-peer. It’s a huge thing.
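To make the self-certifying idea concrete, here is a hedged Python sketch: the name is the hash of a public key, and a record is accepted only if the key hashes to the name and the signature checks out under that key. A real deployment would use a standard public-key signature scheme. This sketch swaps in a Lamport one-time signature purely so that it runs on the standard library, and every function name in it is hypothetical.

```python
import hashlib
import secrets


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def generate_keypair():
    """Lamport one-time key: 256 pairs of secrets; the public key is their hashes."""
    private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    public = [(sha256(a), sha256(b)) for a, b in private]
    return private, public


def name_of(public) -> str:
    """The self-certifying name is just the hash of the public key."""
    return sha256(b"".join(a + b for a, b in public)).hex()


def message_bits(message: bytes):
    """The 256 bits of the message digest, most significant bit first."""
    digest = sha256(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]


def sign(private, message: bytes):
    """Reveal one secret per digest bit (a Lamport key must sign only once)."""
    return [private[i][bit] for i, bit in enumerate(message_bits(message))]


def verify_record(name: str, public, message: bytes, signature) -> bool:
    """Check that the key matches the name, then check the signature under that key."""
    if name_of(public) != name or len(signature) != 256:
        return False
    return all(
        sha256(signature[i]) == public[i][bit]
        for i, bit in enumerate(message_bits(message))
    )


private, public = generate_keypair()
name = name_of(public)
record = b"latest content: <hash of the newest version>"
signature = sign(private, record)
```

Nothing in `verify_record` contacts a server or an authority. The name, the public key, the record and the signature are enough.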

This concept shows up all over the place. Lots of systems use self-certified naming. They don’t tend to credit it that much, and they don’t tend to refer to SFS, which was the original place where this showed up. But yeah, that is kind of where the idea came from and it’s hugely valuable, and I think people tend to underestimate how important this piece of IPFS is. There is a lot of challenge in making it scale and making it nicely usable and so on, but it’s an important part.

Well let’s pause here. When we come back we are going to dive into the practical use of IPFS, how that exists. So far you have described what seems to be a bunch of standalone technologies and implementations, data structures, protocols, what have you. We’ll put it all back together and see how you can use IPFS, and then we’ll talk about who’s using it and what they’re building on top of it, because it is a file system, so the point is to build things with it. It’s not really the end goal, right? It’s a piece of infrastructure. We’ll take our break and after that we will discuss those things.

* * *

Okay, Juan. So far you have described to us what seems to be a bunch of interrelated yet separate technologies. Can you bring it all together? How does IPFS work? What’s the software packaging? How do you use it, how do you get started? Tell us all that good stuff, and the actual practical uses of putting this stuff out there and using it.

Yeah, so the architecture fits together in that the core IPFS node - you don’t think about it as a client or server, you think about it as a node or a peer in the network. We are trying to get rid of the client server mentality.

So you have a node, and this node, what it gives you is the ability to add or retrieve objects into the graph. The graph is - think of it kind of like a web, but these objects aren’t HTML, they are kind of like JSON; it’s not actually JSON, it’s CBOR in the wire format. But they are kind of like JSON objects - they can represent files, they can represent web pages, they can represent version histories like Git, whatever, and you get to add objects here. If you add a file to IPFS, there is a whole bunch of tools that you can use around the IPFS nodes. For example, you can have a command line implementation, and so the command line tool can add a file. You can give an IPFS command on your command line that says “ipfs add MyFile.jpeg” or something. So what that does is that it reads the jpeg and chunks it into a graph. This means that it will read the file and split it into a whole bunch of smaller pieces and then construct a graph out of it. You can think of this graph as kind of like the easiest case will be a linked list, but there are some other kinds of abstractions. The graph is a description of the file, and here you can chunk really large files this way, and that helps version things. Then you put all of these objects that are represented in graphs into IPFS, into a local repository. Think of it a little bit like Git. There is some repository that your node can access where it stores the data.
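The chunking step described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how IPFS actually lays out files: the root node is encoded as JSON rather than CBOR, the graph is a single root listing its chunks, the chunk size is tiny, and `put`, `add_file` and `read_file` are invented names.

```python
import hashlib
import json


def put(store: dict, data: bytes) -> str:
    """Store a block under its SHA-256 hash and return the hash."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def add_file(store: dict, data: bytes, chunk_size: int = 4) -> str:
    """Split data into chunks, store each by hash, then store a root node linking them."""
    links = [put(store, data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]
    root = json.dumps({"links": links}).encode()
    return put(store, root)


def read_file(store: dict, root_hash: str) -> bytes:
    """Follow the links in the root node and reassemble the file."""
    root = json.loads(store[root_hash])
    return b"".join(store[link] for link in root["links"])


store = {}
root_hash = add_file(store, b"hello interplanetary world")
```

Because blocks are keyed by hash, identical chunks are stored once, which is part of why chunking helps with versioning large files.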

Once the data is in there, the IPFS node is connected to the network, and that network - I will explain a bit more how it finds the network and so on, but it advertises to the network that it now has this new content added. You don’t transfer that content to anyone until they request it. This is different from what other people might expect about peer-to-peer systems, but the files don’t move unless you explicitly request them. This is an important thing, because it means you are only downloading and accessing stuff that you explicitly request. You don’t have to worry about people adding bad content and it somehow showing up in your node, that is not how that happens.

You can also add files through – this IPFS node can also expose an API; you can expose an API on an end point, and here you can use something like HTTP or you can use something like a socket. You just have some way of communicating with it either by command line or programmatically, and you add content to IPFS. So you chunk it up and you add it and you link it with these hash links.

Now the graph is in your node and other people can access it. So say that I get back a link that I can give to other people, and when I give that link to other people or I place it in an application or something, when those other nodes try to access that link, they connect to the network, they ask the network, “Hey, who has this content?” and they get back a response of a list of peers. In the very beginning there may just be one, and they just contact that peer, your node, and retrieve content from that peer. From then on, when they have content they also advertise to the rest of the network that they can distribute it. There are interesting policy questions there, where you can make that optional. You don’t necessarily have to advertise content to the network, or the way you advertise may be dependent on the use case. Certain applications may want to have their own sub-network so that you are not leaking the content to anyone else. You can also pre-encrypt the content, so nobody… If people will end up seeing the content flowing by or something, they are not… Or they are kind of like crawling or aggregating content, they can’t read it. They just get this encrypted block.
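The add, advertise, look up and fetch flow described above can be modelled with a small in-memory Python sketch. The `Network` class stands in for whatever peer-finding layer is used and `Node` for a peer. Both are hypothetical names, and a real node talks over the network rather than reading another object's dictionary.

```python
import hashlib


class Network:
    """Stand-in for the peer-finding layer: maps a content hash to nodes advertising it."""

    def __init__(self):
        self.providers = {}

    def advertise(self, key, node):
        peers = self.providers.setdefault(key, [])
        if node not in peers:
            peers.append(node)

    def find_providers(self, key):
        return list(self.providers.get(key, []))


class Node:
    def __init__(self, name, network):
        self.name = name
        self.network = network
        self.blocks = {}

    def add(self, data: bytes) -> str:
        """Store content locally and advertise it; nothing is sent to other nodes."""
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        self.network.advertise(key, self)
        return key

    def get(self, key: str) -> bytes:
        """Ask the network who has the content, fetch it, verify it, then re-advertise."""
        if key in self.blocks:
            return self.blocks[key]
        for peer in self.network.find_providers(key):
            data = peer.blocks.get(key)
            if data is not None and hashlib.sha256(data).hexdigest() == key:
                self.blocks[key] = data
                self.network.advertise(key, self)
                return data
        raise KeyError(key)


network = Network()
alice, bob, carol = Node("alice", network), Node("bob", network), Node("carol", network)
link = alice.add(b"bytes of MyFile.jpeg")
held_by_bob_before_request = link in bob.blocks
fetched = bob.get(link)
```

Carol never asked for the content, so it never reaches her node. Bob asked, so he now holds a verified copy and shows up as a second provider.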

That’s sort of how you use it. Think of it a little bit like Git, where you can add content to a repository and now that it’s added, it’s accessible from any other IPFS node that can talk to your IPFS node. You form this peer-to-peer mesh network with everyone, and this is where DHT comes in to help organize how to find the peers and access the content, and all that kind of stuff.

There is a whole bunch of interesting peer-to-peer protocols that can come in here. In reality, IPFS sits on top of a sub-project that we’re writing called libp2p. Think of it kind of like a huge toolbox of interesting peer-to-peer protocols that are useful and valuable in various settings and use cases, and things like local DNS or web socket transport or WebRTC transports and so on, and they’re able to piece these together into a nice connectivity fabric that we like to term like the peer-to-peer network. Your IPFS node just sits on top of them and is able to find other IPFS nodes that have content, and they retrieve that content and now you can serve it. Long story short, the basic idea is from an interface perspective, you add the content to IPFS and once it’s added, it’s only added in the node that you added it to, but then you can move that link or give it to other nodes and they can then pull the content and move it elsewhere, and now it’s distributed to more than one node. And all of those nodes can now help share it. So it’s a little like BitTorrent in that way.

Can you just put it up there and say “Give this to all nodes or any nodes”? Do you have to be specific around which nodes are you going to distribute through?

This is something we are working on and figuring out exactly how to do, because there are many different constraints here. The hard constraints here are that we can’t make it so that you by writing to IPFS somehow get to send content to other people, because that content could be bad. Imagine you have some illegal content of some sort and you add it to IPFS; that content should not automatically be sent to other people, it should just be on your node. And it’s only by other people requesting it that you move it.

Plus you could easily DOS their server if you just fill it with more content than the space they have. It just seems like there are a lot of bad things that can happen that way.

Yes, exactly. It’s kind of like a pull model. What you can do though, is once you add content, you immediately send a message to another node saying, “Hey, I just added this content”, and if you can have some authenticated agreements, like saying, “Hey, please replicate all content that I have.” Think of it a little bit like Git pull and Git push. Most of the functionality is pulling, and pushing has some authentication that needs to be in place. You shouldn’t be allowed to push to any arbitrary node, they have to sort of allow you to do this. Both of these may be your nodes, you just need to make sure that the system knows that that’s possible. So given some authentication, yeah, then you can push objects however you want and distribute them to other nodes, but then they’re sort of available. So think of it kind of like one massive BitTorrent swarm that’s moving around objects in one massive Git repo, and all of the objects there are accessible to your web browser, so that your web browser can directly fetch content from this repository of objects. So you can put images, put web pages, put whatever, and you can now access them all through the browser.
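The rule that pushing needs prior agreement can be sketched as follows. This is a toy Python model with made-up names (`ReplicaNode`, `accept_push`), and it identifies the sender by a plain string. A real system would authenticate the sender cryptographically, and the transcript notes this part of IPFS was still being designed.

```python
import hashlib


class ReplicaNode:
    def __init__(self, name, trusted=()):
        self.name = name
        self.trusted = set(trusted)  # peers this node has agreed to replicate for
        self.blocks = {}

    def accept_push(self, sender: str, key: str, data: bytes) -> bool:
        """Store pushed content only for trusted senders, and only if it matches its hash."""
        if sender not in self.trusted:
            return False
        if hashlib.sha256(data).hexdigest() != key:
            return False
        self.blocks[key] = data
        return True


backup = ReplicaNode("backup", trusted={"laptop"})
payload = b"holiday photos"
payload_key = hashlib.sha256(payload).hexdigest()
```

A push from `"laptop"` is stored. A push from any other sender is refused, so an arbitrary peer cannot fill the node with unwanted content.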

Seems like you make it pretty trivial to build your own private Dropbox, in terms of you just build the authentication around which the computers can act as nodes.

Yeah. It’s authentication and some UI, like user experience stuff.

Yeah, exactly.

We’re playing around with some of that. We’re more interested in the lower level protocol stuff, but there is a file browser thing that we’re making. That’s pretty cool, actually. You can drag and drop files in the browser and it adds them, and you can view them and send them to other people and so on. There are a lot of interesting challenges around sharing links and encryption there that we are working towards. We don’t have all of that stuff in place yet. We’ll be doing that over this year and in the coming months, and so on. Different groups are very interested in this, and so right now we are focusing on getting the perf to be really good, and focusing on the public use cases, but all the private stuff is just around the corner, and just by adding encryption.

You know, I look at this… Anything that’s a file system - whether it’s distributed across all these nodes or if it’s just sitting on my little laptop right here - it’s a building block, it’s a part of a bigger system. So it seems like what you guys want to do is lay a really good foundation and have all these different aspects of things that you want to build on top of it, figure it out so that they’re possible, and then let people go nuts. What are some of the applications that you guys see being built on top of it? I just mentioned an idea of like your own personal Dropbox type thing. One thing that hit our radar recently was this everythingstays.com, which was about immutable and distributed NodeJS modules. It seemed like it was a package registry built on top of IPFS. What are some ways that people are interested in using it, or even possibly using it currently?

IPFS is meant to just interface with the web of today, directly. It’s meant to just kind of rebase the web on top of this better protocol for moving around content. We are doing a whole bunch of work to make sure that IPFS is accessible to people using web browsers today and that web developers don’t have to think about a new model, they are just doing the same kind of web applications that they are building today, but just on IPFS. So you can do pretty much anything that you would build on a web app now, on top of IPFS. Depending on how a content updates, you might think of it a little bit different, and depending on how you wanna do control, you might think of it a little bit differently.

Let me give you some of our concrete examples. You can do file distribution really easily - this means just add static files, any kind of static file delivery; CDN use cases and so on - that’s pretty easy.

The next thing on top of that is things like package managers. You mentioned EverythingStays - IPFS began its life as a package manager itself, so the original goal was to make a dataset package manager. So I’d add the nice versioning features that we have around Git and the nice distribution system of something like BitTorrent and make it usable for moving around scientific data.

Then I just kind of realized that this would be really valuable for the web as a whole, so I really just focused more on that. The thing here is that a package manager for code like npm or a package manager for binaries like Aptitude are all very similar, and when you add hashing to how you make all of those links, you can decentralize the whole thing. You can think of package managers as moving around all of these static pieces of content - whether it’s code or binaries - and you can address all of those by hash. So you can think of making a completely decentralized package manager on top of IPFS. In fact, IPFS makes it extremely easy to do all of this. We have one package manager called GX, that you can look at. It’s our solution for package management in Go, and we use it to build IPFS.

It’s pretty opinionated and in its early days still, but it is very exciting, so check it out and think about it. And of course, there are a bunch of things coming around npm, like EverythingStays and other systems.

We were doing one where we are importing the entire npm registry into IPFS, and still using npm as a centralized registry for the naming, but have all of the content be addressed by hash and distributed peer-to-peer, so that when you npm install, you can download the files from other computers that are near you. Imagine that you are in an office setting with 50 other people or something, and you are npm installing something and you know that you have downloaded this stuff before, or that other people in the same room have downloaded it before. There’s no reason you have to go out to the backbone of the internet and download it again. So you can dramatically speed up all of this, or maybe even make it work completely offline. Imagine that the connectivity in your office falls apart, and suddenly you can still install all these npm modules because you already have them, somebody has them in your office.

How do you know of versioning at that point? How do you know that you are getting the latest version? Is it too late to ask that question? I mean that’s what I think about when you say stuff like that.

It depends on how the caching and the updating of the versions happens. One model here is that the registry, the index of versions, so how a name maps to a list of versions that have been published - that itself you can download and cache, and that’s not very big, so you could cache all of that pretty quickly. Maybe you can’t get the latest version that was published right now, but you can get the version that was published an hour ago, or right before the internet went down. So you can think of accessing data as not a strictly online procedure that happens in that moment, but rather this more asynchronous thing where everything is sort of more eventually consistent. That’s one way of looking at it. It’s not a strictly eventual consistency, it’s a different property, but…
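One way to picture the cached-index model is the Python sketch below: the name-to-versions index is a local copy, and the package bytes are fetched by hash from whichever nearby peer has them. Everything here (`Version`, `latest_known`, `install`, the peers-as-dictionaries model) is an assumption for illustration, not how npm or any IPFS-based registry actually works.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    number: tuple      # e.g. (1, 2, 0)
    content_hash: str  # SHA-256 of the package archive
    published: int     # publish time, seconds since the epoch


def latest_known(cached_index: dict, package: str) -> Version:
    """Newest version in the local copy of the index; newer ones may exist upstream."""
    return max(cached_index[package], key=lambda v: v.number)


def install(cached_index: dict, package: str, peers: list):
    """Resolve the name from the cached index, then fetch by hash from any nearby peer."""
    version = latest_known(cached_index, package)
    for blocks in peers:
        data = blocks.get(version.content_hash)
        if data is not None and hashlib.sha256(data).hexdigest() == version.content_hash:
            return version, data
    raise LookupError(f"no reachable peer has {package} {version.number}")


old_archive, new_archive = b"left-pad 1.0.0", b"left-pad 1.1.0"
old = Version((1, 0, 0), hashlib.sha256(old_archive).hexdigest(), 1463000000)
new = Version((1, 1, 0), hashlib.sha256(new_archive).hexdigest(), 1463800000)
cached_index = {"left-pad": [old, new]}
office_peers = [{}, {new.content_hash: new_archive}]
```

The `published` timestamp is what lets a tool say how old the newest known version is when it cannot reach the registry.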

And I guess the push/pull process provides the authentication to trust?

Yeah, exactly. So you have the hashes and they’re signing, for mutable things you can sign them directly. So npm can sign the registry and the update to the registry and distribute those, so you know that they’re valid. And even in another more decentralized way, then individual authors could sign them; the individual authors could sign it with their key, and you know it’s a valid new version.

I’m just thinking about trust in that situation. If I can bypass the backbone of the internet and trust the local network, even if it is an older version, what allows me as the user to feel comfortable, to know that, “Hey, I’m offline, but I can trust what I am getting”. That’s what I am thinking about.

Yeah, so you can do things there like… There’s a whole bunch of interesting challenges that are more application dependent on something like a package manager; what you would want to do is expose what versions are available, and you have to know that these are the only versions available in the network that you can see right now, but the new ones may have been published, so you can attach dates to that and know when they were published. So if you think that there might be newer stuff then you know whether to use them or not.

There are some interesting challenges there, but we can think about data in a more distributed sense and offline first. These are the same kinds of questions, by the way, that people were wondering about Git at the beginning. When Git was getting started, everyone was really worried. They were like “Wait, what do you mean I can just ask somebody else’s repository for the data? Don’t I have to go to the central server?” and the reality is no, you can make sense out of all of these pieces of information. The central server is really good to maintain the latest copy or to have some notion of what the latest value that we want to agree on is, but you can get the pieces of data from anyone. And even those updates can be distributed through peer-to-peer.

Cool, so yeah, package managers are another great use case. One really exciting use case that we like a lot is distributed chat. We have this IRC channel, #IPFS - come hang out- to communicate and so on, but we also would like to be able to chat when we are, say, disconnected from the internet. For example, if we’re travelling together and we are on a train, or maybe just in some poor connectivity location, we would like to chat, and things like IRC or even things like Slack and so on don’t work in that use case, because you have to connect to the backbone and all of the messages are sent through this backbone. But what if you can have a chat room that just works wherever you are with the peers that are around you? So we’re creating a thing called Orbit. Orbit is a peer-to-peer chat application. It’s all, entirely on top of IPFS, with dynamic content, and the way it works is that all the messages are individual IPFS objects; you have a message log that points to all of the data. You can think of it a little bit like a blockchain – it’s not exactly a blockchain, actually… It’s a better data structure, it’s called a CRDT. CRDTs are a class of data structures, they are amazing. I could probably spend whole days talking about them, and I highly encourage you to have a future talk and interview about CRDTs with some of the CRDT experts out there. It’s a really good way of modeling data, and IPFS allows and supports building CRDTs on top of it.

What does it stand for?

It depends on who you ask. It actually stands for a couple of different things. It could be for Convergent Replicated Data Type or Conflict-Free Replicated Data Type or Commutative Replicated Data Type. I think there is a different version for the ‘R’ too. They are all different words for expressing the same set of principles, depending on which one you use, the emphasis and the implementation changes.

The systems actually look a little bit different depending on what you call them, but the basic principles are the same and the constructions are isomorphic, meaning that you can build the same kind of stuff on top of each other and they will give you the same properties. What these mean is that imagine if Git had no conflicts, never had any kind of merge conflicts. You are used to, say, Google Docs. Google Docs uses a thing called operational transforms. This means that when you make edits on a Google Doc, all of the operations are guaranteed to never conflict. That means that they can commute or in the end converge. So they are all convergent, you can apply them in whatever order and you achieve that exact same result.
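The convergence property described above can be shown with one of the simplest CRDTs, a grow-only set. This Python sketch models a chat log as a set of `(timestamp, author, text)` tuples whose merge is set union. It illustrates the principle only and is not the data structure Orbit uses; `merge` and `view` are names chosen for this example.

```python
def merge(a: frozenset, b: frozenset) -> frozenset:
    """Merging two replicas is set union, so the order of merging cannot matter."""
    return a | b


def view(log: frozenset) -> list:
    """A deterministic display order that every replica computes the same way."""
    return sorted(log)


# Three replicas that each saw different messages while disconnected.
alice = frozenset({(1, "alice", "hi")})
bob = frozenset({(2, "bob", "hello")})
carol = frozenset({(3, "carol", "anyone on this train?")})

one_order = merge(merge(alice, bob), carol)
another_order = merge(carol, merge(bob, alice))
```

Both merge orders produce the same set, and merging a replica with itself changes nothing, which is why there is never a merge conflict to resolve.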

CRDTs are better versions of operational transforms, or at least you can think of them that way. It’s a different research lineage, but they are used for the same kind of stuff. You can do things like Etherpad type of data structures, but you can also do something much more general like a chat application, or even something like a whole social network or email, and so on. It’s a really striking new distributed systems type thing, and super valuable research that is just now being turned into applications.

So we built a whole chat client using CRDTs on top of IPFS, and it’s really cool. You can load it up and start chatting with other people on the IPFS network, and all of the content is moving through IPFS. A lot of people were wondering, “Hey, IPFS is really cool for static content, but what about dynamic content?” and yeah, we can do that too.

The secret of making it fast there is we use pub/sub. This is the one piece that is not fully there on the public release of IPFS yet, but we are still working on the interfaces and how that will work. But, yeah, pub/sub, making it possible for some IPFS nodes to move around content to each other really quickly is a big part of making this work really nicely.
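For readers unfamiliar with the pattern, here is a generic in-process publish/subscribe sketch in Python. It is not the IPFS pub/sub interface, which the transcript says was still being worked out. The `PubSub` class and the topic name are invented, and real pub/sub between nodes runs over the network.

```python
class PubSub:
    """Minimal topic-based publish/subscribe within a single process."""

    def __init__(self):
        self.subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic: str, callback) -> None:
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic: str, message) -> int:
        """Deliver the message to every subscriber of the topic; return how many got it."""
        callbacks = list(self.subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


bus = PubSub()
received = []
bus.subscribe("chat/ipfs", received.append)
delivered = bus.publish("chat/ipfs", "hash-of-new-message")
ignored = bus.publish("chat/other", "hash-nobody-wants")
```

In a chat setting the published message would typically be small, such as the hash of a new log entry, and subscribers would then fetch the content by that hash.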

Going back to the office setting, imagine that you are talking to each other on your team chat, and imagine that the internet connectivity falls apart. You should still be able to talk to each other. You still have computers, you still have a network that works in the building. Why is it that you can’t talk to each other? That’s just a very silly problem. IPFS is meant to solve all these kinds of problems, decentralizing the web.

One of the fundamental problems with how we are using the web today is that websites and links in the web or all the content and APIs and so on, the way they work is that they force you to go to the backbone and talk to people in the backbone to make sense of the data. This creates this huge central point of failure; both a central point of failure and a central point of control. Those websites own that data, and if they disappear or they cancel the service, or they are just inaccessible because the links between you and them are failing, suddenly you cannot use that application.

This is a deeply unsettling problem. On the surface it’s like, you know, they are providing service and usually a lot of times for free, but sometimes you pay them and you know it’s a best effort service. So if it doesn’t work because there’s a major disaster or something, well tough luck. At the same time, most of our communications are starting to be moved through the web. Think about how you talk to your coworkers, or more importantly your family members. You probably use some chat system. If you were using this chat system and there was some disaster, or just a service falls apart that day, suddenly you can’t talk to them anymore. And now this superpower that you have, this amazing ability to talk to them really easily and quickly is gone. Like immediately, surprisingly and suddenly.

So we as engineers need to restructure how we build web applications to make sure that this is not a problem; that we build resilient and decentralized applications, so that these messaging platforms should be able to continue operating even in those cases. If the internet works, if I have the ability to have my computer contact yours, that should be enough to be able to communicate with you. This happens for messaging systems, it happens for web applications, it happens for chat systems in general, things like GitHub and so on.

You know, GitHub has been under a lot of attacks in the last couple of years. Last year it was taken down by a CDN problem because somebody injected some bad code into a CDN, which caused a lot of people to attack GitHub. Even earlier this year it was taken down again by some other problems. Suddenly in those times, a whole bunch of people were frustrated by the centralization of GitHub and said “Hey, why don’t we just decentralize GitHub and have it work over something like IPFS?”. Turns out that IPFS can help tremendously in this problem.

On the first hand, if a CDN was using something like IPFS, that initial attack vector would just not work. The attack that people did last year, of injecting some code into the CDN, would not have happened at all, because all the code would be certified and checked by hash. The second part is that even if you manage to attack GitHub and take it down, if you were properly decentralized, then other peers that have the content can help serve it. So it does not matter if you take one host down, other people should be able to serve the exact same content. And maybe it’s a little bit slower or something, but the important part is that the content is all there.
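The "checked by hash" point can be sketched in a few lines of Python: if a page names a script by the hash of its bytes, anything a compromised host injects simply fails the check. The function name `fetch_verified` and the callable-based fetch are assumptions made for this example.

```python
import hashlib


def fetch_verified(expected_hash: str, fetch) -> bytes:
    """Fetch content from any source and accept it only if it matches the expected hash."""
    data = fetch()
    if hashlib.sha256(data).hexdigest() != expected_hash:
        raise ValueError("content does not match its hash; refusing to use it")
    return data


script = b"console.log('analytics');"
script_hash = hashlib.sha256(script).hexdigest()


def honest_host() -> bytes:
    return script


def compromised_host() -> bytes:
    return script + b" attackSomeSite();"
```

Calling `fetch_verified(script_hash, honest_host)` returns the script, while the same call with `compromised_host` raises `ValueError`. It also no longer matters which host served the bytes, which is the second point made above.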

So this is one of the important parts of decentralizing the web. IPFS, in a big way, is becoming a big push to make sure that the web itself is decentralized. The thing is that there are certain problems when centralized websites impose a point of failure and a point of control. So if we use a better model of moving around the data, then we can save ourselves from these deep problems of the web. We can make it more resilient, we can think back around actually controlling more as a user where the data ends up, and who uses it and who has the ability to address it.

One way of summarizing, in a big way, what IPFS is about is imagine when you go to find a book, imagine that people told you that the only way you could find the book is by giving you a bunch of directions of how to find it at a specific library. And suppose that you live in New York and to find this book you have to go to San Francisco, and you have to go to a specific library in San Francisco and get the book there, and that is the only possible way of reading this book. That is really silly. Why couldn’t you just get an ISBN number or the title for the book and look for that book elsewhere, in a different library? That is what IPFS is about - it’s about making it possible to get the content from whoever has the content and making digital information move like physical information. You can get a copy anywhere. As long as somebody has a copy to give you, you should be able to get it and use it.

This has vast, deep implications for how content moves, how resilient the network is, how applications operate on top of it, and the points of control around the data. Imagine if I could give you a link to… People use things like Twitter or Medium to publish a lot of their thoughts, right? This is a really valuable form of expression that people have, a bunch of important communication happens over these networks. So imagine that all of these services go away, and suddenly all of those posts or tweets or whatever just disappear, or all of the links for these break. Even if you can download data or something, the links will break. But what if instead, when you add data, you could get a link directly to the data itself? Not going through an intermediary, not going through Twitter.com, but rather going directly to the tweet or the Medium post and being able to move that around without having to trust that these organizations will continue to exist decades from now.

Yeah. I was gonna ask you the question on what you had said in your talk at Stanford, that the future of web could be in danger, but it sounds like you’ve pretty much answered that by these examples of the danger of… The future of the web could be that without decentralization, we kind of give up control, as you said, to these networks, and whenever they decide to go away, whether it’s because there is an internet outage or a connectivity issue, or something more serious like a business issue - let’s say Twitter fails as a business and goes away, we’ve got all this collective effort that is all this expression as you had said, which is valuable, that just now goes away. Is that what you meant by the future of the web could be in danger? By the fact that if we don’t think about decentralizing content networks and data networks like that, that we could be giving up too much control, and there’s a way we can actually build in the security and control for the long term by leveraging IPFS?

Yeah, exactly. These are important concerns about how the data that we create and publish moves through the network, and how we address the data is a huge part of this. Today in HTTP, we address the data through a DNS domain name, which maps to an IP address - that means a specific location in the network. It means a specific set of computers usually controlled by one organization. Whatever happens means that that business could go away, that organization could go away, that service could be canceled… Think about how many services have disappeared, and you suddenly wake up to a notice one day that tells you, “Hey, this service is going to be taken down in a month. You have one month to take all your data and move it elsewhere.” What about all of the links that you gave to other people? Suddenly all of that breaks. So we are tired of that kind of model and we don’t think that that’s at all ethical first of all, or correct. There’s a whole bunch of concerns - you can’t force people to continue providing a service that they just can’t in terms of a business. That make sense. But there are ways in which we can model how to structure and link to the data such that it doesn’t matter if that service goes away. The data is still there, and the data can still be accessed and the data can still be backed up. That’s a big part here, making sure that the links don’t break, making the links be able to last in the long term.

Gotcha.

Yeah, a lot of this is part of the archival efforts as well, right? So think about being able to archive.

Yeah, but again to the way the network works is… If it’s built-in, you don’t really have to think about it, it’s just part of the way that the file system works. We could be in an age where five years down the road this is getting more widely adopted and more networks will use it, and it might even take over a larger portion of what we know as the web today, and it’s built-in. You don’t really have to think about data being lost or networks closing, or a file not being there, because it resolves no matter what given the protocol.

Yeah, exactly right.

Well, Juan, it was definitely a fun deep dive into this topic. I know that interplanetary travel is fun and so are file systems, so why not combine the two? I had to put that joke in there. And I swear, Jerod, I kind of wish we could play the Beastie Boys… They will probably sue though, which is a shame.

That’s fair use, you have just gotta keep it short.

Yeah, I gotta keep it short. Also I did some looking up that IGFS is – while it may not be suitable, there is still time to change. So you could change to IGFS.

[Laughter] Yeah, maybe for April Fools day next year, we will work around everything.

There you go! Well, Juan, anything else you want to mention before we close the show? We have about two minutes to close.

Yeah. First of all thank you very much for having me, this was a really exciting discussion. There is a lot of stuff to talk about. The project is really, really big. I want to give first of all one huge shout out to the entire IPFS community that is building this up. This is not really my project anymore, this is a project that everyone owns. It’s a huge, large project with lots of people contributing, lots of people making it happen and lots of ideas. I get to sit here and represent other people, but there are some amazing contributions from all sorts of people around the world. With that, our open source community is super open. There is open source and then there’s open-open source. We are like open-open source - you can come in and file bug reports and all that kind of stuff, but also tell us what you would like to see, what features you would love to have, if anything needs more documentation. Come and hang out on IRC, come and contribute on GitHub. IPFS will become what you make of it. It’s a big call to people out there to come join us and help remake the web in a much better and decentralized way. Everyone’s welcome and we definitely look forward to the project growing and so on.

I am looking forward to seeing some listener pick this up and email us back with something he created using it, and in that way the complete circle could be made, and then we can have them on our show, talking about how they leveraged IPFS to build the next big thing or whatever.

That would be fantastic.

That’s the best way to do it, right?

Yep.

So listeners out there, we thank you for tuning into this awesome show on rebuilding the web basically and admitting some danger that could be in our future if we don’t decentralize. So if something has been interesting you about networks, that this problem has been there but IPFS solves that problem, then go build it or at least think about it and share that back with us and tell us what you think. Juan, thanks so much again for joining us today and having this conversation. That’s it fellas, let’s call this show done and say goodbye.

Goodbye. Thanks, Juan.

Yeah, thank you very much. Bye!
