Nick Sweeting joins Adam and Jerod to talk about the importance of archiving digital content, his work on ArchiveBox to make it easier, the challenges faced by Archive.org and the Wayback Machine, and the need for both centralized and distributed archiving solutions.
Featuring
Sponsors
Fly.io – The home of Changelog.com — Deploy your apps close to your users — global Anycast load-balancing, zero-configuration private networking, hardware isolation, and instant WireGuard VPN connections. Push-button deployments that scale to thousands of instances. Check out the speedrun to get started in minutes.
Timescale – Purpose-built performance for AI. Build RAG, search, and AI agents on the cloud with PostgreSQL and purpose-built extensions for AI: pgvector, pgvectorscale, and pgai.
Wix Studio – Wix Studio is for devs who build websites, sell apps, go headless, or manage clients. Integrate, extend, and write custom scripts in a VS Code-based IDE. Leverage zero-setup dev, test, and production environments. Ship faster with an AI code assistant. And work with Wix headless APIs on any tech stack.
WorkOS – AuthKit offers 1,000,000 monthly active users (MAU) free — The world’s best login box, powered by WorkOS + Radix. Learn more and get started at WorkOS.com and AuthKit.com
Notes & Links
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | Software moves fast. So keep up. | 01:06 |
2 | 01:06 | Sponsor: Fly.io | 02:29 |
3 | 03:35 | Let's talk archiving | 01:58 |
4 | 05:34 | Wayback Machine has had challenges | 08:51 |
5 | 14:25 | Starting ArchiveBox | 01:46 |
6 | 16:11 | Fahrenheit 451 | 06:36 |
7 | 22:47 | The internet is young | 06:58 |
8 | 29:45 | The time unlock | 03:15 |
9 | 33:00 | Sponsor: Timescale | 02:17 |
10 | 35:17 | Sponsor: Wix | 00:54 |
11 | 36:11 | Archiving for legacy | 02:05 |
12 | 38:16 | 2070 headline | 03:18 |
13 | 41:34 | How does it work? | 02:54 |
14 | 44:28 | Dealing with archive and file size | 02:13 |
15 | 46:41 | Nick uses ZFS!! | 01:31 |
16 | 48:12 | Going mainstream? | 01:15 |
17 | 49:27 | Accessing credentialed stuff | 02:27 |
18 | 51:54 | Single or multiplayer game? | 02:53 |
19 | 54:47 | Running ArchiveBox | 01:42 |
20 | 56:29 | abx-dl is cool | 03:10 |
21 | 59:39 | Adam's confession | 01:31 |
22 | 1:01:09 | This needs to become a thing | 01:31 |
23 | 1:02:40 | Nick's personal archive | 00:49 |
24 | 1:03:29 | Sponsor: WorkOS | 02:50 |
25 | 1:06:19 | This video is no longer available | 02:53 |
26 | 1:09:12 | Seriously using ArchiveBox | 02:40 |
27 | 1:11:52 | Mostly cooking videos | 03:35 |
28 | 1:15:27 | An internet on the internet | 01:33 |
29 | 1:17:00 | Session replay for your archive | 01:45 |
30 | 1:18:45 | An index worth sharing | 02:33 |
31 | 1:21:18 | Adam is sold, Jerod is not. | 00:43 |
32 | 1:22:01 | Thanks Hack Club Bank (HCB) | 01:29 |
33 | 1:23:31 | Nick interviews Adam and Jerod | 04:24 |
34 | 1:27:54 | You as a model | 03:11 |
35 | 1:31:05 | Are we done? | 00:49 |
36 | 1:31:53 | Closing thoughts and stuff | 04:08 |
37 | 1:36:01 | ++ Teaser | 01:35 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
We are here with Nick Sweeting, a full stack software engineer in Oakland and founder of ArchiveBox.io. Nick, welcome to the Changelog.
Thanks for inviting me. It’s a pleasure to be with you all.
Pleasure to have you. You want to archive stuff. Let’s archive stuff.
Let’s be pack rats.
Let’s start with “Why archive?” I mean, isn’t that just a lot of work and no gain? Why archive stuff?
Yeah, it’s a totally valid question. I think for most people the answer is maybe you don’t have to archive stuff, and that’s okay. Archiving is sort of a curation role, and some people are drawn to it, and some people are not. And I think that responsible archiving involves some amount of curation labor. It doesn’t have to be a lot of labor, but it’s the labor of choosing what’s important and what is not. And that can be just for yourself, it can be for your family, it can be for your friends, it can be for your academic institutions… But it is some labor that you’re taking on by deciding to preserve something, and just acknowledge that and pat yourself on the back. And if you do decide to archive, keep in mind that it’s not just a one-time decision. You’re going to have to decide, “Oh, do I move this data from this hard drive to the next one when it inevitably gets old? Do I give this data to my kids? Will they care about it? Do I give it to a library? Where does it go next? What should I do if someone asks me to delete it, and they don’t want it preserved?” And all of those things are sort of things that you have to think about. But if you’re excited about archiving, don’t weigh yourself down with all of that. Just save one or two things and see if you like it.
When it comes to archiving the web, or digital artifacts - I’m not sure how broad ArchiveBox’s ambitions are - but I thought we had archiving the web kind of figured out. There was a whole group of people who were enthusiastic about it, and still are enthusiastic about it. Of course, I’m referring to archive.org, the Wayback Machine, and that entire operation… Which felt like the web’s archive was in good hands. And all you had to do is donate to those good hands, or support those good hands, and hope that everything continues as normal. But recently, it seems like they’ve been going through trials and tribulations. And I’m not sure the exact details of who and why have been attacking the Wayback Machine, and trying to take archive.org either offline, or somehow ruin it… But it seems like maybe that’s an assumption that is not well-based. What do you think about that?
I think archive.org is doing an incredible job. They’re tasked with a really hard problem of doing this labor that I just described, but at a massive scale, for the entire Internet. They effectively become moderators for the entire Internet. Because if someone doesn’t like the content that they’ve decided to preserve, which is basically everything they can get their hands on, they get personally attacked, and they have to take the flak for it. So it’s a really, really tough position that they’re in as the sort of centralized curators of everything. And inevitably, they’re going to get attacked by people who don’t like stuff. And I think that they’ve done an incredible job so far, but there’s limits to a central moderation team that has to be able to manage and defend every piece of content on the internet from attack.
So they’ve undergone attack recently. Do we know the motivations of these attackers? Is it simply we don’t know yet? Adam, do you know? Nick, do you know? I don’t know, that’s why I’m asking the question, in earnest. I don’t know the answer to this.
They’ve actually been going through a lot of stuff. I mean, they had not just a DDoS attack situation, where you have somebody trying to take it down, or keep the site offline… They’ve had a major copyright case loss recently, where they were trying to archive things that – I think we as society want these things to be archived, and like you had said, Nick, this might be part of that curation aspect, just us as humans wanting to preserve. Not so much to break copyright, but there was some breaking there.
[07:59] So there’s a point of breaking, I suppose, or a breaking point, with the Internet Archive, where you’ve got copyright concerns, things like that. They’ve had various kinds of attacks that aren’t just simply an attacker or an attack vector trying to take it down. It’s beyond that.
I would say one thing about the copyright case, if you’ll allow me a moment…
Yeah, please.
Their stance is pretty admirable. I originally was quite worried about it and I commented online and was like “Yeah, why are they risking the whole Internet Archive to take this stance? It seems like they should spin out a separate company if they really want to fight the publishers on this.” And I talked to Brewster about it, and I’ve sort of come around now.
Who’s Brewster?
Brewster’s the founder of Archive.org.
Okay, great.
An incredible character. It’s been his life’s mission to make all of human knowledge available for everyone. And I think he’s doing a great job. But his take on it was that he’s personally wealthy from a dot-com era sale, and he wants to do good things with that money. And part of that is rebuking publishers when they start really crossing lines around content ownership. And the Archive.org is actually properly legally structured so that these things are isolated. He’s not risking Archive.org and the Internet Archive by doing this, by taking this fairly strong stance against publishers forcing content licensing as the only option upon e-book readers. So basically, publishers were saying, “We’re not going to sell you an e-book anymore.” And this effectively makes libraries lending e-books impossible, because you can’t re-share the license to an e-book. They want to charge for every view of the e-book.
And so libraries can no longer lend e-books… And so he just thought that this was an egregious line to cross, and he was like “Okay, as someone fairly well-off, who cares a lot about this and who cares a lot about the freedom of information access for future generations, I can afford to take a stance and lose sometimes on cases like this. And I think that this case needs to be very publicly fought, and won or lost. And it’s not jeopardizing the rest of the Internet Archive.” I think that that message doesn’t get out enough. So they did the right thing there.
And they have this software that does – it’s CDL. You may know this, Nick… Controlled Digital Lending is what this program is called – it’s not just software, it’s a program they had in place to allow this. I wasn’t sure of the details of which books… I think it was mostly older books. But it was fought in the Second Circuit Court recently, in September – that’s why this is so fresh in my brain; at least the details, to some degree… The court essentially ruled that this practice of controlled digital lending that the Internet Archive was doing harmed the publishers’ markets by providing free digital copies of books. And I don’t know those specific details, like which kind of books, were they new…? I mean, obviously, if they’re new, that doesn’t make any sense. But if they’re older, or it’s sort of like almost public domain, maybe that makes sense… But you know.
Certainly if it’s public domain, it makes sense.
Yeah. I mean, I think at that point you don’t have much of a leg to stand on in terms of the fight. But I’m for freedom of information. I’m not for freedom of information insofar as it takes away a corporation’s ability to control their own work, and their own financial destiny with the things they’ve helped create in the world as information. But there is a line there that at some point we have to adjust. And I applaud them for trying to adjust it.
Yeah, I think they broadly agree with not depriving publishers of content ownership. That’s not really the issue they’re fighting. They’re more fighting that the publishers crossed a line by forcing licensing as the only option for content access. And that that was not where the line was before. That they moved it, and this was their way of fighting back. And that there’s broadly been a sort of Overton window shift of what is acceptable content release policy in the first place. And the publishers have successfully moved that to licensing only, and you can no longer own anything. And that’s what they were fighting.
[12:08] So yes, they did cross some lines with the controlled digital lending, where they were not counting how many copies they lent out… And I think that they expected to get sued for that. I think that they wanted to take a fairly strong stance there by saying that the way that the publishers are releasing the content in the first place is unacceptable.
We can go 17,000 more layers deeper on this. There is an article on the EFF, or I should just say eff.org, the EFF website, Electronic Frontier Foundation, that gives a few more details. There are four different publishers: Hachette, HarperCollins, Wiley, and Penguin Random House. And the stance basically was that these libraries have paid publishers billions - I’m quoting, “Libraries have paid publishers billions of dollars for books in their print collections, and they’re investing enormous resources in digitization in order to preserve those texts.” And they say “The CDL helps to ensure that the public can keep access to those full books that they’ve bought and paid for”, basically. It ensures usage of digital versions of the books they’ve already paid for. So it seems like there’s some details there, for sure. But they’ve lost that case publicly, recently… But again, it’s back to several different ways this central point is being attacked, whether it’s in the court of law…
Legally, or technically…
Yeah. Which brings us back to ArchiveBox, and maybe its need; the need for it to be distributed.
Yeah. I just think fundamentally that both should exist. I think having big centralized resources is awesome, because centralized moderation is effective. You can keep bad actors out, if you take a stance and you don’t get dragged down by politics too much. You can do a really good job, and you can provide an amazing free public resource for a lot of people, and that’s awesome. But we should also have distributed archives, that cover all of the things that the central archives can’t, just from a scale perspective. A lot of different people saving stuff on a lot of different hard drives is always going to be able to save more, and know about more content. Not everyone wants to report what they find to the Internet Archive. Maybe you want to save something without announcing to the world that you’re saving it. There are lots of reasons: political, personal…
Sure. So when did you start ArchiveBox? And what was the initial inspiration for that? What made you actually get the editor out and start coding?
So I’ll start with the initial inspiration… I grew up partly in China. My family moved when I was nine, and I did middle school and high school there. I had an amazing time. And I obviously ran into the problem of having censored internet. So we’d read news articles, and then 20 minutes later you refresh and it’s a 404, it’s gone.
The Great Firewall.
Yeah. So just for practical reasons, you get used to saving pages out of your browser, or screenshotting them, or making PDFs, just as a default, whenever you find something interesting, in order to be able to share it with people there. And so that led to creating a small tool called Bookmark Archiver, that I was just using to auto-download all of my Pocket saved articles. And that was a side project for many years. And I’ve come back to it over time, adding features here and there… And then I used it – there was a funny security incident when Equifax got hacked. I used it to make a spoof site impersonating Equifax’s site, and got a whole bunch of viral attention for that. And I was like “Okay, this is just a random, interesting side project, and not actually what I care about working on…” But a nice thing to come out of that was a bunch of attention towards Bookmark Archiver, where a bunch of people were like “Oh, I would use this. This seems useful.” So I’ve been slowly chipping away at it, adding features over the years… And then I quit my consulting job a couple years ago, and decided to work on it full-time. And over the last year and a half I’ve been building it up full time.
[16:06] Wow. Some layers there, for sure… I was thinking about this - not sure if it’s a direct one-to-one, but have you read the book Fahrenheit 451?
Yeah.
Okay, you smiled. Nobody saw that smile. What made you smile about that?
[laughs] He’s read it.
Well, there’s a lot of interesting layers to that book that are becoming increasingly relevant… Which is kind of terrible, but I don’t know… There’s a lot of misinformation and disinformation these days, and it’s sort of at the foothills of where Fahrenheit 451 starts, before outright deletion of information becomes acceptable as a public strategy.
That’s my concern. Jerod opened up with “Why should we archive?” Is it a - you didn’t say a fool’s errand, but you’ve said that before in other cases, I’m sure, that you’d probably… You didn’t say that.
No.
What would you say then? Is it not important? What’s your –
No, I think it’s incredibly important. That’s why I’m like “Let’s get ArchiveBox on the show.” And I’m a huge fan of archive.org. I think it’s a shame that it’s having so many problems… And I think that if we can decentralize those problems across a bunch of people, that’s probably better. So no, I’m not against it, by any means.
Gotcha.
I don’t think it’s a fool’s errand. I do think it’s a hard problem, and –
Laborious, for sure.
…and expensive, and lots of stuff, which is why the software needs to be there. But go ahead, Adam.
I was not trying to say you were seeing something bad or good, but it just seemed like “Why do it?” was the question, in a negative light.
It was the question.
Like, “What’s the point?” Yeah, what’s the point of this?
I think it’s a valid question.
Yeah.
I think so, too.
But I think when we cross-examine the challenges which we opened up with for the Internet Archive, and then this book, or the premise of this fictitious book in light of today’s world, and then your history of living in China behind the Great Firewall, and the challenges that come from internet disappearing, essentially… Like, truth is – you can go online and see a price for something, and tomorrow that price changes. But unless you screenshot it or something like that, you can’t go back to that retailer and say “Hey, look, the deal should still be the deal.” They’re like “Nah, we just changed that price behind the scenes”, or something like that. Like, your only truth is the artifact you can claim, or that you have a hold of… And I think that’s kind of the premise of the desire to archive the internet, so that we can preserve it for years to come, but at the same time, just to hold true what’s true.
I think there’s one more public perspective that’s pretty common that’s maybe worth addressing around why is archiving worth it. A lot of people sort of have the valid idea that, “Oh, with AI tools, or with modern technology or better tooling over time, we can have our computers just sort of osmosis all of the content, and keep track of what’s important for us, and we don’t actually need to preserve the actual website the way I saw it originally.” Like, “I’ll just use a browser extension that sort of ingests it all into a model.” Or “Oh, they’re training models on the whole internet all the time anyway. Why do we need to save the original sites? Let’s just keep these models over time, and that’s good enough.” I think that that’s – it’s a reasonable thought, and that might work in the long run… But I think in the short run, we haven’t seen those models be accurate enough to recall all of the original content without hallucinating at all. And then, unfortunately, the subsequent models get trained on the output of the initial models. So it’s really important to keep those primary sources around for as long as possible, because our future kids’ kids’ kids’ kids might care, for historical purposes also, “What did websites look like?”, but also for contextual purposes. How is this content delivered? In what format? What ads were on the page? All of these things are things that future people might care about, that might not seem important now. That’s part of why archiving, this active curation, this active labor that I describe it as is important, is because you’re trying to preserve as much of the original historical context of the world around this piece of content at that moment in time, with the content. It’s not always just about the raw content.
[20:05] Right. Well said. And I don’t think the technique you describe, because of the way the large language models work - I mean, they are effectively compression algorithms… And so lossy by definition. I mean, they’re not lossless. Maybe eventually they become lossless, and so they can have both your compressed artifact and your original artifact perfectly pristine. Well, then they’re just archivists, aren’t they? And so we are still archiving. We’re just letting the machines do it. But you’re kind of letting the machine do it, right? That’s what your software does.
Sort of. Yeah, so I actually don’t take as much issue with the compression. I think all archiving is lossy, to some degree. I take more issue with the lack of perspective of the tool. I think that the perspective of the person doing the saving is almost as important as the actual record. Because if I visit a website in the U.S., on the Eastern time zone, I’m going to see a totally different New York Times homepage than if I visit it from Germany. Or if I visit my Facebook timeline, it’s going to look totally different to me than to someone else. So the perspective of the person viewing it is almost as important. And these models don’t have that perspective. They don’t record any information about who’s doing the saving, why are they doing the saving, when did they do the saving, what did they visit before and after? And so all of that stuff is part of the curatorial work of creating these archives.
Gotcha. So that’s something that’s unique to the web then, because of the dynamism of the documents… Because if we were going to archive ancient writings, maybe you want to know what cave this came from, and all the context you could possibly gather. But there’s not the perspective of the gatherer… Maybe they choose to exclude some stuff, or - you know, there’s censorship and things… There is a bit of an editorial to decide what to archive, but based on one person in Seattle and the other person in London gets two different web pages - that’s a really good point, but it seems like it’s almost unique to web.
If you go back far enough, I think you’ll encounter editorial adjustments more often. History is written by the victors…
Sure.
…and the victors are the ones who retranscribe it over the years. And so you’re essentially getting layers of delayed perspective added. I think if you look very closely at any sort of historical archives, the older they get, the more perspective is necessary, because those are each layers of decisions to decide to keep this around.
Fair. The documents don’t change though, right? Unless they’re literally changing them. That’s fraud then. So we’re not talking about fraud.
Hopefully, yeah. But then you get libraries of Alexandria, and you have to retranscribe things from memory, or oral history… Once you get to a long enough timescale, it all becomes layers of recollection. But yeah, you’re right. Hopefully, the documents don’t change in the 100-year timescale.
Yeah, the interesting thing though is that the Internet is fairly young in comparison to pretty much any other archived medium. It’s one thing to have an archive or a museum of paintings, or of art, or of different artifacts. The web is a uniquely – like Jerod said, it’s dynamic. But at the same time, the perspective of – we don’t know what’s important right now, until later. So it’s almost like archive as much as society might think is important, because we’re not really sure what is important right in this moment. We have to have a zoom out, which is time. The time is the perspective. 50 years from now, the world and the web - or whatever the web becomes, or whatever the web makes the world become - will be drastically different, for sure… Bad or good, we’re not sure. But today’s breadcrumbs, so to speak, may point us to why or how or what later on. Because the questions we’ll have later are unknown to us. It’s almost like the unknown unknown. It’s just archive it all as much as you can, and distribute it and protect it, so that we have the opportunity for the lookback.
Yeah. That’s, I think, a good strategy for a central actor, like archive.org. Their strategy is just archive literally everything they can get their hands on. You submit a URL to them, they’ll archive it. I do think that breaks down somewhat in distributed archiving, where the goal is slightly different, because you’re empowering individuals to save things that they care about.
[24:17] It’s a little counterintuitive, but actually recommending that people save as much as they possibly can tends to backfire, because they end up with massive, multi-terabyte collections that they just can’t handle, they can’t deal with. They don’t know who to send it to, and eventually they stop paying for hosting. So that’s why I really stress this sort of archiving as an act of curation line. It may get old, but for distributed archiving, it’s especially important. It’s especially important to recognize that the people running these are really contributing labor. They’re contributing public service to other people, and they should do it to the extent that they can sustainably do it. And if you dive headfirst into saving everything you possibly think is useful, I’ve seen many, many people burn out on archiving from that. It’s a fad, they’ll get into it a little bit, they’ll download 10,000 URLs, and then they’re like “Okay, I don’t know what to do with this. It’s too big to search, it’s too big to use… It’s kind of cool. Maybe I’ll send it to someone.” And it actually dies faster. Whereas if you empower people to archive what they care about, and sort of harp on that a lot, so that you make it easy to curate and tag and add a context… It’s the context that indicates why it’s valuable, and it’s a different strategy than a big library of Alexandria warehouse where you just store everything you possibly can. It’s more about having nodes of these curations of different groups, and these nodes can then start sharing what they think is important with each other, and through this sort of federated network of decision-making on what is important, you end up with the same average result at the end, of basically everything that anyone has cared about at some point being saved. But putting that whole responsibility on one person of “Oh, if you’re starting archiving, you must archive everything you possibly can”, I think it actually tends to backfire more than it does good.
I can certainly see that. So ArchiveBox is to empower individuals to archive that which they care about from the web; so this is a tool for downloading web pages, storing them offline in their own little archives, that you can bring them up and look at them again… You know, HTML, JavaScript, PDFs, images… The raw nuts and bolts of what puts a website together.
Is the end goal then, like, we all have our own little archives - is it like you described, and like ArchiveBox is somehow going to provide this Wayback Machine based on this federation of me agreeing, and other people agreeing? …which feels a little kumbaya, but would be awesome if we all agreed to share our little view of the world with everybody else. Is that the idea?
No, actually. So I don’t want archives to be necessarily defaulting to being public for everyone… Because, again, that’s not the role of this distributed archiving tool. It’s a great role for a library, but it doesn’t work as well for distributed archiving, because of cookies, because of authentication. Basically, one of the main selling points when you actually get down to it and you’re like “Alright, do I really run this tomorrow? Is it worth it or not?” is “Oh, I can save my social media, I can save stuff behind paywalls, I can save stuff that I have to be logged in to see.” Archive.org cannot save any of that, and they won’t take it. Or they’ll upload it for you and they’ll hold it privately for you, but you won’t be able to share it with anyone, because they don’t want your cookies, right? They’ve archived your cookies, your login sessions, all of that.
So a lot of that content is kind of unshareable until you die, or stop using those accounts. And so it gets really tricky. Like, that’s the main selling point of saving stuff locally. If I start adding features of like “Oh, share your archives with the whole world”, most people don’t want that. They’re saving their Facebook photos, they’re saving their – yeah, the news articles and stuff they read, but also a lot of their own personal browsing history. They don’t necessarily want to share the URLs only, and they don’t necessarily want to share that snapshotted page content. But it’s important for the longevity of humanity and this information for it to be shareable eventually. And so I think very carefully about sort of different ways to tackle that issue. It’s a really human issue, it’s not a technical problem. Do you have time unlock? Do you try to incentivize people to donate their archives to a public collection by providing free hosting in exchange for them releasing the information? Do you have scrubbing tools that try to go through and scrub all the sensitive information? If you do that, where do you stop? Because you are tampering – you know, archivists try very hard, as you were saying, to not tamper with the original documents. But if the original document has someone’s personal email and username and password in the HTML somewhere.
[28:45] There’s a tradeoff at some point. You do have to scrub that for it to be useful to other people without being harmful to the original curator. Curation is an act of labor. We shouldn’t punish those people doing the curation by spreading their social media logins to the world.
So it’s a very delicate balance, and I think that the answer is there’s no one permission setting that gets pushed on everyone, ever. This tool is never going to force everyone to upload all their archives to a big, federated network. This tool is never going to force everyone to only have private archives and not be able to upload stuff to a big federated network. Instead, it’s going to give a range of options, and it’s going to be annoying to some people that they have to decide “Do I share this with other people or not?” But I think that that’s the right move for now, is giving the full spread. “Do I keep it local? Do I share it with my neighbors, who I know and trust? Or do I share it with everyone in a big, untrusted, scary world, where someone might use this content to hurt me later?” And every social app, network platform has to make these decisions when they first start.
For sure.
The Time Unlock is super-interesting, because we recently spent some time with Jordan Eldredge… I’m not sure if you’ve heard that episode, Nick, the Winamp era, where he had dug through different Winamp themes.
I love Winamp.
And he had found in these themes all kinds of digital artifacts, things that shouldn’t have been there… Because he has this Winamp – not theme; what are they called? Skins.
Skins, yeah.
He has this Winamp Skin Museum, which is really rad. And in that, he had found old pictures that people – it’s basically a compressed folder of files. And in there is the stuff you’d normally have for a skin, but then random things that he found in there. And he shared some of those. And we were looking at pictures of people from the ‘90s, and old audio files of like kids at their computer, recording weird noises… And it was just really enjoyable to kind of have that snapshot of the past. People we’d never met and never will meet… Sure, if we had seen it right after they had taken it, now it’s like almost a privacy violation, right? They’re like “I didn’t want you seeing that.” Well, you shouldn’t have dropped it in your Winamp Skin.
That’s right. Purposefully.
But over time, they’re gone, and old, or dead, or… It’s just like, the context is gone. There’s no fear there. And it’s really – for us it was nostalgic, but there’s lots of reasons why that would be interesting. So I like that time unlock option.
Like you said, maybe I donate my archive when I die, or every 20 years, go back 20 years, and those are now publicly available. Similar to how stuff gets declassified in our government. I think that would be really cool.
Yeah, that’s sort of what I’m gravitating towards as an initial carrot to offer - if you agree to time-unlock, then I’ll host your stuff for free as a backup. It gets dicey when I have to re-host content for other people… So the way archive.org works is they basically operate as a library. They’re a nonprofit institution, they don’t earn income from their hosting… They have a separate LLC that does some paid services, but it’s a separate LLC, and they’re basically not earning revenue directly off of re-hosting often copyrighted content.
[31:49] If I ran a public hosted service where I’m mirroring people’s content, I would have to either be a library like them, in which case I can’t accept payment for hosting at all. So this is the only way that I could offer to host people’s stuff. Or I have to figure out some other new legal system that hasn’t been invented yet to do this. Basically, you’re trying to make a business out of BitTorrent, right? It’s a very similar problem. It’s very hard to charge for this and not be legally liable for re-hosting copyrighted content. So there’s probably some middle ground where it’s a – people are buying an app that they are running locally, that they are operating, that’s connecting them to other people running this app. But I am never hosting public stuff unless I have a signed release from the archiver saying that they own the content, and it’s okay to be time unlocked. And then I’m taking on the central moderation labor of, unfortunately, delisting that stuff if I get copyright complaints. If someone sends a DMCA notice and says I have to take it down, I have to comply as a central agency. But the people running those individual archiving apps can still share it if they want to. Something like that. That’s sort of a middle ground option.
Break: [33:00]
I would be motivated to archive for legacy. What’s internet today for me is not the same internet of tomorrow for my kids… And so I think that would be where I would personally find some motivation. And I’m kind of hanging out in that motivational space because you’re describing – you know, archive 10,000 URLs, you get burnt out, and you sort of quit. And so the job of you is to instill the obvious software to do the job, but at the same time bootstrap and educate the people that you want to sort of clone and say “This is why it’s important. Here’s how you can use it for yourself. Here’s ways you can even share it with others that make it so that you stay motivated.”
Yes.
So I feel like an archivist or a curator is motivated by their own desires, but at the same time if I can’t show off my stuff, the things I think are cool, or have a purpose or a reason to do it, I will eventually become bored with the practice and just basically move on. I think for me personally, I would want an ArchiveBox for my future generations. And it’s not to be narcissistic, it’s my people, my closest people who I really care about in life. Sure, I care about everybody and I’m a kind person, but at the same time, family is family. I want my kids to know where I came from, what was important about me, and maybe it’s part of the podcast. Maybe it’s part of the WebAmp museum, so to speak. These little things that were cool to me, that eventually my kids can spelunk and be curious and explore and find new things, and reach back, and all that good stuff.
Or maybe they decide to donate it to a museum, and then the museum decides to bring a whole new life to it. Your kids have a bunch of interesting agency and choice that they can make. But yeah, that’s a great point. Legacy is a common attractor for individuals who want to do archiving. I’d say right now it’s an even split sort of between journalists, researchers, lawyers… Lawyers are the biggest category, to be honest, in archiving. And individuals who want legacy, or just sort of personal use, archival of their bookmarks, that kind of thing.
Imagine this headline in 2070. “Seemingly long-time digital pack rat, finally through family and legacy has had their internet archive, or their ArchiveBox donated, and it’s enabled this new technology to be the foundation of…” I don’t know, reaching for the stars here… But imagine that kind of headline. Somebody who was really archiving the good stuff, and they gave it to future society, and then enabled this brand new thing that is just super-cool.
Well, you also have certain creators through time, who were prolific, and they wrote way more than they published, for instance. And then that person died, and they became famous because they wrote such great prose. And over time, you’re “Wow, what if we had their unpublished works? What if we had their journal? What if we had their thoughts? We could mine those for such interesting insights.” Like Albert Einsteins, and such.
Yeah. There’s a delicate balance there though, because – so with any content that people create, they’re being vulnerable and sharing a part of themselves that they might not otherwise share if they knew that everything they shared was instantly public 100% of the time.
Well, I’m speaking of legacy, though. This is your foundation that you’ve arranged. We’re in the context of you saying that finally this person died, and their foundation decided to open up their ArchiveBox, for instance.
Yeah, that’s totally fair game.
Yeah. And then they probably scrubbed it first, just to make sure it’s not embarrassing, and stuff. And then the public benefits. That’s where I was going. Not just like “Hey, all your secrets are public now that you’re dead.”
Well, I was also meaning for people running these distributed nodes, I think it’s also important to sort of discourage the “Oh, archive everything you possibly see” mentality, because I think that would also kind of destroy the internet, to some degree. Part of the beauty of the internet is that there are pseudonymous spaces, there are anonymous spaces, and there are real name spaces. But you’re not forced to be one and the same identity across all of them.
[40:15] And so you get more vulnerability, more connection, more willingness to share things online that you might not have in person. And the threat is that everyone watching is actually tape-recording everything they see 100% of the time. And even if they don’t decide to share it today, within 20 years, 100% of everything is going to be online, copied by everyone. I think that that is rightfully a scary concept for some, especially people who feel more threatened.
If you don’t experience a lot of threats online day to day, it seems like “Oh, that’s not a big deal. If my stuff is time-unlocked in 20 years, I’ll be fine.” If you’re experiencing a lot of oppression today, and you don’t think your situation is going to change, having all of your social media public in 20 years might not seem as attractive an option.
And I just want to acknowledge that there’s a range of privacy that’s needed, and there’s a range of respect that needs to be given to privacy from archiving toolmakers to acknowledge that we’re not trying to build the tape recorder for the entire internet, especially the private stuff. The stuff that requires cookies and logins. Because archive.org doesn’t have this problem. They’re not archiving stuff behind logins. But of course, I am pro-archiving in general. I’d love people to archive. I feel like these points don’t get harped on enough when people talk about archiving online, and so I feel like this is the right space to give them a little bit of air time.
Sure. So tell us how ArchiveBox works then, mechanically, as a person who might use it? How do I point it at things, and how do I decide? Just walk us through it.
Yeah, so ArchiveBox right now is a self-hosted Docker app, mostly, and a pip library. So if you don’t know what those things are, I’d say ArchiveBox is not for you. There are other apps out there that do a way better job of providing a nice user interface, a nice iOS app, and all of that is coming for ArchiveBox eventually. But right now, we’re a server that you run, like NextCloud, or Plex, or Home Assistant, that you set up on a little $5 a month machine. It’s totally fine. You run a couple commands, it takes five minutes to get it running. You have an admin interface, web UI, and you have a browser extension that you can use to submit URLs. Or you can just paste in URLs manually or drag them in from a spreadsheet, or your bookmarks out of your browser… There’s ways to ingest – most of the common ways that you would want to send a list of URLs to this. Then it goes through, and pretty serially - we can’t do too many in parallel, because you’ll get blocked pretty fast… So we just go through one by one. And for every URL, we save it in a ton of different formats. So the raw HTML; we’ll save with SingleFile, which is an excellent way to get everything into one HTML file, including JavaScript and images and all that. We run wget and yt-dlp, so we rip all the audio, video, and subtitles out, video metadata, comments, photo galleries… Basically, every piece of content. ArchiveBox’s stance is to actually rip it out of the original page. We’re not trying to do the “Oh, preserve it perfectly in its original format” thing, because I think that that – even though I harped on before how important the original context is of a piece of content, honestly, it’s a really difficult technical problem. And so I’m going the other direction, where I’m actually trying to get the content out into its usable forms for LLMs, and for humans to actually use it. And so I don’t actually write it to this warc standard, which is sort of the internet archiving standard file. I think it’s a little bit unapproachable for most people who don’t interact with warc files on a day-to-day basis. And so instead, ArchiveBox writes everything as raw files to the file system. You get a normal PNG, a normal PDF, a normal .txt file with article text. You get JSON, you get just basically really simple common file formats that I think will survive for more than 100 years… And you get it all flat on the file system right there. You can just dig in and look at it. There’s no complicated binary formats, nothing like that.
[44:02] Yeah, so that’s generally how it works. And then you can set up scheduled archives that pull in stuff on a daily basis. You could archive your own Twitter feed, or Hacker News, or whatever you want. And then you can tag it, you can send an archive to someone else, you can export it statically in a way that you can share… And the distributed sharing between archiving nodes is coming. I’m working really hard on that, but that’s not out yet. So that’s how it works so far.
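For reference, here's a minimal sketch of the Docker-based setup Nick describes, based on ArchiveBox's documented quickstart. Image tags and flags can differ between versions, so treat it as a starting point rather than the canonical install.

```bash
# create an empty directory and initialize a new ArchiveBox collection in it
mkdir ~/archivebox && cd ~/archivebox
docker run -v $PWD:/data -it archivebox/archivebox init

# add a URL; it gets saved in multiple formats (HTML, PDF, screenshot, media, etc.)
docker run -v $PWD:/data -it archivebox/archivebox add 'https://example.com'

# serve the admin web UI on port 8000
docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox server 0.0.0.0:8000

# optional: pull in a feed on a schedule (flag names may vary by version)
docker run -v $PWD:/data -it archivebox/archivebox schedule --every=day 'https://news.ycombinator.com/rss'
```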
How do you deal with the – if it’s a flat file, how do you deal with file size, or archive size over time? I understand the reason why you’re doing that, because you want it to be preserved in a format that is accessible… Whereas warc, which I believe is W-A-R-C, right? That’s the file format?
Yeah.
…where it’s stuck in this other thing that may not be accessible. I don’t know. Zip files will probably be around forever, but these randos that might not be, which warc is not. But at some point somebody might be like “No, that’s not cool anymore. Regular PDFs. Let’s do that.”
I think warcs will last. So warc is actually a zip file. Modern warcs, like WACZ, are just a zip file. You can add .zip on the end and uncompress it.
Okay.
I don’t think it’s too bad. Really, once you get used to them, they’re very easy to work with, and they’re quite standard. And I think they will survive for a really long time. I just want ArchiveBox to be immediately usable by the next tool that you want to consume the data with. I don’t want multiple decompression steps, and stuff like that. So for your concern about file size - yeah, it does take up a lot of space. It’s not as bad as you would expect, though. I’d say about 1,000 URLs take up, on average, about 5 gigabytes, with most of the methods enabled. So as long as you’re not saving only YouTube videos, you can expect – if most of your content is text, plenty of images still, but no massive, massive videos, because that’s what really skyrockets it quickly… About 5 gigs per 1,000 URLs. So 10,000 - not too bad. 50 gigs, you could probably stick that on a drive somewhere. As storage gets cheaper, that’s not that big of an issue.
For my big, massive archives that I keep, I use ZFS, that has built-in compression, and lately, fast deduplication. And so I like to solve those issues at the file system layer.
Oh, you dedupe, huh?
I’m experimenting with a new fast dedupe feature. I haven’t used it on the big, big archive yet, but it’s working well.
I usually disable dedupe, honestly. I mean, I don’t have a need for it. But I think if I was running an archive, I would probably want it.
Yeah, it’s one of the few cases where it makes sense. But specifically, the new, recently released, in the last few months, fast dedupe rewrite by… Is it iXsystems? …or another company stepped in and contributed a big update to it.
Interesting.
So it’s more reasonable now for people to run it.
Yeah. As I was asking that question, I was thinking, “Adam, don’t worry about it. The file system will do it.” So I was going to ask you what your favorite file system was, or what file system is beneath this thing.
I love ZFS.
I assumed you’d say ZFS, and I’m thankful that you did.
So am I. Otherwise, we’d have a fight. It’d be Adam versus Nick, and…
It’s not worth going there.
…it’s the wrong place to slap somebody.
Well, I can appreciate your taste in so many other things that I know that you appreciate ZFS. So there you go.
There you go. And that really - I mean, I’m a ZFS guy myself. That’s exactly what I would put this archive on. I would spin up a new ZFS file system, and I would let that file system do all the work of compression, dedupe, stuff like that that would matter, and let the ArchiveBox do its thing, which is what it should do. Let me, as the user and the curator of it, interact with the original file system, or the original file types, versus what the file system can do for me.
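For anyone who wants to copy that setup, here's a minimal ZFS sketch under stated assumptions: the pool name tank and the disk /dev/sdX are placeholders, and the fast dedup rewrite Nick mentions only ships in recent OpenZFS releases, so check your version before turning it on.

```bash
# create a pool on a spare disk (destructive; /dev/sdX is a placeholder)
zpool create tank /dev/sdX

# dedicated dataset for the archive with transparent compression
zfs create -o compression=lz4 -o atime=off tank/archivebox

# optionally enable deduplication (fast dedup requires a recent OpenZFS)
zfs set dedup=on tank/archivebox

# check how much space compression and dedup are actually saving
zfs get compressratio tank/archivebox
zpool list -o name,size,alloc,dedupratio tank
```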
Yeah. One way to make that more accessible is I’ve added support for rclone recently. So you can link it up to a Google Drive, or a – like, a lot of people don’t have terabytes of storage at home anymore… And so letting people use their Google Drive as their storage I think is important. And then Google Drive - they’ll still charge you for every file, but they’re doing deduping on their side. Same with AWS, or all of them. I think that’ll get cheap enough over time that it’s not a big issue. I think most people are going to run into losing motivation sooner than they’re going to run into running out of storage.
[48:10] File systems, yeah.
Yeah.
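Nick mentions rclone support above. The exact ArchiveBox integration may differ, but one generic way to back a collection with Google Drive is to mount a remote and point the data directory at it; the remote name gdrive and the paths below are assumptions.

```bash
# one-time interactive setup of a Google Drive remote (name it "gdrive")
rclone config

# mount the remote so the ArchiveBox data directory lives on Google Drive
mkdir -p ~/archivebox-data
rclone mount gdrive:archivebox ~/archivebox-data --vfs-cache-mode writes --daemon

# point ArchiveBox's /data volume at the mounted path
docker run -v ~/archivebox-data:/data -it archivebox/archivebox add 'https://example.com'
```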
How do you get this to go – well, I guess maybe the better question would be “How well used is this?” Like, how encroached is it – how much are people using ArchiveBox, and what would it take to make it more used, more adopted?
Yeah, so I don’t have analytics in the actual product. There’s only a few stats that I keep track of. So there’s 6 million Docker Hub pulls so far; 6 or 7 million if you include both repos.
That’s a lot.
The PyPI installs are sitting at around 70,000 a month, and the Google Chrome extension only has about 2,000 users. A lot of those are automated. People have scripts that auto-update their Docker container, or auto-update their pip packages… But I think it’s in the tens of thousands. Exact numbers, I don’t know. When people open GitHub issues, that’s a pretty strong indicator that they care enough to say something. And there’s thousands of GitHub issues and hundreds of contributors. And a few donations; not enough to make it a sustainable business model, but enough that I can’t ignore it. Lots of attention whenever stuff goes on Hacker News… So I know people care about the issues, and I know that people are using it, but I refuse to add analytics, so… Hard to say.
You’re one of us. You are one of us. So how does it get into your credentialed stuff? Do you have to be using Google Chrome? Is that the extension? Is that what that does? Or is it grabbing cookies out of your cookie jar? How does it do it?
Yeah, it’s constantly evolving. I’m trying to make this as smooth as possible. The golden rule is don’t let people use their normal accounts. This is based on talking to a lot of my industry peers… We just don’t think that the scrubbing tech is there yet to sanitize these archives. And unless people really, really know what they’re doing, which some people do, and they can save that stuff, you don’t know who the audience of an archive is going to be in five or ten years. And so people are going to forget, “Oh, this archive was saved with cookies turned on, which means your whole personal information is probably mirrored in an HTML somewhere.”
So I basically force people to create separate accounts for archiving. If you want to archive Facebook stuff, you make a second fake Facebook account, invite it to all the groups that you want it to have access to. It’s an arduous process, it’s annoying, and I’m being paid by companies to automate it. So that’s how ArchiveBox is a sustainable business right now, is that’s the paid service that I offer to companies. It’s creation of sock puppet accounts. There’s no engagement. I have a hard rule: I don’t allow these accounts to do anything other than view.
But you create these accounts, you log them into all the groups that you want to be able to save stuff from, and then these accounts will archive on your behalf. And that way, if the accounts get burned by an archive being shared or something, it’s fine. They’re not real info, they’re not tied to anyone.
Interesting. So that’s some of the labor you were talking about earlier. This is hard work. It’s not like just download it and click go. You’re going to be doing some stuff here.
Yeah, it’s not too bad… So the recent changes, I’ve made it smoother. There’s a VNC container running in the background, so you can – it’ll open Chrome automatically. You can just go to a new tab. You’ll see a desktop Chrome, you log into all your sites, and then it’ll save those cookies automatically, and then you just close it and you never have to think about it again. It’ll stay logged in. If it kicks you out of some site, you just reopen that VNC window and log back in.
So I’m trying to make it as smooth as possible… I do allow you to import cookies from your existing Chrome. I just strongly don’t recommend it… Unless you’re the only person who’s ever going to look at your archives for the next however many years. Or if the people that you’re sharing the archives with are people that you really trust, or if you’re willing to manually sanitize. And I think most people don’t understand that risk, so I don’t make it too easy.
[51:54] Is this only a single-player game? Is there an archive scenario where it’s a group? Let’s say Jerod and I were like “Man, that was cool. Nick is awesome.” And we start our own archive, essentially. And it’s like, anybody who’s in and around the Changelog Podcast universe - I just had to say that, Jerod - they can join in. Or there’s a mission here. Similar to the way you would have a core team member, or commit rights, you can have this membership, so to speak, to an archive. Is that out there?
Yeah.
Is that part of your plan?
Yeah, that is my plan. That’s the core mission, is actually to serve that group. So ArchiveBox is primarily aimed at organizations, to save what they collectively care about. And so there are users, there are permissions, there’s sharing stuff, there’s multiple logins… And the idea is your org probably has shared ability to access some resources. So your org only has to set up these credentials once for the archiving bot. And then when people submit URLs, it doesn’t archive with the person’s credentials, it archives with the archiving bot’s credentials. And so an org can collectively maintain access to all the resources that they care about, and then the org’s archiving bot will also have access, and will just save any URL that anyone in the org submits. And that’s how the paying customers are using ArchiveBox today. So I work with nonprofits that monitor disinformation campaigns and look for evidence of war crimes on social media. As I was saying before, it’s lawyers who pay for this. They pay for evidence collection, both to catch the social networks breaking their own terms of service and their own rules, to help governments with regulatory issues around how social media is behaving, but also to look for war crimes.
That’s interesting.
So they’re doing this method of shared one collection, and they have teams of researchers that submit URLs to the shared collection. But you can’t reveal who the researchers are, because they’re researching really sensitive content. You can’t burn their identities.
Yeah, it’s like a journalist and their source.
Exactly.
When you got into – when you even first had the spark of this idea, did you think that’s what you would be doing to sustain it and get paid?
Sock puppets.
No, but now that I’m working on it, it’s a surprisingly fun problem, because I get to red team. I love security stuff. And now I’m a red teamer. Literally, my job is to break Captchas and rate limits and login walls, for good causes. I’m anti-disinformation, especially after the recent election. It’s motivating to actually work on what matters right now. I feel like this really matters. And directly working on anti-disinformation and mass social media manipulation is motivating.
For sure. What an interesting job you have. Wow. So Jerod, what are we doing about our ArchiveBox? When do we spin this thing up? Did you already spin up a new Fly machine for this?
I have not tried it yet.
Okay. I’m excited about Docker being the – is that one of the primary ways that folks do spin it up and play with it? I imagine like a Docker compose or Docker file just generally is an easier thing than anything else.
You know, I would think… But the archiving crowd attracts a lot of people who still want to do stuff the old school way, unfortunately.
Which is zip files onto a machine?
Yeah, or apt-install every single dependency manually.
Wow.
Some people really want to do that. But unfortunately, a surprisingly large amount of the user base will not touch Docker, and will only apt-install every single dependency manually. And so I spent the last two months writing my own runtime dependency manager for ArchiveBox. It’s a whole new library called ABXDL, that uses the Python type system to basically have unique… I went a little overboard designing this, but it was pretty fun. Basically ArchiveBox is now pluginized. So people can contribute plugins. It’s really hard for me to maintain the auto login for Facebook and Twitter and Instagram and TikTok and YouTube and Quora, and all of these. So I want a community to come build around little scripts that do things automatically while archiving. And I’m working with other archiving companies to sort of share a common spec for this.
[55:52] But part of what these plugins need to be able to do is access dependencies. So YouTube DL, or Wget, or Curl, or things that the user might not have installed on their system. And so if I’m allowing people to install plugins from an app store ecosystem type deal, it needs to also be able to install random packages at runtime. And so ArchiveBox now has this whole built-in package manager.
I have a rant blog post about how the inevitable progression of building a tool is that everyone eventually bakes a package manager into their tool. Like, once you go far enough in any product evolution, eventually you’re going to have to write your own package manager.
So ABXDL is both a runtime, as well as a CLI tool? Am I reading that right, based on the repo on ArchiveBox on GitHub?
I wouldn’t disagree… It’s closer to like an ORM for package managers.
Gotcha.
It’s just a layer between software and the system – like Ansible, or Pyinfra. In fact, it uses those under the hood. It just gives you nice, clean Python types for different packages and package managers, and it allows you to define in a sort of flat YAML format all of the things that a plugin needs, regardless of whether they come from Brew, or Pip, or npm, or Cargo.
I dig the writing here. You say, “Ever wish you could run yt-dlp, gallery-dl, wget, curl, Puppeteer, etc. all in one command? abx-dl is an all-in-one CLI.” Is that not the same? Is that not the same thing? Is that a different thing?
I’m sorry, I mixed up my own names. ABXDL is ArchiveBox, but simpler. I was referring to abx-pkg, which is [unintelligible 00:57:26.28]
Ah, okay. That’s where the confusion’s at. Okay.
Abx-dl sounds cool, though.
Abx-dl is a simplified ArchiveBox that’s a one-liner.
It’s a one-liner for all the tools you might need. So it’s like, you give it a URL, and it’s going to figure it out.
Rip every piece of content that you possibly can out of this page, by any means necessary, and put it in our folder.
That’s cool.
I like that tool.
Yeah, I like that tool a lot. To clarify the confusion here, abx-pkg is the runtime you’re talking about.
Yeah, correct. Sorry about that.
But you said abx-dl, and so I went up and found your repo, and then [58:01] in a positive way, but now we’re less confused.
But now we’re more excited, because we know two tools, not just one.
That’s right. We’re getting two for one here, okay?
[laughs]
That’s why I like you, Nick. Abx-dl is pretty cool. So what you’re saying then, if I’m reading this right - is this ready for primetime?
No.
Okay, so this is coming soon.
Wait, which tool?
Abx-dl.
Yeah. Abx-pkg is ready. We’ve been using that for months now. Abx-dl I just announced, because it’s this evolution of pluginizing ArchiveBox. Inevitably, that makes it a little bit too complicated for some people, and so abx-dl is stepping in to fill in behind and basically provide a new tool that is way simpler than ArchiveBox, for all the people that really don’t want to spend time with Docker, or setting up services, or logins and all that. They’re just like, “Give me the files now.” Because that’s how ArchiveBox started. Originally, it was like abx-dl, and it evolved so much that now we need a simpler replacement.
Yeah. To put it more simply, you write it well. “Abx-dl is a CLI tool to auto-detect and download everything available from a URL.” So just like you would use - which I use - yt-dlp, I obviously use Wget, I prefer Curl, but either/or. Pick your flavor. So if you’re using these kind of tools, you can potentially, at some point in the future, replace those things if you’re trying to archive with abx-dl.
Yup. It should be a fairly drop-in replacement. It’s got a few of its own flags. You can provide cookies, you can tell it to ignore SSL warnings… It’s got the usual things that you would be able to configure… But I’m aiming for a direct drop-in replacement for Wget or Curl.
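To make that concrete, here is a rough sketch contrasting today’s single-purpose downloaders with the kind of one-liner Nick is describing; the abx-dl invocation is hypothetical, since the tool isn’t released yet, while the yt-dlp and wget lines are ordinary usage:

```bash
# Today: one specialized downloader per content type
yt-dlp 'https://www.youtube.com/watch?v=VIDEO_ID'   # grabs the video (and subtitles, if asked)
wget --mirror 'https://example.com/article'         # grabs the raw HTML and assets

# The abx-dl idea: a single command that auto-detects the content type
# and pulls everything it can from the URL into the current folder.
# (Hypothetical invocation; final flags may differ.)
abx-dl 'https://example.com/article'
```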
I want to confess something here on the show, if you don’t mind…
We always like confessions.
One thing I do like to do sometimes is I run my own Plex box, and I don’t always want to – it’s almost my version of archiving, now that I’m thinking about it out loud… I will take some music that I like from YouTube - and it’s not so I can give it to everybody else and be a distributor; it’s more so I can have my copy, and I’m not spending web resources. I’m spending LAN resources, so to speak.
It’s allowed. That’s legally allowed.
[01:00:11.18] Yeah. And so I use yt-dlp to pull down different things into a WAV file, mostly like coding music and stuff like that, that I’m like “I want to keep going back to this YouTube URL, and have a tab open.” I would rather just have it play in my truck, or play on my phone, or wherever it’s at. So Plexamp is the iOS app, and so I can play that from my Plex, at my home, wherever. And so I use yt-dlp all the time. I mean, all the time - like several times a month, all the time. But enough to be like “This is a useful tool, and this is how I use it.” And occasionally, I’ll pull down a video if I want to archive it forever, but my file system has been the archive. So I think I’m like one step removed from actually becoming an ArchiveBox user.
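For reference, the workflow Adam describes maps roughly onto a one-liner like this (a sketch assuming yt-dlp and ffmpeg are installed; the output folder is just an example of a directory Plex might index):

```bash
# Extract only the audio from a YouTube URL, convert it to WAV,
# and drop it into a folder that Plex already scans as a music library.
yt-dlp -x --audio-format wav \
  -o '/mnt/plex/music/%(title)s.%(ext)s' \
  'https://www.youtube.com/watch?v=VIDEO_ID'
```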
That’s great. That’s how a lot of people start.
That’s how it works, yeah.
You start with the content you care about, and hold on to that, right? Use that as motivation to get more into archiving. Don’t break yourself into having to save everything. Just save the stuff you want to save.
Yeah, I like the idea and premise. I think the thing I want is for it to catch on. And I think organizationally it’s good. That’s where you’re sort of seeing a lot of the movement, so to speak. But I still think there’s opportunity elsewhere, though I think somebody might just get burnt out. I don’t know what would motivate somebody to do it continuously forever if it wasn’t legacy things, like we said earlier, you know?
Isn’t it just a cron job after you got it all set up? I mean, what do you have to keep doing?
Yeah, so part of it is on me to make this easier. My tool right now is not so easy to use that you can just set it up and it runs in the background forever. I’m trying to get there… And once it is at that point, then I think it’ll be less important to select for people who are really motivated to archive. But right now, because there are still hurdles to curating and managing all this storage, and passing hard drives around, and deciding who gets to look at it, and scrubbing stuff out, I am selecting on purpose more for people who are willing to take on this workload.
There are other tools… Webrecorder is amazing. They have a new cloud offering that lets you do stuff. They’re the team that I’m collaborating with on this Behaviors spec, as we’re calling it, to share these plugins between different tools. There’s SingleFile… There’s lots of browser extensions that make it fairly easy to save stuff passively, as you’re browsing. I think those are great options for people that are looking for sort of easy, passive archiving. But yeah, a lot of the hard decisions don’t come until you’re six months into archiving and now you have a few terabytes that you need to move around between places.
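On the “just a cron job” point, the hands-off version of this could be as small as a scheduled job that feeds queued URLs into a pip-installed ArchiveBox; the paths and the URL list file below are hypothetical, and exact behavior may vary by release:

```bash
# crontab entry: every night at 3am, archive whatever URLs have been queued up.
# archivebox add accepts URLs on stdin; the collection lives in /srv/archive.
0 3 * * * cd /srv/archive && archivebox add < /home/adam/urls-to-archive.txt
```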
How big is your personal archive?
I have – I guess there’s a fuzzy line. So I have many personal archives for different things. I tend to start a new collection for a new campaign, I guess I’ll call it. A lot of different tools call these campaigns. So if I care about my YouTube favorites, for example, that’s going to be a hefty bucket of stuff. So I’ll start a dedicated collection just for that.
That’s probably the biggest one. It’s a few terabytes, it’s not insane… But then I have a bunch of these collections. And so altogether, I probably have about 20 terabytes saved in a little ZFS thing over there on the shelf… I’m a big bare metal fan. I tend to not pay for lots of cloud hosting. It’s mirrored. I have a 3-2-1 backup, but… I think that all in all I have around 20 terabytes.
Break: [01:03:29.29]
As you’re describing these YouTube favorites, I have many playlists on many social media accounts. And I would say the one I would probably almost covet, like love it to death almost, is my YouTube playlists. They’re all private, obviously. Only I can see them. But now I’m thinking, like you said that, I feel like if I can archive my playlists, then I know – because there’s times I go back to them and it says “This video is not here anymore because it was removed.” And I’m like – it was useful to me at one point. I’m not trying to get somebody politically, for any reason. So I know it’s not that kind of content. It’s just like, for some reason, somebody got upset and it’s not available to the public anymore. And my ability to archive that… Now you’re making me – see, you’re getting me. You’re getting me.
Yeah, definitely save that stuff. YouTube, I think, is a great starting point, because –
For sure.
…it’s also, interestingly enough, text copyright, audio copyright, video copyright, music copyright… They’re all very different fields legally. There’s not that much overlap. Like, the way those cases are handled, what the precedent is in the courts is very, very different. You have a Supreme Court judge to thank for the ability to save video locally, who had a TiVo, and was like “I don’t understand why I can’t just TiVo my stuff at home. Like, who am I hurting by doing that?” And so you have a fair use exemption to basically TiVo your video content at home. Now, of course, platforms will argue you’re violating their terms of service by cloning that… But realistically, the precedent is set. You can save video that you care about at home, and it’s probably going to be OK as long as you’re not charging people to access it, or depriving the original creator, spamming it in their public channels saying “Hey, I have a free version. Come over here.”
It’s an interesting problem in the fact that you have this ArchiveBox idea. And the things that you do to do the archiving is you, as an individual or an organization, you identify something worth archiving. So that’s step one, right? Step two is having the necessary software technology, whether it’s a plug-in, or a CLI tool, or something that goes out there and gets the thing and says “Okay, I’ve got the thing.” And I assume as part of the ABXDL at some point you’ll have some sort of config that says “This is where you put it.” And that’s the ArchiveBox that is the file system, that’s ZFS-backed, praying everybody follows your rule, or at least your desires… And then you have this ability – this viewer, so to speak; the hallways and the rooms of the museum. Those are the different – am I missing anything else that’s in the sphere of how you would interact with, or curate, or view this museum/archive?
No, you basically perfectly identified it. There’s different words used for those different areas. The viewer is often called the replayer, because you’re replaying a recording. But yeah, that’s basically it.
Okay. So the ArchiveBox as it is now… If I went out there today and spun up the Docker – because I’m that kind of person; I would spin up the Docker version of it. What is that? That’s not the DL thing, right? I mean, it is, it’s baked into it as it is, but this ABXDL is a secondary CLI tool that enhances or adds to what the ArchiveBox will eventually do, or does now currently, right?
Yeah. So to dive into the nitty-gritty for like a couple minutes… So ArchiveBox internally is a Django application. It exposes a command line interface that is the same package as the Django web app. Like, it’s an all in one pip package.
So you can pip install ArchiveBox without any of the Docker stuff, and you immediately get the CLI, you get a Python API… It uses SQLite and it just saves to whatever current folder you’re in. It’ll create a collection, it’ll create a SQLite database on disk, it’ll create folders for all the archives and logs and all that.
[01:10:11.21] So you don’t need a continuously running container at all. If you just wanna basically replace YouTube DL, you can pip install ArchiveBox, ArchiveBox add HTTPS, whatever, and it’ll just spin all that up locally and archive that one URL, and then exit. And then if you run another command in the same directory, it’ll add the next URL to the same collection. You import a thousand from Google Chrome, it’ll run them all right there and exit. So you can use it as a CLI tool, you can use it as a long-running app, you can use it as a Docker container. All of these are actually just one Django package underneath.
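For anyone wanting to try the flow Nick just described, a rough sketch of the CLI path and its Docker equivalent might look like this (treat it as an approximation; exact commands and options can change between releases):

```bash
# One-off CLI usage: install, create a collection in the current folder, archive a URL
pip install archivebox
mkdir my-archive && cd my-archive
archivebox init                       # sets up the SQLite index and folder layout
archivebox add 'https://example.com'  # archives one URL, then exits

# Long-running usage: the same Django package served as a web UI
archivebox server 0.0.0.0:8000

# Docker equivalent, mounting the current folder as the collection
docker run -v "$PWD":/data -it archivebox/archivebox init
docker run -v "$PWD":/data -it archivebox/archivebox add 'https://example.com'
```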
And that’s like the first principles of this, because then you’ve got the challenge where you’ve got orgs, you wanna view it and you wanna enjoy it… Well, you’re not in that setting whatsoever. You’re probably on the web, you’re probably in some sort of web application… And so your viewer - would you call them a playback person? What was the terminology for it?
Replayer.
Replayer, yeah.
Yeah, replayer. So if you got a replayer out there, they’re probably on the web. That’s a whole different problem set, right?
Yeah, so the CLI tool – because everything just saves raw, straight up to the file system as raw files, you don’t actually ever have to see the ArchiveBox UI at all. You don’t have to use the replayer, you don’t have to use the admin interface, you don’t have to use anything. You can just use the file system. Or some people never see the file system at all. They’re running it on fly.io, and it’s a hosted file system and they only see the web UI. And so yeah, fundamentally I’m serving like two different groups.
I personally use both heavily. So I’m running my own web UI, but I also very often go into the file system, because I want to play with a local LLM, and I want to train it on all my YouTube videos, or I want to train it on all the articles that I read last month, or stuff like that.
Right. The reason why I’m asking you how to experience it is because I’m literally thinking about “Okay, if I started to do this, one job is to archive. Got it. Okay, cool. It’s on my file system.” And the next job is later on I want to experience it or replay it, and be the eventual consumer I will be of my YouTube playlist, for example. And I’ll admit, it’s mostly cooking videos…
All the confessions…
It’s mostly cooking videos. Right now I’m trying to perfect my Chicken Parmigiana recipe. I am trying to nail it, from the sauce, the original, tomatoes to use, the garlic… All the process. Which olive oil… I’m trying to perfect it. And so I’ve got a collection of videos. And so future Adam, once I’ve perfected it, or my kids, even a year from now, they would want to view this stuff. But the here and now is the useful. I think if you can make this archiving like useful today to me, so that if it’s useful for me to archive, and then also experience my archive, means that I’ll curate it better over time. Because it’s today useful, not tomorrow useful, or some fictitious future that may or may not even come to fruition. That’s what I’m thinking about… Because I’m already doing that in a way with my music, but I’m not using it in the way – I’m doing it in a way that is today useful. And today useful is on Plex, and experiencing it as music, because that’s what it is. Plex doesn’t really serve me to serve my YouTube playlist, but this Django app or this web interface could be more full featured at some point, so that you invite people to archive and experience today, so that it has future generation payoffs.
Yeah, 100%. You’re touching on a really key part of why archiving is hard – it’s hard for it to spread virally, because you need to convince people that it’s useful today, when most people only realize archiving is important when it’s too late, once they’re already missing something. So making it really useful today is super-important to me, and I think another big part of that is search - making sure search is really good, making sure you can quickly find things… I go to great lengths to get the subtitles for every video and add them to the full-text search, so you can search by the content of a video. Extracting text by any means necessary is super-important. Making sure that the search engine is fast, works really well… We use Sonic, which is a Rust-based, all-in-one-binary Elasticsearch replacement. It’s awesome.
[01:14:17.28] There’s other ways that we can make it really useful now, too. We can try and do - not everyone wants this, but some people really want it - AI-based summarization or categorization after the fact. So let’s say you have 1,000 URLs saved. I don’t want to have to go in and click through each one to find the article that I care about. What if they all also had a column that was a two-sentence summary of the article, and the author and the byline and the date it was published extracted out? So I call these extractors, and ArchiveBox is designed to be able to add many extractors over time. I envision it being like a Home Assistant type ecosystem, or Nextcloud, or WordPress ecosystem where you have tons of plugins for all the extractors of the things that you care about, and the extractors come with their own replayers.
So if you have an extractor that specializes in getting YouTube videos, it will also provide a nice replayer UI to look at your YouTube videos. If you have an extractor that gets article text out of the page, it should also provide a nice article reading UI. If you have an extractor that gets cooking recipes, but it just gets the recipe part, then you also need a replayer that shows cooking recipes nicely. And so this is how I imagine the ecosystem evolving over time.
Yeah. It’s almost like an internet on top of the internet, powered by, I would say, probably like importance to somebody, you know? It’s almost like its own index, too. That’s why I think there’s a lot of – the possibility, the potential here is just tremendous if you can put it out there in the right way. I’m not saying the way you’re doing it is wrong, because you’re iterating, right? You’re trying to get to this eventual long-term really useful thing. Because if I’m an archiver and I do things well and it’s useful to me, and I can expose that stuff in some way, the things that I think are important to me because of who I am or what I do or the way I think, that adds layers of importance to the thing itself. It’s not just about the actual content. Archiving the content is one important aspect, but it’s also what was archived; not what it is in the literal files, or the content. It’s like “What was it? Who and why?” Those are things that I think form a sentiment layer that’s just not out there, really. And I think if you can find a way to expose that, then you sort of get this aspect of invitation into it, either as a consumer or a replayer, as you’ve said, or somebody who’s actually an archiver, and joins in.
Another really interesting idea that other tools have played with is preserving the context in which a page was discovered. Like, “Oh, I clicked these three links in a row from this Google search, and that’s how I’ve found this thing that I then decided to save.” Saving that whole research chain of the URLs that you’ve found is maybe interesting context, and that makes it more valuable.
Possibly. Possibly. It’s like session replay, in a way, for a scenario. I can see how that adds context, but it’s also complexity.
Yeah.
I don’t personally see value in that necessarily, except for when I would see value in it, of course… It’s like “How did I actually find this website? Oh, that’s right… I was watching this, which I watched that, and that led me to this.” And that’s why I really don’t mind YouTube’s algorithm, honestly, because it’s interesting how it knows what I want to check out in the future. And my whole timeline is just full of Chicken Parmigiana. It’s endless.
It’s pretty easy for you then, I guess.
It’s easy. Yeah, it’s easy.
Yeah. YouTube doesn’t have me – it doesn’t have me figured out like it does you, Adam. It can just show you Chicken Parmigiana, but I’m constantly mad at it.
Is that right? That’s a shame.
Yeah, I get angry at it all the time. Like, I don’t want to watch this. And I subscribed to somebody six months ago, and you haven’t shown me one of their videos in three months, and I forgot they existed.
You know, I’m there, too. I’m with you on the same anger point.
[01:17:54.09] You should check out the Tweaks for YouTube extension. It’s totally changed my relationship with YouTube. It lets you change the homepage algorithm, it lets you make videos faster than 2X, it lets you…
What’s SideQuest right now? What is this?
Faster than 2X? That’s blasphemy, man. Come on. People create those videos for you to watch them. [laughs]
That’s right. 1X for life…
Not all videos, only the ones that are very slow…
I’m fine with faster than 1X, but faster than 2X? Holy cow.
It’s not the only reason. They also hide a lot of clutter in the UI… It’s basically like infinite configuration options for YouTube.

I love that idea. I will check it out. My problem with that is I experience YouTube in so many different contexts that aren’t my computer. My phone, my TV, my computer, other people’s things…
Yeah, for sure.
Anyways, off on a YouTube rant. You were going to say something and I cut you off, Nick.
Well, I think, back to the earlier concept of this index sort of being worth sharing as a collection, or sharing the “what” of the archive - I think that’s a really important point. And the replayers… One thing to think about is if you take this to its logical extreme and everyone archives enough content that they care about that the internet is broadly copied multiple times over, what’s the point of hosting anymore? What’s the point of hosting stuff on your own? Once you publish it and enough people have archived it, just stop paying for hosting.
People already use archive.org like that today. And it’s kind of an interesting thought experiment to think about if this becomes the content distribution mechanism for the internet, what happens. But I also don’t think that will happen. I think that in any social system you have two ways to share things. You can share by reference, or you can share by copy. The internet right now is usually share by reference. You share a URL to something and it’s referring to the original content hosted by the creator. SMS is share by copy. When you text someone, they have a copy of the SMS. If you delete it off your phone, it’s not deleting it off of their phone. Email is shared by copy, BitTorrent is shared by copy… Discord is shared by reference. You delete a Discord server, everything on it is gone. Even though it looks like messaging, it’s not shared by copy.
So it’s kind of interesting to think about… I think most share by copy systems broadly will not succeed in taking over as being the content distribution mechanisms for the world. Whether that’s IPFS, whether that’s BitTorrent, whether that’s – anything that’s shared by copy, I don’t think it’s going to become the de facto way we share content, simply because it deprives the original creators of the power to monetize or delete their content. You can’t moderate, you can’t get rid of CSAM once it’s out there, you can’t get rid of misinformation, you can’t get rid of libel… Artists, musicians, creators don’t necessarily want to publish on a platform where they lose control the moment they share something the first time… It’s immediately copied millions of times, they can’t ever retract it, or ask people to pay for it…
So I think archiving is fundamentally limited in that, societally, at the human scale, people don’t want to shift to losing control over their content authorship. And so people striving to make archiving replace all other means of sharing content are, I think, a little misguided. It actually helps to hone the focus a little more, and makes it easier to work on this problem, to not try to replace the entire internet - because that’s where it goes quickly if you don’t think it through.
Well, I’m excited. I think Adam’s probably already got his Docker commands queued up… I think he’s – I think you got him, Nick. I’m a little bit more reserved in my – I’ll wait till Adam sells me. He’s going to sell me some –
I’m already doing it. So it’s like a better version of it, I think. It might help me organize myself.
Yeah. This sounds like something that you’re working too hard, and actually it’s going to help you work less hard.
Yeah.
So ArchiveBox.org… I did see that you went ahead and took the –
.io. I don’t have the .org.
My bad.
ArchiveBox.io. Oh gosh, you’re part of that crew.
Oh, yeah… I have some regrets, but .com is too expensive, and .org… I wasn’t a nonprofit when I first started, so I didn’t…
[01:22:01.21] I was going to bring up the nonprofit. So you actually went ahead and went through the time and effort to get that done. So that’s a step.
I’m not my own nonprofit. I’m a fiscally-sponsored project through the excellent Hack Club Bank, yeah.
I see. So you took a shortcut.
Oh, very cool.
So did that provide you some leniency? Because you mentioned you’re trying to decide - should you go nonprofit, should you go for-profit? Do you have leniency because you didn’t – it’s like a proxy that you can change later? How does that work?
Yeah. So no matter what, I’m going to have to be both. There has to be a nonprofit component, there has to be a for-profit component. It’s going to be a sort of paired corporate structure relationship, similar to any company that does massive content re-hosting, like Archive.org, like OpenAI, like Mozilla, like MAPS… Basically, you have a nonprofit and you have LLCs underneath it that do anything relating to money. The content is only ever hosted by the nonprofit, which is not earning revenue from it, but you can sell software that people use, that contributes to that pool of content.
And so the financial motivation to – basically, the financial motivations are kept separate. You’re not incentivized to profit off of the copyrighted material, which I think is important… Because as this eventually grows beyond just me, I don’t want to have sort of corporate structuring that is pushing it in the direction of destroying copyright.
Gotcha. Anything else? Any stone we have left unturned?
I didn’t ask a lot of questions of you guys. I would love to hear more about your own personal backgrounds. Have you ever inherited a big legacy collection of stuff from your parents or grandparents? Do you have any sort of personal interests?
Just photos. Nothing digital. We’re the first generation, I would say probably, for Jerod and me, in digital. We have our parents in there, but by and large, for me at least, all my parents are dead, so…
Do you have kids now?
I do have kids, yeah.
Nice. What would you love to see them enjoy in 30 years? If they could only save let’s say a couple hundred pieces of your digital life…
Hm… His Chicken Parmigiana?
Yeah. Well, that they’ll always have fond memories of.
You know, I would probably say photos is probably the easiest one.
And videos, right? Those kind of go in the same category. Like personal videos…
Yeah, definitely videos. I’d kind of put them in the same lump. The Photos app… Everything in the Photos app. That’s interesting. I think it’s mostly memories, less artifacts. I don’t know, I haven’t really thought about that, honestly. I do think that eventually my copied versions of my playlists, that really feature Chicken Parmigiana, or the best steak ever, or the most amazing smash burger of your entire life… Those three things in particular are staples in our household.
You’re going to have to send me that last one. I’m a huge smash burger fan in the last few months.
Well, you have to come to my house, because that’s the best one. Sorry about that. And you’re invited. I’ll gladly make you a smash burger. I would say those kind of things I imagine my kids will want to take on… Because we make homemade marshmallows, we do interesting things for the holidays… And just generally, we like to make our own food and we really appreciate that process. I’m trying to get my kids to think about that kind of stuff more so, and what goes into the food… Even so far as like making your own sauces. [unintelligible 01:25:25.03] If I can buy that sauce for whatever, and I can buy the actual ingredients for one quarter of the price, and I enjoy it better, and I know what went in it, that’s to me an A+ for all the things.
So yeah, I would say those are the things. Things that point to those principles. Not so much the things themselves. I think this YouTube playlist with my buddy Frank Proto might be – I say my buddy because I actually reached out to this chef literally recently. This is really Plus Plus content, but either way, I’ll tell you. So I call him a friend because he’s a future friend. His name is Frank Proto, he’s a chef… And I reached out to him on Instagram, I’m like “Hey, I’m a big fan. I’ve made your pancakes - pancakes from scratch - I’ve made your spaghetti, I’ve made this and that… Big fan. How hard is it to book you for a podcast?” His only response was “Not hard at all.” So long story short, a future Changelog podcast will feature a chef!
[01:26:23.16] Amazing.
Yeah, Chef Frank Proto. Check him out. Proto Cooks I believe is his channel, but he does some cool stuff. Anything he makes, I will make. Frank’s amazing. So I think those things are things that I appreciate, and I know my kids appreciate them because they have the second order effects of me making them for them, and so they’ll eventually appreciate where I’ve gathered my knowledge from. So I will eventually create my own recipe, that is a culmination of 17 recipes. You know, a trick from here, a tactic from that, or these particular tomatoes from that person’s recipe, or where they got them at. Or if I want to spice it up, this is how I do it. I’ve got the simple version, and the complex version. And it’s all cooking related, but I think that’s probably the easiest answer I can give you right now, which is something related to cooking.
Cooking is actually a shockingly popular answer to that question. A lot of people, myself included increasingly, as I’m starting the beginnings of a family…
I’m winning you over, right? You’re wanting to take on my – we can share a box, so to speak.

Yeah, my wife would love basically photos, some news, and a lot of cooking recipes preserved. And also some personal work portfolio is important to journalists, especially –
I think a lot of people that do writing for a living see a lot of their content sort of disappear when the publishers go bankrupt… So that’s a common answer I get. Yeah, everyone has a really unique and interesting answer usually to that question of what do they want to save.
And then the alternate version, if you don’t mind me asking one more follow up, is - now take away the 100 URL requirement, but now pretend you can’t save any individual piece of content, but your kids will get a model trained on everything that you save, with no limit. You could feed this model 20 terabytes of training data. What do you limit it to now? What do you want the model to have, and what don’t you?
That’s TMI. [laughter]
No worries…
Yeah, I’m also gonna – I’ll pass on that one, not because it’s TMI, although that’s hilarious; it’s because I would have to think really hard about that.
More food for thought for people to think about, because I think it sort of gets the gears turning on perspective.
Yeah, it’s an interesting question. I like that question.
I like that idea, though. I like the premise of the question, not so much the answer I’ll give. I like the idea of self – it’s almost like knowledge for the future, and this LLM is an encapsulation of some version of… The obvious answer is like, you know, just copy my psyche; copy my entire who I am. Go full on Ready Player One, or actually Ready Player Two, with an ONI headset kind of thing, and a replay of who I am. That’s the obvious best case. But that’s so weird, you know? Such weird implications.
But also, the victors write the history. You get a chance to rewrite your own history book. You can cut out all the bad parts, you can… Yeah.
Let me give a different answer. I’ve started to think about it more, and I realize this is a false dichotomy. There’s no reason that it can’t be both… But my answer is “Spend way more time with your kids and talk to them. About life, about what you think, about what you believe, and why you do what you do. Just spend a whole bunch of time with them, and you don’t have to give them a model. They’ll already have it.”
Well, it might not be for your kids, it might be for your kids’ kids’ kids.
[01:29:50.01] Well, people come and go, you know? We don’t have to like sustain our psyches into the future.
Well, that really cuts deep to the heart of archiving. I also believe this. I think that death is an important part of life. It’s sort of the recycling engine that really tests “Is this idea worth propagating or not?” Because if someone doesn’t propagate it, then maybe it wasn’t worth propagating. And that’s sort of where I want people to go when they think about these ideas. It maybe seems weird coming from the archiving guy to be like “Oh, you know, don’t archive so much…”
“Embrace mortality…” [laughter]
Yeah. But I honestly believe this, and I think that there’s some beauty in ephemerality, and that’s why I want archiving to be really intentional. Because you are depriving the original creator of that decision to let death recycle their ideas by dragging their ideas, kicking and screaming, into the next generation. But we have to do it; there’s a balance.
It is a balance.
It’s the only thing that makes life exciting, because what’s old is new again to so many people, because there’s nobody to propagate it forever.
Right.
You know, there is mortality, not immortality, and so I can have this idea which I thought was mine… But it’s not. It’s just recycled.
Nope. Somebody else had it.
It’s recycled. And it’s only new to me because it’s new to me.
A deep note to end on, perhaps…
Yeah.
It was fun. ArchiveBox.io, to clarify… Check it out. Man, if you’re jiving on this, we do have a Zulip. I’m sure there is an episode topic… Is that what they call it? Not channel. It’s a topic. Hop in there, say hello. Nick, I see that you have a Zulip for ArchiveBox, so if you want to dig deeper in the community, go hang out there in Nick’s Zulip for ArchiveBox, but also come in ours if you’re not there already. Changelog.com/community. And comment on this episode, and say what’s up and tell us what you’re archiving, or what you thought about this episode, or say hi to Nick if he’s there. All that good stuff. Good times, Nick. Thank you.
Yes. Thanks, Nick.
Thank you so much for having me. I’ll join this Zulip right away. I didn’t realize you all had a Zulip.
Heck yeah, man.
Zulip for life!
Outro: [01:31:56.19]
Not that I’m suggesting a rename, but because the .org is so expensive… A name adjacent - and it might be a terrible play on words, but a good play on words - is instead of ArchiveBox, what if it was ArchiveMachine? And then archivemachine.org is available right now for 10 bucks. Just saying. So you haven’t been entrenched enough that a name change would be impossible; it is available, and you are pursuing a nonprofit future kind of thing, and there’s also the Wayback Machine. So it’s sort of adjacent to what people already might know. And so this is the ArchiveMachine that may power the Wayback Machine of your life, kind of idea… And the .org is available literally right now.
Cool. Yeah, ArchiveBox actually was a suggestion from a community member, Filippo Valsorda, who has been a long-time supporter and an interesting cryptography guy.
Oh, we know Filippo.
Yeah, he’s great.
He is awesome.
He is the longest term supporter of ArchiveBox; from the very beginning he has been reliably donating 20 bucks a month.
Nice.
And I know him from Recurse Center in New York. But yeah, I think either he or someone right after him in the same conversation thread, we were brainstorming name ideas…
It’s funny that you offer that, Adam - replacing the “box” with “machine” - because as you were describing some of the, what I would say, brand hurdles of us understanding the current value of something like this, I thought maybe the word “archive” was the one…