Changelog Interviews – Episode #475

Making the ZFS file system

with Matt Ahrens, co-founder of the ZFS project


This week Matt Ahrens joins Adam to talk about ZFS. Matt co-founded the ZFS project at Sun Microsystems in 2001. And 20 years later Adam picked up ZFS for use in his home lab and loved it. So, he reached out to Matt and invited him on the show. They cover the origins of the file system, its journey from proprietary to open source, architecture choices like copy-on-write, the ins and outs of creating and managing ZFS, RAID-Z and RAID-Z expansion, and Matt even shares plans for ZFS in the cloud with ZFS object store.


Sponsors

Square – Develop on the platform that sellers trust. There is a massive opportunity for developers to support Square sellers by building apps for today’s business needs. Learn more at developer.squareup.com to dive into the docs, APIs, SDKs and to create your Square Developer account — tell them Changelog sent you.

InfluxData – InfluxDB empowers developers to build IoT, analytics, and monitoring software. It’s purpose-built to handle massive volumes and countless sources of time-stamped data produced by sensors, applications, and infrastructure. Learn about the wide range of use cases of InfluxDB at influxdata.com/changelog

MongoDB – MongoDB Atlas is an integrated suite of cloud database and services. Try Atlas today. They have a FREE forever tier, so you can prove to yourself and to your team that they have everything you need. Check it out today at mongodb.com/changelog

Retool – Retool is a low-code platform built specifically for developers that makes it fast and easy to build internal tools. Instead of building internal tools from scratch, the world’s best teams, from startups to Fortune 500s, are using Retool to power their internal apps. Learn more and try it for free at retool.com/changelog

Notes & Links


Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

Matt, I’m a big fan of your work on ZFS, and I’m so glad to have you here at The Changelog, because I’m a newcomer to ZFS. So as you know, because you’re a co-creator of it, it’s been around for a very long time. It was created in 2001, and my first use of it was in a home lab production scenario powering my Plex server, basically, in the year of 2021. So I’m 20 years behind the adoption curve of ZFS. But when I found out about it, I loved the file system, and I was like, “You know what? We’ve gotta get Matt on the show.” So welcome to the show.

Thanks. Happy to be here.

Do you do many podcasts? Is this a thing you do? I know you give a lot of talks and you’re in front of people a lot around ZFS and the community and whatnot, but what do you do around podcasts?

Not really. I’m not really plugged into the tech podcast scene. I did one or two many years ago.

There’s like Two and a Half Admins, I believe, out there. I think even one of the writers around ZFS has been on there, and maybe even a contributor to ZFS, I’m not sure. I’m new to the ZFS scene, so…

[04:02] Yeah. I know the hosts of that podcast, and they’ll probably hit me up at some point.

Yeah, yeah. You should go on the show. I like it. I’ve listened to a few of them. But yeah, I wanted to get you on to talk about ZFS, because it’s such a cool file system. It’s got some interesting roots in open source - it’s obviously listed under an OSI-approved license - but it’s got some drama behind the scenes, and I figured, “Who better to go through the backstory of its origination and the problem set and its history, up to the present, than you, as a co-creator of it?” Back in 2001, it was a file system designed by you and Jeff Bonwick for the Solaris operating system at Sun Microsystems, who were eventually acquired by Oracle. And I just wanted to go into that history, whatever you want to share around that process… Like the ZFS origination, what was the problem set, what was Solaris trying to solve at the time, what brought you to Sun at the time… Wherever you want to begin. So open it up.

Sure. I joined Sun and the team just after college. So it was my first job out of college.

And I was lucky enough to be recruited by Jeff Bonwick to join him and work on a new file system. So at the time that I joined, they pitched it to me as like, “Come join and we’re going to work on a new file system.” I just thought that was the coolest thing I’d ever heard of. That was the motivation for me. I showed up, and it really was like nothing had been written yet.

So zero. That’s cool.

Yeah, zero. So Jeff and I started from, “What should this be doing?” And I was obviously a very junior software engineer at that point in time, so a lot of the ideas of what it should be able to do and where it should fit in the industry came from Jeff. But really, we wanted to make a replacement– originally, we were just thinking of it as, “Hey, UFS is kind of hard to use.” UFS was like Sun’s file system before ZFS. “UFS is hard to use. How can we make this easier to use?” And we looked around at how people were using it, mostly in the enterprise context. Most of them were using it with volume managers - either Sun’s volume manager or the Veritas volume manager, which were very popular at the time… And the volume managers were hard to administer, hard to set up, and then they had all these weird failure modes that some of the in-house system admins at Sun had experienced.

Sun had a server that was called Jurassic. It was at the time a giant server that the kernel development engineers ran themselves. So people took turns being primarily responsible for that. And it used UFS, and it used the Solaris volume manager, and I think that there were some horror stories that pre-dated my arrival about disks dying and being resilvered incorrectly by the volume manager, and maybe mistakes being made due to the difficulty of understanding what was really going on there, even from people who were very experienced with software and computers. So one of the taglines that we created after the fact, was that the goal of ZFS was to end the suffering of administering storage hardware. And I think that that’s pretty accurate. Yeah.

Yeah. It’s a very painful process.

It can be very painful, and I think that ZFS has succeeded, in large degree, at addressing the problems that we saw 20 years ago. So that’s the very high level of what we were trying to do. I’ll point out that we were not setting out– the high-level was not to create a product, it wasn’t to make the fastest software. It really was to address the pain points of the difficulty of administering. And I think that those goals or lack of goals kind of have a long shadow, right?

[08:06] I’ll get into some more of the specifics of what those goals meant back in the day, but you look at even now, 20 years later, ZFS - it does perform well, but when people do benchmarks against other file systems, a lot of times ZFS performs better than them, and sometimes it doesn’t perform as well… I think that the people behind ZFS - we don’t really sweat that much. I don’t see people being like, “Oh, we’ve got to beat them in this benchmark. What’s going on?” I think that the thought is more “On the whole, ZFS is very useful”, and performance is part of that utility, but snapshots are part of that utility. Replication is part of that utility, checksums, compression. All of the different things in ZFS that work well together and are easy to use together, and hopefully easy to understand what’s going on - that’s what brings a lot of the value compared to other technologies and products.

Back when we created it, and today, ZFS is not a product. There are products based on OpenZFS doing all kinds of different things. But OpenZFS is just an open source project, and we’re working on creating that fundamental technology, and making that easy to use for system administrators, and also easy to integrate into systems and products.

I think it’s interesting, this history of it, because I can imagine as a junior developer coming out of college with a blank screen, essentially, with Jeff - one, you probably grew up a lot in terms of being a software developer, and even a human being, right? I mean, your whole career has been spent, essentially, on what is ZFS and now OpenZFS, the project. I think that’s just interesting how you can attack a problem set way back in the day, in that exact scenario - come out of college, junior engineer, junior developer… First real job, right? It was your first real job as a programmer, and now you’re still doing it. It sheds some light on, I guess, the interesting bits around starting. Sometimes you never know where you’re going to end up. Where you might end up is this question mark, and it’s like, well, it could be a dead project, or it could be something that people really get value from over 20 years or more.

Yeah. I think that I was very fortunate and lucky, both to work on something that turned out to be so successful - it’s definitely more successful than our wildest dreams of 20 years ago - and also very fortunate to have the opportunity to work on something that’s brand new, even if it wasn’t successful. Creating cool technology is always fun, and it’s a great experience. And then to be able to do that with a great mentor, somebody that had more experience and was willing to do a lot of the hard work that was probably invisible to me at the time of making the project exist, and making the space for me and other developers to write the code while there’s a lot of other things going on within the company… People want different things out of it, organizational things, which thankfully I didn’t have to get too involved in at the time, but I know Jeff put a lot of work into that, as well as the work that he did on designing and implementing it back in the day.

So when somebody asks you, “What is ZFS?”, how do you describe it? What do you say it is?

It depends on who’s asking, I think, or what level I think that they’re going to be able to understand it… Because everybody understands things in the context that they have.

[11:51] Let’s say an everyday software developer that doesn’t know much about file systems. They know they exist. Sure, they’re on their computer, they use them, but they’re just an everyday developer. They’re not touching file systems too often.

So for developers - first of all, hopefully, you kind of understand what a file system is and what its purpose is in the most basic sense of storing files and data on hard discs. ZFS - our tagline from back in the day is that it combines the functionality of a file system and a volume manager into one integrated solution, and it brings enterprise-level storage technology to the masses. So those are technologies like snapshots and compression and replication… Those don’t really exist or they’re very primordial in more traditional file systems. And so ZFS lets you get a lot more out of your storage system and lets you build really powerful storage systems with just a bunch of discs or SSDs, or combinations of those, without expensive technology, without expensive enterprise products.

Right. Who’s using ZFS? I mean, I mentioned I’m a homelabber, pretty much. I’m a homelabber user, at least. And I would call my scenario enterprise homelab, because I don’t want to go down, necessarily. If the data died - it’s my Plex server, so it’s movies, right? I don’t want to rip all those movies again; it’s a lot. I think I might have like 10 or 14 terabytes of movies, maybe more than that even. I’m not even sure. 4k, 1080p. But that’s my use case of it. But I imagine it’s a lot of homelab users out there, there’s a lot of enterprise users out there… You mentioned it’s for the masses, so I’m the masses. I’m the user of that. You’re employed by Delphix, you get paid to do this daily… We talked about Sun from back in the day, acquired by Oracle… This has been a career for you, so you’ve obviously done some cool stuff with it, but where is it being used at the highest level and at the lowest level, like, say, a homelabber like me?

Yeah, it runs the whole spectrum. And one of the great but also challenging things about open source projects is that we don’t necessarily know where it’s being used, right? People can pick up the code and do whatever they want with it. We don’t have a product and a list of customers and numbers or things like that. But I can tell you about some examples. Obviously, there’s a lot of folks like you who are using it at home or in very small businesses. The majority of people using, touching, running ZFS commands is probably those kinds of users, because there’s so many of them. The amount of data or the demands on the performance of the systems might not be the highest. And I think that those types of users are the ones that tend to be underserved historically by enterprise-focused open source projects, because most of the work is done by people who are paid to do it, and they’re paid to make it work in some higher-end type of deployment.

Right. Some sort of paid scenario, some sort of– even though it’s open source, some sort of product that uses the open source to create a cloud product, or some sort of serverless or service, essentially.

Yeah. So if we go from there to the very highest end, there are folks like Lawrence Livermore National Labs, which is a US government research agency, and they run some of the biggest supercomputers in the world. They actually originally ported ZFS from Solaris to Linux to be able to use on these enormous supercomputers. So Brian Behlendorf is the one who started that quite a few years ago, ten plus years ago, I think. And they’ve been doing a lot of the work to maintain ZFS on Linux, and they’re using it in their huge supercomputers. I don’t have the numbers handy. They probably are available, but it’s like petabytes and petabytes and petabytes; huge, huge things that take up warehouses full. And I think that because of their leadership, a lot of other supercomputer-type applications have picked this up as well.

[16:11] Cray, HPE, Intel have all done work on putting ZFS into use in supercomputer-type applications.

One of the interesting things about those applications is that– I didn’t know this… When I learned about supercomputers back in the day, it was like, oh, it’s all just like you’re running a bunch of numbers and writing the numbers out. So presumably, it’s just like giant files. You read some big files, you write some big files; maybe you probably care about throughput. And that’s the traditional space of things like Lustre. Lustre is a distributed file system that can run on top of ZFS, so it has a pretty tight integration with ZFS. They advertise things like – basically, if you have enough servers and enough clients, then you can get your full network switch throughput of however many terabits per second or whatever, because it’s fully distributed. But a lot of those workloads – HPC is not just these big file streaming stuff. They do a lot of small file creations, because a lot of these workloads are written by folks who are not file system engineers; they’re just trying to solve their problem, and it doesn’t necessarily map onto all giant files all the time. And so you see a lot of various workloads.

I did some consulting for Intel several years ago now, where they were trying to improve small file creation performance, which is not what you’d expect coming from an HPC-type workload. So even these largest, large-type use cases, they have a lot in common with–

Homelabbers, yeah.

–home users’ use cases, where it’s creating lots of files. That sounds like maybe downloading your photos, or reading your mail spool, or writing out all your little text files for your codebase if you’re a software developer, reading lots of small files when you’re doing a compilation. So a lot of these use cases really transcend the large and the small.

I would put in the middle a lot of use cases of ZFS where companies have taken ZFS and embedded it into another product. So one example is folks like iXsystems and Nexenta, who have made general-purpose storage appliances based on ZFS. So inside is ZFS, and then they have a nice management interface that makes it easy. I think Plex is probably also in that category. And there’s a lot of people who might be using those products. FreeNAS comes from iXsystems, so a lot of people use FreeNAS in their home systems. A lot of people might be using those types of systems and not even know that there’s ZFS under the hood. They’re just like, “I have a Plex. I don’t know what’s going on.” Yeah, it has compression. That’s great. It has RAID. That’s great.

That’s interesting, because I’m in this homelabber scenario and I think obviously I’m a developer and I’m a tech kind of person, but there’s this world where in the future where people are going to have, I would probably guess, their own clouds in their house. As the technology gets more and more accessible, you’re going to have the need for potentially privacy, and storage, and stuff like that. For example, I’ve got a UniFi network, I’ve got UniFi cameras that are local to my network… There’s obviously some external accessibility via UniFi and stuff like that, but I’m trying to keep things local. Plex I can access from outside my network. That’s the extent I’m using it so far, and I plan to eventually move our podcast archive, which I think is around the same– it’s around 8 or 10 terabytes of data we’ve collected over the 12 plus years we’ve been doing podcasts… Which, you know, those archives are very precious to us. We lost the early days of this show, for example… It probably wouldn’t make or break the business, but we could never go back and alter that or remix them or remaster them or do something new in the future if we ever wanted to. So our archives are pretty precious to us.

[20:12] But I think it’s really interesting how you can do this project and serve such a wide degree of “customer type”, user type - enterprise, small businesses, to homelabbers. That’s just so wild how this file system could potentially power this future where I think more and more people will have a NAS on their home network. That’s interesting to me.

Yeah. And you mentioned private clouds, home cloud… There are a couple of companies– Joyent tried to do this, so taking the interface that you can get from a public cloud and sell a bunch of gear to companies that can deploy that on-prem and then have that same kind of cloud usability in a data center. You could imagine pushing that to smaller and smaller and smaller deployments, into home and small businesses… And I think some of the folks from Joyent like Bryan Cantrill, they have Oxide now, which is kind of doing something similar, taking on even more of the stack. But both Joyent and Oxide are using ZFS as part of their storage subsystem in those cloud in a box or private cloud type deployment scenarios.

So what is it that you think that makes people choose ZFS? Of all the choices they have, in the Oxide scenario, or in my scenario – I’ve got a 45Drives Storinator, for example, here. I think it’s got 15 drive bays, for example. I can fill it up with massive amounts of storage. It’s a Linux box. It’s running Ubuntu 20.04 or whatever. What makes someone like me or someone like them choose ZFS for this storage engine? Why is ZFS the choice? What particular features, what makes them choose it?

So I think it’s out of necessity. There’s really two kinds of data. There’s disposable data, where you can put it on whatever you want. You can put it on your thumb drive with FAT32, and it doesn’t matter. There might be some performance requirements, but the requirements are not that great. And in those scenarios, you might use ZFS for interoperability if you’re used to it, but there’s not a ton of use cases. But if you care about your data, then–

I care about my data, Matt…

[24:12] Yeah. And I think most people do. If you care about this data, then you need to be able to have some redundancy with it. So you need the functionality of a volume manager, where you can have multiple drives, and some of them can die, and you don’t lose all your data. And you need to know that the data that the drives give you is correct, so you need checksums. And even just those two basic requirements, there’s not– I mean, there are other technologies that do that. They are much harder to use, typically. And so I think the choices are - if you’re deploying it yourself, ZFS is just so much easier. If you’re making a product, then your customers might not know or care, but building your product on something that’s as capable as ZFS is going to pay long-term rewards. And the fact that ZFS is under continuing development and improvement means that your product has a solid foundation that’s going to keep up with what the future holds for software and hardware storage. That’s just the very basics, right? I think a lot of people even forget about that, because there’s all these other cool, great features of ZFS that are very exciting, like snapshots. Being able to protect your data with snapshots.

A lot of people nowadays are thinking about ransomware, and “What if somebody has some virus or whatever they call it that encrypts my data or alters my data or deletes it? How do I recover that?” Well, ZFS has built-in snapshots, it takes snapshots every day, every hour. The storage cost of them is very low. You’re only paying for the data that’s different, each snapshot, and the performance impact is basically nonexistent. So you get a lot of protection from accidental or malicious changes to your data very easily, and very low cost in terms of the hardware that you have to pay for it.
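
For reference, taking and listing snapshots by hand is only a couple of commands; the pool and dataset names here (tank/photos) and the snapshot name are made up for illustration, and in practice people often schedule this with cron or a tool built on top of ZFS:

    zfs snapshot tank/photos@2021-12-01    # point-in-time snapshot, nearly free to create
    zfs list -t snapshot tank/photos       # list snapshots and the space each one uniquely holds
    zfs rollback tank/photos@2021-12-01    # revert to that snapshot (the most recent one) if something goes wrong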

Things like compression built-in - I think a lot of people nowadays take this for granted. At least ZFS users probably take it for granted. But it’s not present in all the competing technologies, being able to just turn on compression. We’re using LZ4 compression by default, which is very, very fast. It doesn’t give you the highest compression ratios, but it means that you can just turn it on and not worry about it. You turn it on and typically performance improves, because you don’t need to read and write as much data to your disk.
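
As a rough illustration of what “just turn it on” looks like (the dataset name is hypothetical); note that compression only applies to data written after the property is set:

    zfs set compression=lz4 tank/projects              # enable fast LZ4 compression for new writes
    zfs get compression,compressratio tank/projects    # check the setting and the achieved ratio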

And then in more complicated deployments, people look for things like replication. I have data that’s on this machine, I want to get it to this other machine. I could use rsync, and that would probably work just fine if I just need to take one copy of the data one time. But if I need to continually move the changes over, rsync is very, very slow, because it needs to check every file and every block of every file to see if it needs to be sent. And then if you’re using ZFS to begin with, obviously you want to preserve things like the snapshots that you have on the source system, the compression that you have on the source system. So ZFS has these built-in send and receive commands that let you serialize the contents of a snapshot, send it over to another machine, and it preserves all the complicated stuff like ACLs, access control lists - maybe sparse files and other esoteric things that might not be used that often, but when they are used, you don’t want to have to worry about “Is rsync preserving them correctly?” or whatever.
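
A minimal sketch of the snapshot-based replication Matt describes, assuming a second machine reachable over SSH and using hypothetical pool, dataset, and snapshot names; after the initial full send, incremental sends move only the blocks that changed between snapshots:

    zfs snapshot tank/archive@monday
    zfs send tank/archive@monday | ssh backuphost zfs receive backup/archive               # first full copy
    zfs snapshot tank/archive@tuesday
    zfs send -i @monday tank/archive@tuesday | ssh backuphost zfs receive backup/archive   # only the changes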

This is one area of ZFS I’m not taking advantage of right now. I do some replication; basically, backup. Because RAID is not your backup, right? It’s just being able to store more and do more with a volume, not so much necessarily an actual backup. And this is one area where I’m not taking advantage of a feature, really, and it’s just because I’m still getting into the ZFS world, where do I find information… I find that it’s actually kind of hard to find all the information of what you can do with the power of ZFS, for example. There’s a lot out there and I know it’s probably getting better, but 20 years later, I’m still like, “Wow.” There seems to be a lack, or at least a vacuum; maybe not like somebody’s doing a bad job, but more like there’s an opportunity.

[28:18] If somebody’s out there doing more– I mean, there’s a couple books out there that I’ve picked up, that I’ve liked a lot as well, that really helped me school myself on what ZFS is and what it can do, but replication is one particular area where I’m still using rsync. I’m still using rsync and moving stuff over to a different store now. Granted, currently, that separate store is not a ZFS store, so I couldn’t do replication there. But when I do fully move over all my stores - I have a couple different RAID scenarios where I’m not using ZFS everywhere. I’m only using it in this one pool currently, and mostly because I want to prove that it works well, I can actually manage it. And so to your credit and everyone else’s credit involved in it - yeah, it’s pretty user-friendly. I can use ZFS pretty easily. It’s pretty easy to create a pool, pretty easy to manage a pool, pretty easy to do scrubs and stuff like that to verify my data… But that’s the extent I’m doing it. And it’s pretty set-it-and-forget-it. I’ve pretty much gotten bored, because it doesn’t require a lot of maintenance. So good job on that part, at least.
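
For completeness, the scrub workflow mentioned here is short; the pool name is a placeholder:

    zpool scrub tank     # read every block, verify it against its checksum, repair from redundancy where possible
    zpool status tank    # shows scrub progress and any device or checksum errors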

That’s good. Yeah. I would say don’t feel bad about not using all the features of ZFS.

What’s there? I want to use it, Matt. If I have another ZFS store, I’m not going to be rsyncing; I’m going to learn replication at that point.

I think that’s great to learn that stuff. I would say that the ZFS enthusiasts who become evangelists talk up all of these capabilities of ZFS. And I think that, for sure, they’re all useful in different scenarios, but ZFS has a lot of capabilities, it can do a lot of things. That doesn’t mean that you should do all of them in all deployments, right?

It has all those things so that it’s flexible and can be used in a lot of different scenarios. And being interoperable with being able to run rsync to send it to another machine is like, “That’s just fine.” That said, I love ZFS send and receive, and it’s really cool. So you’re asking, where can you learn more about how to use this stuff…

Sure. Yeah. What resources are out there?

Yeah. The books that have been written about it are pretty good. The FreeBSD Mastery: ZFS is a good one. I think that there’s a version two or like Advanced Mastery. I forget what they called it. We’ll look it up on Amazon.

Well, the one I have is FreeBSD Mastery: ZFS, and that’s Allan Jude and Michael W. Lucas. And I think there’s a second counterpart to that, which is even deeper. So it’s like advanced, or something like that.

So I think those are both great resources. Those are the ones that are coming to mind right now. If you want to get an education, I think that those books are the way to go. But online forums and stuff are also very useful. It’s just like it’s more immediate and more personal, but the quality of information you’re getting is more variable, right?

I’ve gotten a lot of mine from YouTube, various blog posts, obviously Stack Overflow here and there; the book I’d mentioned, I haven’t gotten the advanced version of it yet, because I’m just not quite there yet, but it’s been very helpful for me. I’ve been taking my own notes on different commands I run for establishing a new zpool and stuff like that. We’ve gotten this far, actually, without even talking about– we talked about features, but not the specifics of breaking down what ZFS does. So it is the open source project, OpenZFS…

We haven’t talked at all about its–

The features and capabilities, and like “What do I type to do this?”

[31:54] Yeah, exactly. So how do you create a zpool, for example? What’s the step? If I wanted to create a six-drive scenario where maybe I’m a homelabber, I’m doing Plex - what would I do to create a six-drive? Would I do RAIDZ-1, RAIDZ-2? How do you choose which RAID level to choose, all that good stuff? How do you even choose the number of drives? Obviously, there’s a cost factor, but I think there’s a number where if you wanted to do RAIDZ-2, you should do it in multiples of– I think it was like 6, 8, 12… Some sort of number; you don’t want to do 7, for example, because it doesn’t map out well. Help me understand that.

Yeah. So actually, I wrote a blog post about this. I wrote a blog post basically saying, “Don’t worry about that number.”

So in my opinion and experience, the specific width of how many drives do you have - there aren’t magic numbers there. Basically, the more drives you have, the more performance you’ll get, the more space efficiency you’ll get… And there aren’t really magic points on there that are more optimal than others, aside from some very, very specific scenarios that don’t apply to common cases, right? Basically, if you’re using a database, it has a fixed record size. If you’re for some reason not using compression, then maybe there’s some more optimal things there.

We wrote a blog post – I think if you search for RAIDZ, you’ll find it. The title of it is, How I Learned to Stop Worrying and Love RAIDZ, and it goes into excruciating detail about why this is true, about why people think that you need this power of two plus N, and then why it is not really applicable. And a lot of the reason is because you want to use compression, and probably either you’re using compression, or you’re using very large files in very large block sizes, because you have videos or something like that that’s not compressible.

And if you have compressible data, then you should be using compression, and then you end up with variable block sizes. ZFS takes 128 kilobytes of data and compresses it down to a multiple of whatever the sector size is, so like four kilobytes. So you might have a big file where the first block of it compresses to 70 kilobytes, the next block compresses to 70 kilobytes, the next one compresses to 104 kilobytes, right? That means that any kind of fancy math that you’re trying to do to arrange for things to be laid out just perfectly just isn’t going to fly. And then on the other extreme, if you have large files, then they have large blocks, and then everything is easy and there’s no need to worry about getting things perfect, because when you have those large blocks, they can spread evenly over all the discs. So that’s one less thing to worry about, which is good, in terms of your deployment.
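
If you do want to see or influence the block sizes being described, the per-dataset recordsize property is the knob; the dataset name and the 1M value below are only an illustration of the “large files, large blocks” case, not a recommendation:

    zfs get recordsize,compression tank/movies    # the default recordsize is 128K
    zfs set recordsize=1M tank/movies             # larger records for big, incompressible media; affects newly written files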

Yeah. I’d love just to lean on ZFS to be smart enough, to not have to worry about the number of discs and–

I mean, that’s kind of what you want to do, right? You want to take away as much of the mental overhead of planning out a new store as you can, right?

Exactly.

You want to put as much of that into the software managing it, rather than onto the person choosing the number of discs. You want to be able to choose things like reliable hardware, reliable operating systems, things that you can plan for, not so much, “Should I use seven or six discs?”

Yup. And you see that reflected in the user interface of ZFS. There are a lot of properties you can change, and things you can do with it, but hopefully, those things are all there for a reason. They have a real impact on what you’re trying to do. And then there’s a lot of the internals that you can tweak with kernel module parameters that are semi-documented, not supported, and change the internal workings… You just shouldn’t have to deal with that. For the vast majority of people, you should never need to think about that.

[36:12] Of course, sometimes we fall short and there are things where it’s like, “Oh yeah, you really do want to change that tunable in this scenario to get really good performance for this particular workload.” But the goal is that you don’t need to do that. The system gives you very good performance and semantics out of the box, and then you can express your intent with the commands to set properties and stuff.

So getting back to your question of, “I have a home lab. I have some discs. What do I do?” Yeah, typically, with the number of discs that you’re talking about for that kind of scenario, it’s like 4 to 12 discs; probably, you’re going to create one RAIDZ group and you’re going to put all your discs in it, and it’s going to be either RAIDZ-1 or RAIDZ-2. So there’s not a lot of real decisions there to make. I think you’re going to be running a command that’s like zpool create, give the pool some name, RAIDZ-2, and then just list each of the six discs that you have. Whether it’s RAIDZ-1 or RAIDZ-2 comes down to how much redundancy you want. With RAIDZ-1, it can tolerate one disc failing and not lose any data. If you lose that second disc before you’ve replaced the first one, then you lose all the data. With RAIDZ-2, you can lose two discs without doing any replacements, and you’ll still have your data.
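
Concretely, the command Matt sketches might look like this for six drives; the pool name and device names are placeholders, and on Linux people often prefer the stable /dev/disk/by-id paths:

    # one six-wide RAIDZ-2 group: any two drives can fail without data loss
    zpool create tank raidz2 sda sdb sdc sdd sde sdf
    # or, trading one drive's worth of redundancy for more usable space:
    zpool create tank raidz1 sda sdb sdc sdd sde sdf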

So in industrial deployments, the consideration is really “How long does it take to replace a drive before you get back to full redundancy?” And people are configuring spares, and timing how long it takes to do the resilver, and all that kind of stuff. For small home deployments, people probably aren’t doing that. They’re probably not configuring spares. The time to do the replacement is however long it takes me to order the disc online and get it shipped to me, or whatever. That’s the long pole.

So I’d say as a rule of thumb, if you have a bunch of drives, RAIDZ-2 is going to give you more redundancy than you’re probably ever going to need. If you want to live a little bit dangerously and risk like, “Hey, if two drives fail within the same week, then I’m going to have to go back to my backups”, then RAIDZ-1 will save you a little money; it saves you the cost of that one drive.

Which can get very expensive, honestly - especially for a Plex server, for example. If you want a lot of storage - and I would say performance too; it’s not necessarily that you need all the size, but that you want the performance between the discs - maybe 10 terabytes is a common size for an individual disc for a NAS for, say, a Plex server, or anywhere you want something with decent performance. Maybe six, maybe eight, but eight to ten and anything above that tends to be a higher-performance drive in terms of spin, movement of data and whatnot, throughput, on the actual disc itself. But that’s a pretty large disc. So if you’ve got four of those, that’s - what, 40 terabytes?

It’s a lot of storage.

Right? If you’ve got six of those - let’s just do some math, Matt… That’s six terabytes. I mean, that’s 60 terabytes. It’s a lot. But in a RAIDZ-2 scenario, you’ve got reservation, then you’ve got– I don’t know how you say the other word, but it’s like reversation? I don’t know. It’s like–

Oh, yeah.

What’s the other word? What is that stuff when you get into the semantics of how you plan for overhead in these scenarios?

[39:52] Yeah. So now we’re talking about like - you’ve created your pool, you have a bunch of different kinds of data on there, right? Maybe you have your video files, and you have your home directory, and you have your movies, and you have a cache of other stuff… How do I manage that? So typically, people would create different file systems for each kind of use case. And so in ZFS, when we’re talking about the ZFS file system, the storage pool is all of the discs that ZFS is managing, and then you can create file systems on the fly. They aren’t assigned any particular space on the drives, they just consume and release storage as needed. So creating these ZFS file systems inside the storage pool is very cheap and easy. And we use those file systems primarily as administrative control points. So you could say, “Here’s all my video files. Don’t bother trying to compress them, because they’re already compressed. On the other hand, here’s all my source code files for my development project. Let’s compress those.” So all the different ZFS settings and properties are per file system, so you can set them differently for different types of data that you have.
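
A small sketch of the “file systems as administrative control points” idea, with hypothetical dataset names; each file system draws from the same pool but carries its own property settings:

    zfs create tank/movies
    zfs set compression=off tank/movies    # already-compressed video gains nothing from recompressing
    zfs create tank/src
    zfs set compression=lz4 tank/src       # source code compresses well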

Now you’re asking about reservations, and we call them ref reservations, because it stands for referenced, so referenced reservations. So you might want to think about like “I have some space that’s for one use, and I want that use to never exceed some amount, and I want this other use to have some reserve space. I want to always have some space available for my software development project, but maybe the movies… Or my kids are dropping their movies into some NFS share” - probably not really an NFS share; it’s probably CIFS, or some fancy Dropbox thing on top of it, but “My kids are dropping their movies into some other section of this. I’m going to put a quota on that, so they can’t use more than five terabytes or whatever.” So quotas and reservations let you do that.

And that’s at the ZFS create layer, not the zpool layer?

Yeah. So that’s at the ZFS layer.

The file system. Yeah.

Yeah. So you have the pool, it has your 60 terabytes, but each file system uses a variable amount, just depending on what’s in there at the moment. And the reservations and quotas let you control that. So specifically, the reservation says, this file system and all the stuff associated with it, so all the snapshots and all the descendant file systems - so the file systems can be arranged hierarchically, where the children inherit property settings from the parents. So you could have– maybe you want to limit the kids’ movies, but each kid has their own directory, so you make that directory a file system. So there’s one kid’s file system, the other kid’s file system, and then there’s the parent file system that’s all of the movies. You could set quotas at any of those levels, and you could set a quota at the all-movies file system, and then that limits the space used by all the file systems beneath it, put together.

So the ref reservation is talking about, “I want to set a reservation that is for the space– it’s for what I, as a user, ignoring compression and snapshots and other stuff, I can see there how much space… Like, I want to reserve space that’s just for that.” And the idea here is the system administrator configured some snapshots, and those snapshots take up some space, but I want to reserve space that ignores those snapshots.

So the ref reservation is actually more expensive in terms of it reserves more space, because… Ignoring snapshots, I’m like, “Well, I already have a terabyte of space in here. I have a two terabyte quota, which means I can write a terabyte of new stuff, and maybe I could delete that original terabyte of stuff and replace it with other things. So I can actually write two terabytes of data here if I delete what’s already there.”

[44:04] So that means that the ref reservation has to make sure that there’s actually two terabytes available if there’s a snapshot of that original one terabyte. So it needs more space, but it’s kind of taking a different view. If you’re thinking as the system administrator and thinking about the cost of the snapshots, then you would use the reservation. If you’re thinking as the end-user and you want to ignore what snapshots are there or aren’t there, then you would want to use the ref or referenced reservation.
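
To make the distinction concrete, a hedged example with made-up names and sizes: quota caps a file system and everything beneath it, reservation guarantees space counting snapshots and descendants, and refreservation guarantees space for just the data the file system itself references:

    zfs set quota=5T tank/kids               # the kids' movies, snapshots and all, can never exceed 5 TB
    zfs set reservation=1T tank/projects     # guarantee 1 TB for projects, including its snapshots
    zfs set refreservation=2T tank/projects  # guarantee 2 TB of writable space as the end-user sees it, ignoring snapshots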

So that you can know how much storage you have available, right? That’s really to get an accurate depiction of what you have. And if you’re the administrator, you want to have a different view of the world. If you’re a user, you have, obviously, a micro-view of the world, because you don’t really care about the snapshots and stuff like that. It’s all the–

Yeah, exactly.

…fluff to you. It doesn’t matter. That’s the administrator’s job. And you might be that same person. You know what I mean?

In my case, I’m the same person; I’m the user and the administrator.

Yeah. If it’s the same person, then it’s a little easier to understand what’s going on. But when there’s multiple people involved, then these concepts - the detail and richness of these properties - are useful.

One thing I haven’t heard you talk about yet - which I feel, as somebody who hasn’t designed the system, hasn’t been involved for 20 years and is just a user, is the killer feature, in my opinion - is copy-on-write. It’s this ability to be a secure file system - because you want data integrity, you want to verify your data, you want to automatically repair it - and it’s that copy-on-write that really sets ZFS apart from every other file system I can think of. Can you talk about that a bit?

Yeah. I mean, it’s so fundamental to ZFS that we probably forget about it.

And you kind of have, for 40 minutes…

Yeah. And it’s not a feature, right? It’s not a user-visible thing, but it’s a kind of enabling data structure that allows us to have zero-cost snapshots, and to be able to have always– the data is always self-consistent. If you crash or pull the plug at any time, you don’t have to run fsck. What you have on disk is always a consistent view of the file system. So that was one of the decisions we made very, very early on. Before we wrote a line of code, I think we decided that that ZFS would be copy-on-write.

What made that come to light? Since it’s so fundamental, it goes that far back… Why did no one else copy this feature? Or how often is it used elsewhere? Why is it so fundamental? How did you get it there?

First I’ll answer why aren’t other people doing it, and then I’ll talk about who is doing it. So why aren’t other people doing it? Well, like a lot of features in ZFS, there’s a cost to it, in terms of a runtime cost. It can make things slower in certain scenarios. And copy-on-write can also make things faster in a lot of scenarios… But checksums, which are on by default in ZFS - we view it as a fundamental, enabling thing that everybody should be using. And if you really, really have some hyper-specific use case where you can’t pay the CPU cost of checksumming - okay, I guess we’ll let you turn it off. Or maybe this just isn’t the file system for you. I don’t know. We took that view with a lot of things in ZFS, including copy-on-write. But that’s why it hasn’t been used more widely - it’s complicated, and copy-on-write makes performance different, at least; not necessarily worse, but different.

[47:45] Now, as to how did we decide to use it, and who else is using it - well, at the time, I think the other major use of it was in WAFL, which is NetApp’s proprietary file system. And I think they had used it to great success, especially with enabling snapshots. A bunch of the kind of details of how they implemented the snapshots are different than how we keep track of them. But the fundamental idea of like we’re always going to be writing new data to new places on disk; we’re not going to be overwriting existing data in place - we saw that as snapshots are going to be just a base requirement in the future, and doing it any other way than copy-on-write is just not scalable. And you see that even today, there are snapshots in things like UFS, but it’s like you get one snapshot, or you can have snapshots, but when you create the snapshot, it takes minutes and minutes and minutes to go copy a bunch of metadata, or whatever. And we wanted it to be easy and cheap. We wanted to give people no excuse for not protecting their data with snapshots.

It makes sense.

The other thing that we saw, which is more from direct pain experience, is that with earlier file systems, if you crashed, you had to run fsck. And the bigger your discs got, the bigger your file system got, the longer it took to run fsck. And a bunch of file systems added things to reduce that time on big file systems, so you didn’t have to scan every data structure. You knew that it was only these certain ones that needed to be scanned, but still, we saw the trend of hard discs getting bigger and bigger and bigger. And even 20 years ago, it was going to take an unreasonably long time to run fsck on that server that we were administering in the Sun kernel group, Jurassic. So even on Jurassic, which is– I mean, it had a bunch of discs. I’m sure that it was a tiny amount of storage compared to today’s storage systems, but it took like an hour to run fsck every time the system rebooted; or at least every time it crashed. So we saw that as totally unacceptable, and rather than make incremental improvements on fsck, we wanted to design the problem out of existence, so that we didn’t have to worry about it as disc sizes increased. And I think, in retrospect, that was definitely the right way to go, because storage is just so huge nowadays.

And the idea for copy-on-write is that– and I suppose because we do have much larger discs now, there’s more opportunity to always write new data, versus writing over old data. You always have a lot of disc space available, or at least in large pools. Sure, it’s not “always” true, but the idea is that you can always write new; and that makes snapshotting easier, because you can just point to the newly written file, versus the overwritten file. And if you ever needed to revert, you could point back to the old file that was not overwritten, which is what makes your snapshots so much faster and so much easier. You can promote a snapshot to primary and kill the old thing altogether. There’s a lot of interesting things you can do, very much similar to the way Git operates even, right? It’s very similar to the way Git operates, with master and branch, or your main branch and different branches and stuff. It’s very similar to that, at least in terms of how you fork the data.

Yeah, definitely. Forks are just like ZFS clones, in terms of they start the same as some base, but then you can diverge them to put your changes into each one, and then you can go back to how it was before, and you can create lots of ZFS clones easily… Or lots of branches easily, I should say, about Git.
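
The Git-branch analogy maps onto a few commands; the dataset and snapshot names are illustrative:

    zfs snapshot tank/vm@golden                  # fix a point in time, like a commit
    zfs clone tank/vm@golden tank/vm-experiment  # a writable "branch" that shares unchanged blocks with the original
    zfs promote tank/vm-experiment               # swap the clone into the primary role, as described above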

Yeah. What about the real feature I think people have been waiting for– especially homelabbers. I’m not sure – you could speak to the enterprise customers needing this, but I think it’s more apparent to homelabbers, because you want to start small, because you tend to have less money invested in discs, and you want to eventually expand. But once you establish a pool and you create some file systems– I’m not even sure it’s landed yet. I know it’s a pull request out there, which is RAIDZ expansion, being able to expand.

[52:04] I could just imagine – I was watching your talk on this, and I could just imagine the amount of mental overhead, the spaghetti in your head, thinking about how to explain… So I’m not going to ask you to necessarily explain RAIDZ expansion, except for as you might need to. You don’t need to point out the details, because you need a screen for that, you needed to demonstrate it. This is a visual thing for sure. But this is a feature I know that’s been long-awaited, being able to establish a RAID array and be able to expand it from six drives to eight drives, with no foreseeable necessary penalty, basically. Less pain. Now it’s possible, but it’s a PR, from what I understand. What’s the details on that?

This kind of brings us back to what we were talking first thing, about this being an open source project and how do we serve the home users. So this is a feature that has been requested for years and years and years and years.

Yeah, for sure.

But enterprise users don’t have this problem, because it’s like we buy the discs by the shelf-full. You just add a new shelf and create a new RAIDZ group from the stuff on that shelf, right? Or you just buy a new rack. It’s like a whole new rack and a new system, and then that’s what you’re going to use for the next 10 years… Hopefully not that long, but… Lifecycles in enterprise are very long. But for home and small users, it makes a lot of sense to say, “Look, I started out with four discs. I mean, these discs are not cheap when it’s coming out of my own pocketbook.” You’re talking about laying out $1,000 or something, and then I don’t want to have to lay out $2,000 or $1,500. I care about every extra hundred dollars. So sizing it for just what you need initially makes sense. And then a couple of years down the road when your storage needs to grow, you add one more disc or two more discs, without having to buy a whole new batch of discs, get your friend to bring their system over so you can copy it over there and then reformat it and then copy it back…

A lot of pain.

People don’t want to do that. It’s no fun.

So basically, this project is about doing all that complexity for you under the hood. You add a new drive… We have to move all the data around to spread it out over all the drives, including that new one, but it all happens automatically. You just type zpool attach, blah, blah, blah, blah, blah, hit Return, it says, “Great, the expansion is in progress. It’ll be done in 20 hours”, or whatever, when we’ve copied all the data around.
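
Since the feature was still in a pull request at the time of this conversation, the exact syntax could change, but the “zpool attach, blah, blah” Matt mentions looks roughly like attaching one more disk to an existing RAIDZ vdev (pool, vdev, and device names here are hypothetical):

    zpool attach tank raidz2-0 sdg    # grow the existing RAIDZ-2 group by one disk
    zpool status tank                 # reports the expansion's progress and estimated completion time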

But the interesting thing about this project is how did it come to be. So a long-requested feature - how did it get funded? So actually, it’s funded by the FreeBSD Foundation. The FreeBSD Foundation is a nonprofit. They help to run the FreeBSD project, but they don’t run it, right? So it’s run by volunteers, but the foundation helps with administrative stuff. And one of the things that they do is they fund software development. So they contacted me a long time ago…

I can imagine.

I think three or four years ago. Actually, it’s got to be more than that, because I remember working on this when my second child was a baby, so at least four years ago now… And they said, “Look, we have this idea. We want to do something for ZFS users, something that’s going to help the small users, that isn’t getting done by the contributors today who mostly come from–”, well, I should say from the people who are developing new features. There’s lots of contributors that are maybe not even C developers, that are contributing lots of new tests and changes, all kinds of cool stuff. But the new features are primarily coming from these enterprise use cases who are funding developers. How do we get something developed that’s going to help the everyday user?

[56:11] So they came to me with this idea of doing RAIDZ expansion, and I came up with a design of how it could be done, proposed it to them, and they said, “Yeah, let’s do it.” Given the timeline, I said it’d be done in a year, and four years later, it’s almost done.

[laughs] That’s awesome.

Yeah. I mean, this is primarily because of constraints on my time and just not being able to spend as much time as I thought I would on the project.

So do they come to you personally, or do they come to you through your employer? Because your time on ZFS and OpenZFS is probably pretty divided in terms of how you personally spend it, right?

They came to me personally because they know me from speaking at the FreeBSD conferences, and stuff like that. Fortunately, I was able to arrange it so that the consulting work that I would do is actually through my employer, Delphix. So it makes it easy for me to not have to clock out and do the work at night using non-company systems. So it’s easier for me, and it kind of fits in with my role at Delphix, which is to develop software that’s for Delphix and to be a leader in the community. So I’m fortunate that Delphix values open source and they’ve seen the value of being a leader in this open source community in terms of our brand among engineers, and being able to do recruiting… Our team has recruited a fair number of employees from the OpenZFS development community. So the time that I spend reviewing pull requests on OpenZFS is on-the-clock time, right? I mean, my company is paying me to do that, which is pretty great. I don’t have to do it only on my nights and weekends.

You’re living the dream.

Yeah. So it’s worked out well for me, and I’m definitely very fortunate to be in that situation.

So four years later though, RAIDZ expansion.

Four years later, RAIDZ expansion…

The PR has been opened…

It’s not landed in terms of–

It hasn’t landed yet.

So is there a plan for 3.0?

It is hoped for 3.0.

I would be cautious about using the word “plan” because we don’t– OpenZFS doesn’t have developers on retainer, right? We’re not paying anybody to develop anything. It’s all volunteer. So we can’t speak too strongly about plans. We can do a release, but we can’t make anything get in, right?

It takes a lot of people to get it in. I’ve done most of what I need to do as the developer to get the PR ready, but other people– it takes a lot of contributions from different people. So we need people to do – code reviews is the big one, and there’s a lot of code there, a lot of very tricky code. So we need other experienced developers to do code review. We also need other people that might just be users to do–

Testing.

–testing. That’ll help give confidence to other folks in the community that this is going to work right and not break their pools.

And I guess that’s just as easy as – if I was an end-user, to test that, I could use send to send all my data to a new pool. Obviously, I have to invest in the hardware and the drives and stuff and replicate, essentially, in my scenario. But I could use a copy of my existing production data, essentially, in a new zpool - the exact same scenario - and do an expansion on that pool. That could be a way that an end-user could help. But it’s a matter of getting access to that feature branch and being able to compile it and put it on their machine, which probably takes a lot of effort.

Yeah. It takes a little bit of doing to know how to compile and install, because it’s a kernel module. It’s a little more complicated than just downloading your normal thing. It should just work - there’s automake and autoconf, all that stuff is there. So the steps are: type configure, then make, then make install. However, depending on the particulars of the system, getting it installed correctly in a way that it gets picked up - for example, in preference to the kernel modules that might already be there, if you have an Ubuntu system that comes with ZFS kernel modules already - there can be some tricks there. But this is a well-tried road. I mean, there’s hundreds and hundreds of contributors who have gone through these steps on all the different operating systems. So if you would like to help, it might not be a one-liner, but there are a lot of people that can help you.
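
In practice, that well-tried road looks roughly like the following on a Linux box; the package names for kernel headers and build tools vary by distribution, so treat this as a sketch rather than exact instructions:

    git clone https://github.com/openzfs/zfs.git
    cd zfs
    sh autogen.sh             # generate the configure script
    ./configure
    make -j"$(nproc)"
    sudo make install         # installs the kernel modules and the zfs/zpool userland tools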

What’s the best place to go to get that help then? Would you say the repository, or would you say an issue, or the mailing list? What’s a good place to say, “Hey, as a willing participant, I’ll help test this at least as an end-user”?

If you’re looking to volunteer on something specific, like “I want to help test RAIDZ expansion”, then probably commenting on the PR would be the right place. If you’re looking for, “I’m trying to compile this so that I can help somebody else. How do I get it installed?”, then the mailing list would be a great place to ask.

Okay. Cool. Well, before we move on to another topic, is there anything else in, say, the feature set of ZFS that makes people– like, we talked about copy-on-write being a killer feature. You didn’t really mention it, because it’s such a baked-in 20-year feature. It’s just the system. That’s just how it is. It’s not even a feature these days. Anything else? I mean, I think I like the ideas around the intent log, I believe is what it’s called, and the L2ARC, which I believe is the cache…

[01:04:14.03] Yeah, the L2ARC cache.

The layer two cache, I believe. Those are a couple things that can sort of speed up systems. That’s one thing I’m actually taking advantage of. I have an SSD as my cache, which is a one-liner – I installed the hardware, one-liner to add it… You make it so boring to manage the system… Come on, man. Make it harder, right? It’s super easy to manage the ZFS system.
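
The one-liner Adam refers to for adding an SSD read cache (L2ARC) looks something like this, with placeholder pool and device names; a separate SSD can similarly be added as a dedicated log device (SLOG) for the ZFS intent log:

    zpool add tank cache nvme0n1    # SSD becomes an L2ARC read cache
    zpool add tank log nvme1n1      # optional: SSD as a dedicated ZFS intent log device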

That’s great that that’s your experience. I mean, that’s absolutely our goal, is to make it easy to manage. So I would say, the killer features are the ones we’ve talked about: RAIDZ, compression, checksums, snapshots and clones, and replication. And those are things that have been in ZFS for a long time. Obviously, we’ve refined over the years and added other new stuff, but those fundamentals are what sell it for 99% of the users.

I want to dovetail a little bit back to the past, to some degree. There’s a ZDNet article that quotes Linus Torvalds as saying, “Don’t use ZFS.” I’m sure you’ve seen this, and I’m sure you’ve read it. And I think it’s really around licensing. He says, “Don’t use ZFS. It’s that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me.” So I want to go back into the past to some degree, back to the Sun days, before Oracle acquired it… Were you involved in licensing it? Share some of the drama, I suppose, behind the scenes, that kind of made OpenZFS possible… Because it was really close to not being possible, with the acquisition. Thankfully, Sun, and potentially even you and Jeff, were contributors to the idea to use the Common Development and Distribution License 1.0, which is an OSI-approved license, which definitely makes it open source by being OSI-approved, in that thin layer of like, yes, it’s open source, or no, it’s not open source. But this ability to keep it going after this acquisition, that’s what I want to talk about. So I mentioned Linus saying, “Don’t use ZFS.” What’s the backstory there?

Yeah. So the interesting thing is that when we started working on ZFS back in 2001, it was part of Solaris. Solaris was proprietary software. It was, I think at the time, only available with Sun’s SPARC-based hardware. Solaris x86 came and went a couple times, so maybe it was available on x86 hardware at some point. But as far as we knew, when we started out, we were developing proprietary software. But a couple years into it - I’m going to say maybe 2003 - they started working on OpenSolaris, meaning working on open sourcing Solaris and creating OpenSolaris. And I wasn’t involved with those decisions or the licensing decisions. I was a junior engineer, two years out of college. Nobody asked me. [laughs]

“He doesn’t need to know about this.”

Yeah. So when w – well, at least when I found out about it, I was thrilled. I thought, “Oh, this is great. We’re going to do an open source Solaris. ZFS is going to be part of it. We’re going to open source it. This is wonderful.” We definitely didn’t imagine how successful it would be outside of Sun at the time, or how enabling that would be for our technology and my career to continue for so long. So we kind of lucked into that. I lucked into it being open sourced. We released it as open source first. When we integrated it into the Solaris codebase in October of 2015– sorry, in October of 2005 (a long time ago), it went out as open source the next week, before we’d ever shipped it in a product. That was really cool. People started using it, picking it up from the OpenSolaris bi-weekly builds.

[01:08:10.06] So then from 2005 to 2008-2009, we were developing it in the open. It was picked up by– I think maybe FreeBSD was the first other operating system to take the code and port it to FreeBSD. It became very successful there. And then towards the end of that, it was picked up by the folks at Lawrence Livermore National Labs to port it to Linux as well.

Maybe I should talk about the licensing a little bit now. So the CDDL, as you mentioned, is an open source license. It was created by Sun to open source Solaris and create OpenSolaris, which ZFS is part of. I can’t really speak to the motivations, like why they didn’t use an existing license, or why they came up with the particular terms in the CDDL… But my understanding of the intent is that it’s a weak copyleft type of license, which means that the changes that you make to ZFS or other CDDL-licensed software… Like, if you make changes to our software and you ship those changes, then you need to make your modifications available. So that’s kind of similar to the GPL, as opposed to more permissive licenses like the BSD or Apache licenses, which are basically like, “Here’s some software. You can do whatever you want with it. You can contribute changes back if you want. You don’t have to contribute back the changes.” So it’s kind of more similar to the GPL in philosophy, in terms of you need to contribute back the changes that you make.

The main difference that I see with the GPL versus the CDDL is that the CDDL explicitly applies on a per-file basis. So if you wanted to do something with ZFS and add some new functionality and not release that new functionality as open source, you could put it all into a new file and compile it with the rest of the ZFS code… Maybe you have some changes to the existing ZFS files that you do have to open source, but you could do that and not open source your new source file, and keep your new feature private in that way if you wanted to… Versus the GPL - it’s not as explicit about what constitutes a change that needs to be open sourced, and people generally interpret it much more broadly… Like, anything in the vicinity, you’ve got to open source it, and GPL it too. If your code is near our code, then your code has to also be GPL. That’s how people interpret it. And I’m deliberately being vague about what “near” means, because there are dissenting opinions about that.

So Linus’ comments - as you heard in the quote, I think that Linus has no love for Oracle. And I think that he’s concerned, or at least at the time that he wrote that, he was concerned about–

Litigious Larry. He refers to him as Litigious Larry, meaning Larry Ellison of Oracle.

Yeah. So I think that the reason he was saying “Don’t use ZFS” was to avoid Larry suing you, sort of, is how I interpreted it. And I’m not a lawyer, I’m not giving anybody legal advice, but nobody has been sued for using ZFS since the NetApp lawsuit, which was a NetApp versus Sun lawsuit more than 10 years ago. And nothing came out of that lawsuit; nobody won or lost, everybody just dropped it.

[01:12:02.09] The reason why I bring this up is less to be provocative, like “Oh, Linus says don’t use ZFS”, and more around this unavoidable tension between the developers, you and everyone else involved in the creation of ZFS, it eventually being open source through this license, the world-changing opportunity of software, and then the license that stands between that opportunity… Because I quoted Linus as saying that, but I didn’t quote him as saying (which I’ll do now) that he couldn’t integrate it… I’m reading between the lines here, but it seemed as though he wanted to integrate ZFS into the Linux kernel, but was unable to do so because of, essentially, the license: the GNU GPL that Linux stands upon, and its difference from the CDDL that ZFS was licensed under as part of OpenSolaris. And then I’m sure there’s some details in there that made OpenZFS possible, which is super-awesome, because despite this acquisition, this accidental, to some degree, open sourcing of ZFS, it gets to live on, and you get to have a career beyond this proprietary software you were originally hired to build… Which I think is super wild in terms of a journey for a software developer like you, and then a community to appreciate and enjoy and use your work. If you wrote your best software and no one can use it, did you write the software? You know what I mean? Kind of like the tree - did it fall? Did it make a sound? …kind of thing. It’s almost like that.

Yeah, I agree.

If no one can adopt your software and enjoy it, did you write the software? Kind of no, really. Right?

Yeah. I mean, that’s one of the reasons that I really love open source, is that it makes the software available. It makes it available without the constraints of any one company living or dying or deciding to do whatever. If it’s good software and it’s useful, then people can continue using it and extending it, and it can continue to be relevant. The fact that we could take the ZFS that Sun was doing in 2009 or ‘10 and run with it as part of the Illumos project and part of the OpenZFS project - I mean, that’s open source. That’s what it’s supposed to be. There wasn’t really anything special that let us do that; the fact that it was under an open source license is what let us do that. But I think it wasn’t a given that people would actually pick it up and continue the software development. So that’s one of the reasons that the Illumos project was created, to kind of continue OpenSolaris in general, and then we created the OpenZFS project to unify and provide some kind of leadership around the ZFS development that was happening on Illumos and FreeBSD and Linux altogether.

And that happened in 2013, right? OpenZFS began in 2013.

The original project, proprietary, way back before it was even open source licensed, started in 2001. So you’ve got - what, 12 years between the inception of the project and OpenZFS, and several years before the Common Development and Distribution License came along, when you did the OpenSolaris part of that… I mean, if that didn’t happen– I don’t know who did that inside of Sun, but if that didn’t happen, then ZFS as we know it would’ve died, and it would be inside Oracle now… Because Oracle is still developing Solaris. It would be the closed source ZFS, which has continued. You’ve got a fork in the road. This is history that I’m sharing with the listeners. There’s a fork in this road of ZFS, which bifurcated, really. You’ve got the OpenZFS version that began in 2013, or whatever timeframe that was; maybe the 2009 snapshot of the project. And then there’s still the closed source Oracle version of ZFS, inside of Oracle, which I guess is just called Oracle ZFS.

Yeah. So Oracle continued developing ZFS internally, and just not sharing that source code with anyone. And that’s fine. And the open source community picked up the open source code and we’ve continued developing that. And people maybe have asked which one is better.

[01:16:13.26] That was my next question. Which is better, Matt?

That’s really an academic question, because nobody’s really baking off open source ZFS on Linux, versus Oracle ZFS. The target audiences of these are just very different. The target audience of the Oracle ZFS is probably people that have been locked in by Oracle. And if they could get out – you know, it’s not about which one is better, it’s just like, “Can I escape the clutches or not?”

Yeah. Well, the good thing is that you are continuing development. We’ve just speculated about what will be in 3.0. Some of the things that I think are interesting in the maybe category of OpenZFS 3.0 - one, the RAIDZ expansion that we talked about. Another that hit my radar is ZFS on object store; I saw a talk on that from a recent conference, which I thought was pretty cool. It’s ZFS in the cloud, essentially, and I think it’s really interesting to think about different clouds being different vdevs and whatnot. I run macOS as my primary machine, so I’m excited about the opportunities of macOS support in the future. I mean, I’m sure there’s other cool stuff in there, but that’s what hit my radar in terms of, “I can’t wait. Looking forward to 3.0.”

So that’s the good thing, though: it was open sourced, you and others are continuing to develop it, and there’s a community behind this. You’ve got the conference that happens each year, books being written, blog posts… There’s still a lot of momentum behind this project, obviously.

Yeah. We have a ton of people contributing every year. We have our annual conference. We have monthly video calls where we’re talking about new features, getting design reviews, and making sure bugs are being addressed… So the community is very active. If folks would like to participate, you can find info on openzfs.org. We have links to all the videos from past conferences and info on how to join our monthly Zoom meetings.

Cool. Well, Matt, is there anything else that I left unchecked in terms of talking about your career trajectory? The only open question mark I really have, which you can touch on if you’d like, is how you negotiated with Delphix to be able to contract on top of the open source. The reason why I ask that question is less like, are you an amazing negotiator? Probably. But more so, if there’s other devs out there who are thinking, “I want to keep contributing to open source. How do I negotiate with my employer?” Obviously, Delphix appreciates and embraces open source, so maybe developers are already at a place like that. But if they’re at a place where they embrace open source, what are some things they can do, like the things you’ve done, to be able to make room for the give-back and the impact, beyond simply their daily nine to five with their job?

I think that the consulting per se, like getting paid for work is a special case. I would probably focus on how can you contribute to open source as part of your job. And I think that there it’s mainly about making sure that your employer understands what they’re getting out of it. Everybody wants to know what’s in it for them - developers, as well as employers. In my experience, Delphix has been involved with ZFS and OpenZFS for a long time, 10 years or so, and it’s a fundamental technology for our product. So the benefits are–

Super-clear.

Yeah. First of all, we’re using this and we want to make it better, and we want to make it better in the best way. We want to get the contributions from the community, and we want to be able to have other people from the community testing and validating the changes that we’re making.

[01:19:58.28] So that’s just on a very low level. We want our code to work, we want our code to be the best it can be, and to do that, we want to get these contributions from other people. In order to get the contributions from other people easily, we need to upstream our changes so that we don’t have merge conflicts all the time. And we want our changes to be validated and checked and tested by the community. So that’s a very low-level, fundamental, quantifiable benefit to the company.

The next level of benefit is the corporate branding, almost. It makes the company look good when people in the community see, “Oh, Delphix is contributing to OpenZFS. Oh, Delphix is leading OpenZFS. Delphix is helping to organize this conference about OpenZFS”, right? It creates mindshare around, “Delphix is a cool place. Delphix is a cool company. Even if I don’t know anything about their actual product of database virtualization and masking and whatnot, I know that they’re doing this cool open source work, and that makes them seem cool.”

So for our case, our customers, the people that we’re trying to sell to are generally not software developers, so it doesn’t go directly to our bottom line, but there’s a lot of other things that companies do besides just exchange goods and services for dollars, right?

Like recruiting.

Like recruiting. So I would say more than half of the team that I work on of about 10 people is people that I knew from the open source community. And a lot of it was serendipitous encounters, where it’s like I was asking one person, “Hey, we’re looking to hire. Do you know of anybody?” And then somebody else happened to overhear that and be like, “Hey, are you looking to hire? Because I’m interested.”

So in terms of the branding and reputation and whatnot, it’s a lot harder to pin a number directly on it. I think you’re going to have to find somebody within the company that believes in that, because it’s less quantifiable. But at least in my experience, the benefits have turned out to be very real in terms of the reputation within the software engineering community.

Yeah. What about you? How are you feeling about where you’re at with your career and what you’re working on? Any closing thoughts on like, are you winded with ZFS? I mean, 20 plus years so far with– I mean, I’m just going to imagine you eat, sleep and breathe, work-wise, ZFS, to some degree. Are you burnt on it? Are you done with it? Or are you more motivated than ever?

To be honest, ZFS is getting to be older, right? I mean, 20 years is a long time, even for enterprise software, and I think that it can be a challenge to remain relevant as things change within the industry… You know, first we had the challenges of SSDs with very different characteristics, then virtualization changing where the storage hardware fits into the stack, and now, with the cloud, even more separation between the storage hardware and the actual use of it. So I think it could be a little discouraging, but to me, the project that we’re working on now with ZFS on object storage has just been incredibly fun, and I feel like we’re taking ZFS to the next– we’re giving it some more legs that’ll keep it relevant for another decade. And it isn’t something that’s going to be used by every ZFS user today, but it’s going to enable a lot more ZFS users in the future by making ZFS integrate even better into the cloud, bringing those capabilities of snapshots, compression, all that stuff, along with good performance, to object storage.

[01:23:56.10] I’ve really been having a blast the past year with the team, developing and designing that. A lot of the code is actually in userland, written in Rust, so we all learned Rust, which is really exciting… It makes me never want to touch C again, even though it is my job to do so, so I’m going to do it. But Rust - it feels so comforting now that I’ve learned it. The safety of it feels very comforting, and it makes dealing with raw pointers everywhere in C feel scary, as it should be. I would say it should feel scary. It is hard. You’ve got to get everything just right with C in order to not have bad bugs. It’s more work, but that’s fun work, too. I see ZFS continuing to be relevant, because we’re adding these new use cases to it, and I find that really exciting.

On the Rust note, what made the team choose Rust? Was it because it’s on the network?

Yeah. So first, we chose userland before Rust. So we need ZFS to talk to the object store. That means it needs to talk HTTP, and HTML, and JSON, and all this stuff. And we did not want to do all of that in the kernel, so we decided, “Okay, we’ll have some userland process that the kernel is going to talk to, to say ‘Get this block, read this block, write that block’, and then this userland process is going to deal with turning that into S3 requests.”

And once we had done that, then we thought, “Well, C is not the– there are languages that are higher level than C that could make our job easier”, and so we looked around at what the options were there. I didn’t do a comprehensive survey of every possible language, but Rust seemed so similar to C in terms of being a low-level language; there aren’t scary things like garbage collection… You know, Java may have been another choice, especially given that the rest of Delphix’s software is written in Java, so in-house we have a bunch of Java developers… But on the performance aspects of it, we felt more confident that we would be able to get all of the performance out of the hardware with a low-level language like Rust. And then having the ecosystem of all the Rust crates would let us develop it faster. And the safety of not having memory corruption would also let us develop it faster, because we wouldn’t have as many crazy bugs to debug.

Interesting. Well, cool. I’m sure we can probably do a whole entire separate segment that’s deeper than that answer there on Rust, because that’s always interesting to be like – well, you developed most of this in C, or all of it in C, so why would you choose Rust in the userland part of that? I’m always curious about those questions, but…

Yeah. I mean, C would’ve been the natural choice. I’m sure that there’s libraries that we could have found for C to do all of the network communication, JSON stuff… But I feel really happy about the choice to use Rust.

Good. Matt, anything else? Anything left unsaid? This is the closing, so any advice for those who are going to pursue a land that has ZFS all over it for them? Maybe they’ve got some spare drives they want to play with, they’ve got a home lab, they got that Plex server that’s still clunking on an old Mac mini or something like that and they want to move it to a Linux box with ZFS, whatever? What kind of advice you’ve got, closing thoughts?

I would say, just go for it. I mean, if you like tinkering, then I would just install Ubuntu or another OS that has ZFS in it, maybe FreeBSD, and start running zpool create and whatnot. If you want to use ZFS but don’t necessarily like tinkering in the internals of everything, then a more packaged solution like FreeNAS would be another good option.
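
Something like this is enough to get going (the device names are just examples; use your own spare disks, ideally referenced by their /dev/disk/by-id paths):

  # create a mirrored pool named "tank" from two spare drives
  sudo zpool create tank mirror /dev/sdb /dev/sdc
  # create a dataset for media, with compression turned on
  sudo zfs create -o compression=lz4 tank/media
  # check pool health
  zpool status tank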

Yeah. You’ve got a good interface for doing most of the work. You don’t have to do any of the command line stuff at all, really; it’s all just a UI for it, which can be nice. I prefer the terminal when I manage ZFS, personally. I feel like I can actually feel the heartbeat of the software, rather than some UI trying to tell me what to do. I could just understand it more. Once I moved to the terminal to mess with ZFS, I felt a lot better. So that’s my take on it, at least.

Yeah. I love that as well. I know some people don’t necessarily take the same joy we do from feeling the heartbeat of their software running… So I’m glad that there are more packaged, guided solutions as well.

Well, Matt, it’s been a pleasure talking to you through your software career, OpenZFS, the future and the past of ZFS itself… I really appreciate it. Thank you so much for your time and I really appreciate you. Thank you.

Thanks for having me.

