Changelog Interviews #475

Making the ZFS file system

This week Matt Ahrens joins Adam to talk about ZFS. Matt co-founded the ZFS project at Sun Microsystems in 2001. And 20 years later Adam picked up ZFS for use in his home lab and loved it. So, he reached out to Matt and invited him on the show. They cover the origins of the file system, its journey from proprietary to open source, architecture choices like copy-on-write, the ins and outs of creating and managing ZFS, RAID-Z and RAID-Z expansion, and Matt even shares plans for ZFS in the cloud with ZFS object store.



2022-01-24T18:33:09Z

<sarcasm>“No one has yet to be sued for using ZFS on Linux” is exactly the kind of assertion my employer needed to authorize us to use ZFS</Sarcasm>

And now seriously - the one thing that was troubling me about this interview is the complete disregard of any possible existence of alternatives to ZFS: “any other alternative for mass storage is so complicated”, “ZFS is the only system that does copy on write”, “the features are unique”, etc.

Discussions with ZFS developers and enthusiasts - and this one was no exception - always lack humility. If you listen to interviews with developers of other CoW file systems (yes - there are multiple others) you always hear a discussion of what others are doing in this space - including ZFS - and what the differences are, but in ZFS-specific discussions it’s always either “ZFS is the only volume manager CoW file system” or “there are others but they are not worth discussing”.

For example - I use BTRFS instead of ZFS. It doesn’t have all the features that ZFS does - but it has others that ZFS doesn’t have and that are more suited to my use case (a large file system for a home server with relatively high disk turnover) - for example: remove a disk to take it away; replace a disk with a larger capacity disk. BTRFS also has a simpler setup and UI - you don’t need to figure out vdevs, pools and z file systems - you just create it on one drive and then add more drives (and remove them too). And you still have volumes (what ZFS calls z file systems), CoW, checksumming, transparent compression, no-cost snapshotting, raid modes (though granted nothing like raid-z, which is pretty cool), clone, remote send/receive and other bells and whistles (it is also fast - it is the fastest CoW file system).

Finally, I would like to take umbrage with Matt Ahrens’ assertion that there are only 2 types of data - discardable and data you care about (where it is implied that you mustn’t lose a single bit): it is actually a spectrum - some data is so unimportant that we call it a cache and some is so important that I can’t even trust it to raid-z2 and I will back it up to two different off-site locations, and there is also a lot in between where the required survivability is balanced with the cost, some examples:

  • A copy of production data for use in development - losing it means spending a few weeks collecting it back, but the devs will have other things to do in the meantime.
  • Important data that is already backed up properly elsewhere - if it dies it takes minutes to a few hours to recover.
  • A movie collection - worst case scenario we can torrent the lost videos and/or re-buy them.

Because of checksumming and SMART alerts, most times when a drive starts failing you’d be notified ahead of time and can replace it with minimal loss of data and time, and you can balance that cost against the extra cost of managing ZFS (compared to alternatives).

For a ZFS favorable comparison with BTRFS - that still acknowledges BTRFS wins, you may want to read this:

2022-01-24T20:13:54Z

Thanks for taking the time to listen. To address a few of your “serious” ;-) points:

Sounds like you want to hear more about other similar filesystems like BTRFS. Maybe you’re right that I should take more time to educate myself on what people love about it. I really would like to see other storage technologies bring new cool ideas and succeed - there’s room for lots of innovation and lots of complementary (and competing) projects. My bias is towards more enterprise-type deployments where RAIDZ is used almost universally, and we agree that BTRFS doesn’t have a similar capability. But I think that we filesystem writers can learn a lot from databases and other storage technologies that typically run on top of filesystems. The data structures used there are applicable to a lot of filesystem problems (e.g. Log Structured Merge trees, btrees (used to good effect in BTRFS), and b-epsilon trees).

Side note: you mentioned a couple of BTRFS features that ZFS lacks. These might work more smoothly on BTRFS (I don’t know) but ZFS does have similar functionality:

  • “remove a disk to take it away” - If we restrict ourselves to the non-RAID5-like deployments that BTRFS handles (i.e. nonredundant and mirroring only), ZFS can do disk removal; it copies allocated space to the remaining drives. ZFS can’t remove a disk from a RAIDZ group/pool. See the zpool-remove manpage for details; we added this feature in 2016.
  • “replace a disk with a larger capacity disk” - ZFS has been able to do this since the beginning; see the zpool-replace manpage. (But if you have sufficient free space and want to remove the old disk before adding the new disk, you could use “zpool remove” mentioned above.)
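
For readers who want to see roughly what those two operations look like, here is a minimal sketch that only builds and prints the command lines rather than running them. The pool name `tank` and the device paths are hypothetical placeholders, and the helper functions are illustrative, not part of any ZFS tooling. The manpages remain the authority on options and restrictions.

```python
import shlex

# Hypothetical names for illustration only; substitute your own pool and devices.
POOL = "tank"


def replace_cmd(pool: str, old_device: str, new_device: str) -> list[str]:
    """Build the argv for swapping old_device for new_device in the pool."""
    return ["zpool", "replace", pool, old_device, new_device]


def remove_cmd(pool: str, device: str) -> list[str]:
    """Build the argv for removing a device from a non-RAIDZ pool."""
    return ["zpool", "remove", pool, device]


commands = [
    replace_cmd(POOL, "/dev/sdb", "/dev/sdd"),  # swap in a (possibly larger) disk
    remove_cmd(POOL, "/dev/sdc"),               # evacuate and remove a disk
]

# Dry run: print each command instead of executing it.
for argv in commands:
    print(shlex.join(argv))
```

To actually execute a command, the same argv list could be passed to `subprocess.run(argv, check=True)` on a machine with ZFS installed and sufficient privileges.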

That said I would love to learn about your favorite BTRFS-exclusive features. I hear you on the simplicity of configuring simple BTRFS systems; ZFS does require learning some new concepts like pools vs filesystems.

Only 2 types of data - yeah, this is a bit of hyperbole; I appreciate that there’s a spectrum of data importance. The broader point I was trying to make is that filesystems that don’t have built-in data integrity (e.g. checksums, always consistent on disk) can be used successfully with less sensitive data - we don’t need to replace every use of FAT or ext with ZFS or BTRFS. But these features (common to ZFS and BTRFS!) should be considered requirements for data that we care about.
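
As a rough illustration of what a built-in checksum buys, the sketch below stores a checksum next to each block at write time and verifies it on read. This is a toy model under stated assumptions, not how ZFS or BTRFS implement it: the dict layout, the function names `store` and `load`, and the choice of SHA-256 are all made up for the example, and real filesystems keep checksums in their metadata and can also repair from redundant copies.

```python
import hashlib


def store(data: bytes) -> dict:
    """Keep the payload together with a checksum computed at write time."""
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}


def load(block: dict) -> bytes:
    """Return the payload only if it still matches its stored checksum."""
    if hashlib.sha256(block["data"]).hexdigest() != block["checksum"]:
        raise ValueError("checksum mismatch: block is corrupt")
    return block["data"]


block = store(b"family photos")
print(load(block))  # an intact block reads back normally

# Simulate silent corruption on disk by flipping a single bit.
corrupted = bytearray(block["data"])
corrupted[0] ^= 0x01
block["data"] = bytes(corrupted)

try:
    load(block)
except ValueError as err:
    print(err)  # the corruption is detected instead of being returned as data
```

A filesystem without such a check would hand back the flipped bit as if it were valid data, which is the risk being weighed for data that we care about.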
