ZFS: SATA TRIM for vdev members

Created on 12 Mar 2012 · 39 comments · Source: openzfs/zfs

ZoL should issue SATA TRIM commands to the underlying block devices when blocks are freed.

For CACHE devices (unless persistent L2ARC is implemented at some point in the future): on cache device activation (and deactivation, e.g. zpool remove), a TRIM command should free the whole space available on such devices (minus any needed persistent superblocks).

Only then will SSDs be hinted about unused blocks so that their wear-leveling algorithms can function as intended.

Feature


All 39 comments

I believe there was some work for this going on at Illumos. We should check and see what they have done; with luck, we'll just be able to port their changes with some modifications for Linux.

What's the story on this? I badly need to TRIM my underlying device here.

Should be easy for L2ARC devices (just discard the whole device on add/import) and SLOG devices (discard the log when the txg is synced). Not so easy for actual data devices - as I remember from some discussion on the original ZFS mailing list a few years ago, there were serious issues involved. Most importantly, immediately discarding unused blocks could make zpool disaster recovery (i.e. import -F) impossible.
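For the cache/log case, "discard the whole device" reduces to a single block-layer request. The sketch below is a hedged userspace illustration of that request; it is not ZoL code, the device path is a placeholder, and running it destroys everything on the device:

```c
/* Hedged userspace sketch of "discard the whole device", as proposed for
 * L2ARC/cache devices on add/remove. Not ZoL code; /dev/sdX is a placeholder.
 * WARNING: this irreversibly discards every block on the device. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>               /* BLKGETSIZE64, BLKDISCARD */

int main(void)
{
    int fd = open("/dev/sdX", O_WRONLY);        /* the cache SSD */
    if (fd < 0) { perror("open"); return 1; }

    uint64_t size;
    if (ioctl(fd, BLKGETSIZE64, &size) < 0) {   /* device size in bytes */
        perror("BLKGETSIZE64"); close(fd); return 1;
    }

    uint64_t range[2] = { 0, size };            /* { offset, length } */
    if (ioctl(fd, BLKDISCARD, range) < 0)       /* same request blkdiscard(8) issues */
        perror("BLKDISCARD");

    close(fd);
    return 0;
}
```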

What about batched discard? ZFS can know which areas of the disks are not used by a process of elimination, so it should at least implement what the fstrim command needs to do its job.

It should be possible to implement using the free space map. The only problem is that no one has time to do it.

One of the FreeBSD developers has developed this:

http://people.freebsd.org/~pjd/patches/zfstrim.patch

He told me that it is meant for single SSD pools. We only need to import it to gain TRIM support.

This patch does seem to implement the desired feature, and it even delays discards by some TXGs to avoid breaking zpool import -F, so it should be good. It's also quite complex and intrusive, though. That is not an issue per se, but we should keep in mind that any bugs in this kind of code could have very dangerous consequences (i.e. data loss), especially if we accidentally issue a discard on real, non-freed data. This needs careful review and heavy, thorough testing. I'm especially interested in how the author made sure to prevent some potentially devastating race conditions, like, say, discarding a range while it's being written again at the same time (for example due to command reordering in the block device queue).

So, is there any discussion on this patch somewhere? Code review? Test results? Googling "zfstrim.patch" doesn't seem to yield anything. Quite frankly, merging this patch without extensive code review and testing would be pure madness IMO.

Also, this patch doesn't discard whole cache and log devices on export/import, although that's easy to implement separately (and much less dangerous).

@dechamps It was a funded project in FreeBSD. pjd told me that the people who funded it have put it into production use. You can email him @freebsd.org with questions.

By the way, do you intend to try porting this? It is on my to do list, although you probably would get to it sooner than I will.

> @dechamps It was a funded project in FreeBSD. pjd told me that the people who funded it have put it into production use. You can email him @freebsd.org with questions.

Mail sent.

> By the way, do you intend to try porting this? It is on my to do list, although you probably would get to it sooner than I will.

I'm not quite sure yet; it depends on what my workload will be in the coming week. Anyway, as I understand it, the patch is very easy to port: it seems the only thing to do is to replace FreeBSD's BIO_DELETE operation with Linux's BLKDISCARD.
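For reference, the user-visible Linux counterpart of a per-range delete is the BLKDISCARD ioctl (in-kernel vdev code would submit a discard request to the block layer rather than call an ioctl, but the effect on the SSD is the same). A hypothetical helper, not taken from either patch:

```c
/* Hypothetical discard_range() helper, not from the FreeBSD or ZoL patches:
 * discard one freed byte range on an already-open block device. In-kernel
 * code would submit a discard request instead of using this ioctl. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>                         /* BLKDISCARD */

static int discard_range(int fd, uint64_t offset, uint64_t length)
{
    uint64_t range[2] = { offset, length };   /* byte offset, byte length */
    return ioctl(fd, BLKDISCARD, range);      /* 0 on success, -1 + errno on failure */
}
```

Kernel versions differ in how discards are submitted internally, so treat this purely as an illustration of the semantics.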

Here's pjd's answers to my questions:

I told him that this is mostly for SSD-only pools, not single-SSD pools. The customer I prepared this for is using this with pools that contain multiple SSDs.

Broadly speaking, how does your code work?

The code builds a map of regions that were freed. On every write, the code consults the map and removes any ranges that were freed before but are now being overwritten.

Freed blocks are not TRIMmed immediately. There is a tunable that defines how many txgs we should wait before TRIMming freed blocks (64 by default).

There is a low-priority thread that TRIMs ranges when the time comes. During a TRIM we keep in-flight ranges on a list to detect colliding writes - we have to delay writes that collide with in-flight TRIMs in case something gets reordered and the write reaches the disk before the TRIM. We don't have to do the same for in-flight writes, as a colliding write simply removes the range to be TRIMmed.

What steps did you take to prevent potentially dangerous race conditions from happening (see my comment)?

The race you were talking about is handled by keeping a list of in-flight TRIMs, so we can detect that a colliding write is happening and delay that write until the TRIM is done.

How has the code been tested?

It was and still is being tested by my customer. Some bugs were found and fixed.

Has the code been reviewed by anyone (especially experienced ZFS developers)?

No. At least not yet. The code is still pretty fresh.

Do you trust the code to be reliable and safe to use in production environments?

From what I know, my customer has already put it into production, or at least started some production tests on a limited number of machines. I agree the code could use some independent review and testing in different environments. For example, it hasn't been tested with pools that contain log vdevs.
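To make the bookkeeping pjd describes above concrete, here is a simplified, single-threaded C sketch of the three rules (delay writes that collide with in-flight TRIMs, cancel pending TRIM ranges that get overwritten, and only issue TRIMs for ranges freed at least N txgs ago). It is not the FreeBSD code; all structure and function names are invented for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct trim_range {
    uint64_t offset;                /* byte offset of the freed extent */
    uint64_t length;                /* byte length of the freed extent */
    uint64_t freed_txg;             /* txg in which the extent was freed */
    struct trim_range *next;
};

/* Do two byte ranges overlap? */
static bool ranges_overlap(uint64_t a_off, uint64_t a_len,
                           uint64_t b_off, uint64_t b_len)
{
    return a_off < b_off + b_len && b_off < a_off + a_len;
}

/* Rule 1: a write that collides with an in-flight TRIM must be delayed
 * until that TRIM completes, in case the block layer reorders them. */
static bool write_must_wait(const struct trim_range *inflight,
                            uint64_t w_off, uint64_t w_len)
{
    for (; inflight != NULL; inflight = inflight->next)
        if (ranges_overlap(inflight->offset, inflight->length, w_off, w_len))
            return true;
    return false;
}

/* Rule 2: a write that collides with a *pending* (not yet issued) TRIM
 * simply removes that range from the map, so it never gets trimmed. */
static void cancel_pending_trims(struct trim_range **pending,
                                 uint64_t w_off, uint64_t w_len)
{
    while (*pending != NULL) {
        if (ranges_overlap((*pending)->offset, (*pending)->length, w_off, w_len))
            *pending = (*pending)->next;        /* unlink (no free() in this sketch) */
        else
            pending = &(*pending)->next;
    }
}

/* Rule 3: only ranges freed at least txg_delay txgs ago (64 by default in
 * the patch) are handed to the low-priority TRIM thread. */
static bool trim_ripe(const struct trim_range *r, uint64_t current_txg,
                      uint64_t txg_delay)
{
    return current_txg >= r->freed_txg + txg_delay;
}

int main(void)
{
    struct trim_range freed = { .offset = 4096, .length = 8192, .freed_txg = 100 };
    struct trim_range *pending = &freed;

    printf("ripe at txg 200 (delay 64)?  %d\n", trim_ripe(&freed, 200, 64));     /* 1 */
    printf("write to [0,4096) must wait? %d\n", write_must_wait(NULL, 0, 4096)); /* 0: nothing in flight */
    cancel_pending_trims(&pending, 4096, 512);  /* overwrite cancels the pending range */
    printf("pending map empty?           %d\n", pending == NULL);                /* 1 */
    return 0;
}
```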

In light of this new information, I think we can proceed with the merge, but maybe we should disable the feature by default (zfs_notrim=1) and expose it as a module parameter marked experimental in the description. @behlendorf: what are your thoughts on this?

I think we should proceed _very very very_ carefully here. We're talking about touching some delicate code with a fairly intrusive change. I know this would be a great feature, but it would be easy to get this subtly wrong.

Why don't we start by porting the FreeBSD patch to a ZoL branch. Then those who are interested in this feature can get some serious testing on the code before we merge it in. I'd rather not risk destabilizing master until we are more confident about this change.

I think that is very reasonable.

On Wed, 2012-07-18 at 12:55 -0700, Brian Behlendorf wrote:

> I think we should proceed _very very very_ carefully here. We're talking about touching some delicate code with a fairly intrusive change. I know this would be a great feature, but it would be easy to get this subtly wrong.
>
> Why don't we start by porting the FreeBSD patch to a ZoL branch. Then those who are interested in this feature can get some serious testing on the code before we merge it in. I'd rather not risk destabilizing master until we are more confident about this change.



I agree. I'm still not sure I'll have the time to come up with the branch though, so if someone wants to jump in, go ahead. Like I said, this should be fairly easy to port.

I am in agreement with everyone else. My time is being spent on other things though. If someone does the initial port, I am willing to field test it on my desktop. The worst case is that I will have to restore it from a backup.

My laptop is willing to test the patch; I do regular backups.
The pool on my laptop has a single vdev, which is a LUKS device over a partition on an OCZ Agility 3.
ZFS-over-LUKS (on an HD that died this afternoon) works beautifully.

I can probably test this too. I, too, have ZFS-over-LUKS on an SSD, fully replicated to my NAS for safety, and I personally hacked the LUKS code to enable discards, so this should be EENTERESTING. Just point me to a branch, I will merge it into a branch of my own systemd-enabled ZFS on Linux, and I'll give it a shot.

On Thu, 2012-07-19 at 12:02 -0700, Massimo Maggi wrote:

> My laptop is willing to test the patch; I do regular backups.
> The pool on my laptop has a single vdev, which is a LUKS device over a partition on an OCZ Agility 3.
> ZFS-over-LUKS (on an HD that died this afternoon) works beautifully.



Here it is: #924. It definitely needs extensive testing, so if you're not afraid to corrupt your pool, go ahead.

Status on this?

Largely done and working, but it needs to be rebased on master and seriously tested. I definitely do want to get it merged before it gets much staler.

Maybe it is time to rebase it? Perhaps a beta/experimental tarball version could be released to engage more people in testing the feature.

I figure I'm far enough along to mention I'm working on porting Nexenta's TRIM support. It uses a rather different approach than the one described earlier in this issue. My plan is to try to get disk and file vdevs working first and then deal with zvols. It's a good ways off yet, but so far it's looking good. I'll post more information to this particular issue as it becomes available.

What do you mean by "deal with zvols"? Discards are already implemented for zvols and have been working for years now - see #334, #384, #1010. Discard on zvol is a frontend feature, which is very different from discard on vdevs, which is a backend feature.

On Jul 24, 2015, at 1:46 PM, Tim Chase [email protected] wrote:

> I figure I'm far enough along to mention I'm working on porting Nexenta's TRIM support. It uses a rather different approach than the one described earlier in this issue. My plan is to try to get disk and file vdevs working first and then deal with zvols.

This makes no sense. Surely you mean initiating TRIM/UNMAP for vdevs?
That is totally different than zvols, as that work is already done.
-- richard

> It's a good ways off yet, but so far it's looking good. I'll post more information to this particular issue as it becomes available.


Let me clarify the part regarding zvols: I want to evaluate whether to try replacing the existing zvol discard code with that from Nexenta. The particular use case for which I was asked to do this work is for file vdevs so I figured on porting the code and testing with file and disk vdevs first and then look at the zvol situation later.

This appears to be coming together rather nicely. So far, vdev file trimming (via hole-punching) appears to be working properly and, as expected, the existing zvol code works just fine (tested with a "discard"-mounted ext4 atop a zvol). I've not dealt with vdev disks yet but am instead running a bunch of stress tests with vdev file-based pools to make sure nothing has broken.

I'll also be porting Nexenta's on-demand trim feature.

The WIP spl code is in https://github.com/dweeezil/spl/tree/ntrim and the WIP zfs code is in https://github.com/dweeezil/zfs/tree/ntrim.
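As mentioned above, "trimming" a file vdev works by punching holes in the backing file so the host filesystem can reclaim the space. The snippet below is only a sketch of the underlying Linux call; the file path and range are placeholders, not code from the WIP branches:

```c
/* Sketch of hole-punching as used for file-vdev "trim": deallocate a freed
 * range in the backing file while keeping the file size unchanged.
 * The path and range are placeholders, not code from the ntrim branches. */
#define _GNU_SOURCE
#include <fcntl.h>                  /* fallocate, FALLOC_FL_* */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tank/images/vdev0.img", O_WRONLY);   /* example file vdev */
    if (fd < 0) { perror("open"); return 1; }

    /* Punch a 1 MiB hole at offset 16 MiB; KEEP_SIZE is required with PUNCH_HOLE. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  16 << 20, 1 << 20) < 0)
        perror("fallocate(FALLOC_FL_PUNCH_HOLE)");

    close(fd);
    return 0;
}
```

On the zvol side, the existing frontend code already translates guest discards (e.g. a "discard"-mounted ext4 atop a zvol) into ZFS frees, which is why it keeps working unchanged here.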

On Jul 26, 2015, at 8:23 AM, Tim Chase [email protected] wrote:

> This appears to be coming together rather nicely. So far, vdev file trimming (via hole-punching) appears to be working properly and, as expected, the existing zvol code works just fine (tested with a "discard"-mounted ext4 atop a zvol). I've not dealt with vdev disks yet but am instead running a bunch of stress tests with vdev file-based pools to make sure nothing has broken.

sigh, so many ZFS ports, so little time :-)

I had not realized that the ZoL code is so far behind illumos in this regard. Yes, absolutely agree with what you are proposing.

FWIW, Solaris 11 has had vdev-level UNMAP/TRIM for years. By default it is disabled. Given the large variance in drive and array performance during UNMAP/TRIM, it is clear that default=enabled is not a good idea. Perhaps some day the drive vendors will get it right...
-- richard

> I'll also be porting Nexenta's on-demand trim feature.

> The WIP spl code is in https://github.com/dweeezil/spl/tree/ntrim and the WIP zfs code is in https://github.com/dweeezil/zfs/tree/ntrim.


Out of curiosity, will there be an fstrim-like (offline) solution available?

The patch set seems to be working at this point. Please see #3656.

@tomposmiko With the on-demand trim feature, zpool trim <pool> does the same thing; however, I'm going to look into adding support for the Linux-standard FITRIM ioctl, which would allow fstrim to work properly.

@tomposmiko I had forgotten how weird FITRIM is. Among other things, it targets a single filesystem, which in ZFS really isn't possible since space allocation is a pool attribute. I think zpool trim ought to suffice.
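For context, this is roughly what fstrim(8) asks the kernel to do: a single FITRIM ioctl against a mounted filesystem, covering that filesystem's free space. Since every ZFS dataset allocates from the shared pool, there is no per-filesystem free-space range for such a call to target, which is the mismatch described above. A hedged sketch (the mount point is a placeholder):

```c
/* Sketch of the FITRIM ioctl that fstrim(8) issues: one call covering the
 * whole mounted filesystem. The mount point is a placeholder. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>               /* FITRIM, struct fstrim_range */

int main(void)
{
    int fd = open("/mnt/somefs", O_RDONLY);   /* any path on the target filesystem */
    if (fd < 0) { perror("open"); return 1; }

    struct fstrim_range range;
    memset(&range, 0, sizeof(range));
    range.len = UINT64_MAX;                   /* consider all free space */
    range.minlen = 0;                         /* no minimum extent size */

    if (ioctl(fd, FITRIM, &range) < 0)
        perror("FITRIM");
    else
        printf("trimmed %llu bytes\n", (unsigned long long)range.len);

    close(fd);
    return 0;
}
```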

@dweeezil Theoretically you could perform a reverse lookup from filesystem to pool(s) and then perform "zpool trim" on the result ...

Would this work on ZFS over LUKS? We would need to enable discard for the LUKS devices of course, but what else would be needed?

@CMCDragonkai I currently run ZFS on LUKS and I just had to open the volume with --allow-discards (or discard when using crypttab). I set autotrim=on on the pool for automatic trimming.

Hi. Can anyone provide some instructions on how to run the TRIM version of ZFS on Gentoo? I urgently need it, my SSD is failing. Is it possible to add the patches to portage or to modify the ebuild file?
@dasJ, how are you running it? Maybe @dweeezil, you can help?
Thank you very much.
P.S. I already posted a question on another thread (Nexenta); not sure if someone will answer soon.

> I urgently need it, my SSD is failing.

I don't believe TRIM will help fix a failing SSD.

If your SSD is failing, as @drescherjm said, TRIM won't help. What you want to do, and urgently, is:
1) Stop writing anything to that SSD.
2) Copy everything off it immediately to safer storage. Start with the most important data first.
3) Replace the SSD with a new one.

Hopefully you are using RAID and backups.

Since this GitHub issue is for the ZFS TRIM feature and not for support, if you need any volunteer assistance with this, talk to me on Twitter at @aw_nz (https://twitter.com/aw_nz).

Thanks for the answers. I understand, but anyway, I have a heavy DB running on it and I don't want to trash a newly bought SSD again. Of course I have backups/snapshots. I just want to enable/test TRIM on ZFS.
@awnz, thanks for your offer to help. I don't have a Twitter account; let's talk in this web chat room:
https://tlk.io/trimzfs - anyone who is interested, please join.

I think these instructions will help other testers/devs to test the TRIM feature, so it can be on-topic here.
Is it enough to git clone the ntrim branch of spl, run ./configure && make && make install, and do the same for zfs with the ntrim-next branch? Will that generate .ko files that I then need to move into the initramfs, or does some cleaner way exist for Gentoo?

Thank you very much.

@behlendorf, @dweeezil: close as a duplicate, because it's been superseded by #5925?

Is pull request #8419 the final one that will actually fix this issue? (I don't understand why this issue was closed last year)

@mailinglists35 Yes, #8419 is the final version. This issue was closed as a duplicate of a previous version of the TRIM work.

