ZFS: feature request: allow unloading a suspended pool from memory without exporting first

Created on 8 Oct 2016 · 25 comments · Source: openzfs/zfs

When a pool encounters I/O errors (because of a software bug or a hardware issue/vdev gone missing), ZoL suspends the pool until it is brought back to a functional state.

Most of the time this is impossible and the user must hard reboot the server (hardware button or echo b > sysrq-trigger).

The forced reboot does not export the pool but simply decouples the ZFS code from the storage devices.

Please add this feature natively so I can "unload" a pool without committing the export operation to disk, and then be able to unload the ZFS modules and try the latest git master version (for example) without having to reboot the OS.
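For context, a rough sketch of the dead end being described, with a hypothetical pool name (exact error text varies by version, and there is currently no command that drops a suspended pool without touching the disks):

```
# Once a pool is suspended, the usual teardown paths are stuck:
zpool export -f tank            # fails or hangs: pool I/O is currently suspended
modprobe -r zfs                 # fails: the zfs module is still in use
echo b > /proc/sysrq-trigger    # only remaining option today: immediate reboot
```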

Labels: Blocked, Understood, Work in Progress, Defect, Feature

All 25 comments

This could resolve #4003, #2023, #2878, #3256.

Shutdown hangs for me even when a non-root pool gets suspended.

Pinging @behlendorf.

Note that in #3256, even if you bring back the device, the pool remains stuck in the suspended state.

If someone is interested in working on this, by all means let me know.

I did a bit of research into un-suspending a pool. One thing that seems to cause problems is using volatile names for the vdevs, such as /dev/sdX. If a pool is created as raidz sdb sdc sdf and is suspended by removing sdb and sdf, and the bus is then rescanned so the disks come back, the pool won't unsuspend following a zpool clear (and/or zpool reopen) if they come back as different device nodes. Using stable names, such as those from /dev/disk/by-id, appears to work just fine.
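A minimal sketch of the difference, with hypothetical pool and disk names (the by-id identifiers are made up):

```
# Pool built on volatile names -- may fail to unsuspend if the nodes shift:
zpool create tank raidz /dev/sdb /dev/sdc /dev/sdf

# Pool built on stable names -- unsuspends after the devices reappear:
zpool create tank raidz \
    /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B /dev/disk/by-id/ata-DISK_C

# After the bus rescan brings the disks back:
zpool clear tank    # and/or: zpool reopen tank
```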

I realize the feature being requested here is to "abandon" a suspended pool, but I figured I'd point out the issue with the vdev names. It seems the biggest issue in being able to abandon a suspended pool is unmounting its filesystems and dealing with any processes which may have its zvols open.
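A rough way to see what such an "abandon" would have to tear down today, using generic tooling (pool, dataset, and zvol names are hypothetical):

```
# Datasets of the suspended pool that are still mounted:
zfs list -H -o name,mountpoint,mounted -r tank

# Processes keeping a mountpoint or a zvol busy:
fuser -vm /tank/data
fuser -v /dev/zvol/tank/vol1
```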

I'm commenting here because this issue appears to be the best defined of the bunch of related issues, some of which have been referenced here earlier.

After some private correspondence with others, I've seen two cases arise. The first, and best defined, is a pool which has been "cleanly" suspended via zio_suspend(). The second, and more difficult to define, is a pool that has accumulated enough checksum errors, typically without enough pool redundancy to correct them, that a VERIFY fails and panics one of ZFS's threads, typically the sync task.

As an example of the second case, a pool with no redundancy had a spacemap in which one of its blkptr's 3 ditto blocks were improperly written by the storage substrate as zeroes. This caused checksum errors the next time it was read and ultimately resulted in a panic when dmu_buf_hold_array(), called from dmu_write(), returned EIO and tripped a failed VERIFY. The events leading up to the panic were:

  • FM_EREPORT_ZFS_DATA
  • 3 instances of FM_EREPORT_ZFS_CHECKSUM, one for each ditto block
  • 3 more instances of FM_EREPORT_ZFS_CHECKSUM, one for each ditto block

That set of 4 events was repeated one more time prior to the panic.
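For reference, these ereports can be inspected from userland with `zpool events` (the pool name here is hypothetical):

```
# Verbose dump of recent events, including the ereport.fs.zfs.data and
# ereport.fs.zfs.checksum entries corresponding to the FM_EREPORT_* list above:
zpool events -v tank
```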

The feature desired, as indicated in this issue, is to unload (I've been using the term "evacuate") a pool in either of these states to avoid rebooting. Presumably, the faulted pool is some sort of back-up or is otherwise not critical to the system's operation but it does continue to consume resources and can also cause processes to block, etc.

@behlendorf I plan on looking into this, beginning with what seems to be the more straightforward case of a pool that's simply been suspended. The second case is likely to be trickier and, in thinking about it, if the system thinks a pool is toasted that badly (and in the case I described, the pool _was_ toast, although likely mostly recoverable with some careful gymnastics), it would argue for a non-panicking mode for non-debug builds in which the pool is put into some other type of suspended state from which the only option is to evacuate it.

@dweeezil that's great. The suspended pool case is definitely something which can be handled better and should be relatively straightforward. It should largely be a matter of tearing down everything cleanly and returning errors to any active ZFS consumers.

As for the second case, if a VERIFY or ASSERT is hit, that is by design intended to be fatal and unrecoverable. For cases where it's possible to hit a VERIFY due to an I/O error, we're going to need to replace that VERIFY with proper error handling code. So, for example, dmu_write() will need to be updated to return an error code and all of its callers updated to handle it correctly.

I did a bit of research into un-suspending a pool

@dweeezil is there any hope for a functional PR in the not-so-distant future? :-)
Especially for the second case you described :D

@mailinglists35 Sorry, but this issue seems to have resisted percolating sufficiently high on my to-do list to get the attention it deserves. Are you mainly interested in the case where a VERIFY is hit and causes a panic? Your original report didn't have any zpool status or syslog output. I will try to give this and the referenced issues a fresh look in the next couple of days.

@dweeezil Are you mainly interested in the case where a VERIFY is hit and causes a panic? Your original report didn't have any zpool status or syslog output

Sorry, I assumed referencing an example issue in the first comment after filing the issue would be enough.
I am interested in all the cases where zpool export is unable to return success, regardless of the cause. The desired result is to be able to exit the situation (forcibly "drop" the pool in the same way as if I had pressed the hardware reset button) without a reboot.

Here is a relevant one from those enumerated initially, containing zpool status and dmesg

https://github.com/zfsonlinux/zfs/issues/3256

also (no logs, just status) https://github.com/zfsonlinux/zfs/issues/3461

I think this comment summarizes the feature request in clearer language:

"@gordan-bobic commented on Dec 15, 2016
There really needs to be a way to instruct ZFS to throw away any and all dirty data and forget that the pool was ever here without rebooting the machine. Leaving a pool in a hung state with the disk removed is of no practical use. If there is risk of trashing the pool, so be it, but that risk doesn't seem any different from what happens if you reboot the machine, which is currently the only option anyway."

https://github.com/zfsonlinux/zfs/issues/3461#issuecomment-267311909

Hi @dweeezil
Just a kind ping in an attempt to bump this a bit higher on the priority list :D :)

related comment

@mailinglists35 At the moment, I'm trying to get the device evacuation code to a point where it can be merged. It's been merged upstream but there are 2 lingering issues with ZoL. Other than the fact it's a killer feature, one of my main bits of interest is that it's a prerequisite to the overhauled spa import code which will ultimately allow a lot of interesting things to be done with different types of vdevs. After that's done, I really want to get back to getting the TRIM patch set merged (which has been languishing upstream as well for a very long time).

As to this issue, I'll try to dig up my WIP branch in which I was working on it earlier. The enhanced deadman code may well help matters here. Speaking of which, the new deadman code (just committed to master on Jan 25, 2018) has zfs_deadman_failmode=continue, which may very well allow recovery from many cases in which dodgy hardware or other related issues would have caused a pool to become suspended in the past.
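For anyone who wants to try that knob, it is a regular module parameter; a sketch (the documented values are wait, continue, and panic, with wait being the default):

```
# Check the current failmode:
cat /sys/module/zfs/parameters/zfs_deadman_failmode

# Attempt to recover hung I/O by re-dispatching it instead of waiting forever:
echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode

# Make the setting persistent across reboots:
echo "options zfs zfs_deadman_failmode=continue" >> /etc/modprobe.d/zfs.conf
```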

Thank you! Do you think @sanjeevbagewadi's diff can be used/integrated on your branch, or is it unrelated to this issue?

@dweeezil also, do you think this is challenging enough that it will only see the light of day at the time of the 1.0.0 release?

@mailinglists35 As you've likely noticed, I'm still grinding away on the device evacuation code. It does sound like @sanjeevbagewadi may be working on something similar based on the commentary in #6649 in which pool export is mentioned.

I just did a bit more looking around at some of the underlying issues. Among plenty of other things, the whole code base right now pretty much assumes that zio_wait(), dmu_tx_wait() and others will eventually return so that progress can be made. Some of the callers to the former actually VERIFY that it returns zero, others simply ignore the return value. The latter function is already void. Every call point to these functions (and probably others) would need to have some error checking added and be modified to unwind gracefully. Essentially, we'd have to have a mode in which the whole spa syncing process would be allowed to fail, and then every single waiter would need to be unwound in some way. I've not even really thought about the zio pipeline, which might have thousands or tens of thousands of outstanding requests. Then, of course, there are all the resources that would need to be freed in a state where freeing them is not currently anticipated.

In other words, this sounds like a substantially more difficult job than I thought at first. I'm not saying it's impossible, but in the context of the way in which the code is currently structured, it's not going to be easy. One starting point, it seems to me, would be to have a mode in which attempts to access a suspended pool would simply fail immediately. That alone would eliminate a whole lot of contention caused by blocked processes.

Of course, it all works this way because it was anticipated that one would ultimately recover from a suspended state, and, BTW, that type of recovery does seem to work quite well nowadays. I was able to unplug a USB stick containing a pool, jam the system up with lots of blocked processes, re-insert the stick, do a zpool clear, and it all started humming along nicely and eventually unblocked everything.

It almost seems that the use case for this feature is geared toward pools with no redundancy. That said, a simple mirror with a currently undetected checksum error on one child vdev would have a problem if the rest of its child vdevs were currently inaccessible. I'd love to give this a closer look, however, it seems that other things always come up and, at least for myself, it's not a killer feature.
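A rough reproduction of that recovery test, with hypothetical device and pool names:

```
# Throwaway pool on a USB stick:
zpool create usbpool /dev/disk/by-id/usb-STICK_1234

# Generate writes, then physically pull the stick; the writer blocks in
# D state and zpool status reports the pool as suspended.
dd if=/dev/zero of=/usbpool/fill bs=1M count=1024 &

# Re-insert the stick, then:
zpool clear usbpool    # I/O resumes and the blocked process continues
```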

This is unfortunate. I still run into this issue fairly regularly. A zpool clear has helped once or twice, but most of the time I still get a "cannot clear errors for [poolname]: I/O error". It's frustrating that a hard reboot is my only way to recover from this.

Ditto here, this feature is much needed. Please see my comments on our specific use case here: https://github.com/zfsonlinux/zfs/issues/3461#issuecomment-379474127

Suffering from this a lot with my ZFS-on-LUKS-on-USB setup; there should be a way to re-add or redetect drives automatically.

Suffering from this a lot with my ZFS-on-LUKS-on-USB setup; there should be a way to re-add or redetect drives automatically.

I have an external USB drive that is powered from an unreliable source.
Device mapper is what helps me reconnect the drive to the pool without a reboot.

This is how I do it:
https://gist.github.com/mailinglists35/65cf2f165f543243157c2aa573e75a49#gistcomment-3016376

You would have to add a Linux device-mapper layer between LUKS and the physical device (I think you can do this without recreating your pool).
Also, in your case you would have to re-set up the LUKS mapping, I guess, instead of VeraCrypt like I have.

The magic is in being able to replace the physical device with the error dm target, which instantly kills any outstanding I/O (processes then exit the D state) - https://wiki.gentoo.org/wiki/Device-mapper#Error
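A minimal sketch of that kind of device-mapper shim (not the exact contents of the gist; the disk, pool, and mapping names are made up):

```
# Size of the backing disk in 512-byte sectors:
SECTORS=$(blockdev --getsz /dev/sdb)

# Normal operation: a 1:1 linear mapping that the pool is built on top of.
echo "0 $SECTORS linear /dev/sdb 0" | dmsetup create tank-shim
zpool create tank /dev/mapper/tank-shim

# Disk vanished and the pool suspended: swap in the error target so every
# outstanding request fails immediately instead of hanging in D state.
dmsetup suspend --noflush tank-shim
echo "0 $SECTORS error" | dmsetup load tank-shim
dmsetup resume tank-shim

# Disk is back (possibly as a different node): restore the linear mapping.
dmsetup suspend tank-shim
echo "0 $SECTORS linear /dev/sdb 0" | dmsetup load tank-shim
dmsetup resume tank-shim
zpool clear tank
```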


With LUKS it looks like it is even easier.

Anyway, I fixed the USB disconnections with this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1584557
Looks like a kernel problem.

Wondering if this old issue might get some attention, dealing with it again today.
