ZFS: Blocked I/O with failmode=continue

Created on 5 Oct 2018 · 14 comments · Source: openzfs/zfs

System information

Type | Version/Name
--- | ---
Distribution Name | Scientific Linux
Distribution Version | 7.5
Linux Kernel | 3.10.0-862.14.4.el7
Architecture | x86_64
ZFS Version | zfs-0.7.11-1.el7_5
SPL Version | 0.7.11-1.el7_5

Describe the problem you're observing

On a zpool with failmode=continue, I/O continues to block, resulting in unkillable application processes.

Describe how to reproduce the problem

zpool create data1 single_HDD
zpool set failmode=continue data1
Start applications performing I/O on the zpool and wait for the HDD to fail.
Attempt to kill the application processes and note that they end up in the Z (zombie) state.
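
If you would rather not wait for a real disk to die, the same suspended state can probably be provoked with zinject; a rough sketch, assuming a single-vdev pool named data1 (the vdev name and the dd workload are placeholders):

# confirm the pool's failmode setting
zpool get failmode data1

# keep some writes in flight (placeholder workload)
dd if=/dev/zero of=/data1/testfile bs=1M oflag=direct &

# inject persistent I/O errors on the pool's only vdev so the pool suspends
zinject -d wwn-0x5000c5009cf653f1 -e io -T all data1

# the pool should now show up as UNAVAIL/suspended
zpool status data1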

Include any warning/errors/backtraces from the system logs

[root@node2126 ~]# zpool list data1
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data1  3.62T   564K  3.62T         -     0%     0%  1.00x  UNAVAIL  -

[root@node2126 ~]# zpool status data1
  pool: data1
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-JQ
  scan: none requested
config:

    NAME                      STATE     READ WRITE CKSUM
    data1                     UNAVAIL      0     0     0  insufficient replicas
      wwn-0x5000c5009cf653f1  FAULTED      6     0     0  too many errors
errors: List of errors unavailable: pool I/O is currently suspended

errors: 7 data errors, use '-v' for a list

After attempting to kill application PID 33345, it is blocked in the zombie state and is holding kernel resources I need to reuse (in particular, a socket).

[root@node2126 ~]# top
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 33345 hdfs      20   0       0      0      0 Z   0.0  0.0   0:06.59 java

[root@node2126 ~]# lsof | awk '$2 == "33345"'
...
java       33345  35895               hdfs  877u     unix 0xffffa06b792d7000       0t0     146793 socket
java       33345  35895               hdfs  878uW     REG               0,41        31          2 /data1/in_use.lock
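
For completeness, a quick way to confirm where the remaining threads of PID 33345 are stuck in the kernel (standard ps/procfs invocations, run as root):

# per-thread state and the kernel function each thread is sleeping in
ps -Lp 33345 -o pid,tid,stat,wchan:32,comm

# kernel stacks of every thread in the task group
cat /proc/33345/task/*/stack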

What I need is for failmode=continue to stop blocking I/O and to allow this process to exit, so I can start another one to manage a replacement disk in a new pool without having to reboot. That is, I don't need to be able to destroy the original zpool, though that would be nice, as noted in other open issues.

Labels: Understood, Defect

Most helpful comment

This is related to the work I'm doing to support the "abandonment" of a pool from which, for example, IO has "hung" because the completions are no longer arriving (due to flaky hardware, bad driver, etc.) and for which I worked up a proof-of-concept at the OpenZFS hackathon this year. This issue is sort-of a different instance of the problem (in which a pool can't be exported).

The work to support abandoning a pool for which IO has hung is going to leverage the similarly-named "continue" mode of the zio deadman. I've got a patch almost ready to post as a PR which fixes some of the problems with zio deadman.

This particular issue will require somewhat different handling but it _is_ something I've planned on addressing as part of the larger "zpool abandon" feature.

All 14 comments

I think this is a dup of #6649

That ticket is for failmode=wait whereas this ticket is for failmode=continue.

yes, but failmode isn't the issue here. The issue is how to remove a suspended pool from the system.

I was hoping that failmode=continue would obviate the need to wait for the enhancement to allow the removal of a suspended pool. My immediate need is to simply return an error and not block. I can live with an unusable suspended pool in the system until I need to reboot for another reason.

The issue seems to stem from failmode=continue not aborting existing write requests (as one might expect), but only new ones. From man zpool:

continue
Returns EIO to any new write I/O requests but allows reads to any of the remaining healthy devices. Any write requests that have yet to be committed to disk would be blocked.

Also, man zpool doesn't specify exactly what happens to reads from unhealthy devices (which, for a suspended pool, can possibly be _all of them_).

Thus, while the behaviour seen by the OP is more or less as documented, failmode=continue is IMHO quite useless when it effectively behaves identically to failmode=wait (hanging, unkillable I/O for requests that were in flight when the suspension occurred). It should be made to cleanly abort _all_ outstanding I/O (writes _and_ reads) that can't complete because the pool went into suspension.
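
To make the distinction concrete: with failmode=continue the man page says a brand-new write should fail with EIO, while anything already in flight when the pool suspended just hangs; whether new I/O actually errors out or also blocks is easy to check on the suspended pool (file names are arbitrary):

# a new write should fail fast with EIO under failmode=continue,
# whereas under failmode=wait it is expected to block
dd if=/dev/zero of=/data1/newfile bs=4k count=1 oflag=sync

# a read of cached data may still succeed; a read that needs the faulted
# disk is the undocumented case discussed above
dd if=/data1/in_use.lock of=/dev/null bs=4k count=1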

Could a general timeout for zios (aborting them with a clean error after a long enough period of inactivity) solve the issue of I/O being stuck in an unkillable state?
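
For reference, the zio deadman mentioned in the next comment is exactly such a timeout. In 0.8.0-era code its behaviour is controlled by module parameters (names from the zfs-module-parameters man page; values below are only illustrative):

# how long a zio may remain outstanding before the deadman fires (ms)
echo 300000 > /sys/module/zfs/parameters/zfs_deadman_ziotime_ms

# what the deadman does when it fires: wait (log only), continue (attempt
# to re-dispatch the hung zio), or panic
echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode

# make sure the deadman is enabled at all
echo 1 > /sys/module/zfs/parameters/zfs_deadman_enabled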

This is related to the work I'm doing to support the "abandonment" of a pool from which, for example, IO has "hung" because the completions are no longer arriving (due to flaky hardware, bad driver, etc.) and for which I worked up a proof-of-concept at the OpenZFS hackathon this year. This issue is sort-of a different instance of the problem (in which a pool can't be exported).

The work to support abandoning a pool for which IO has hung is going to leverage the similarly-named "continue" mode of the zio deadman. I've got a patch almost ready to post as a PR which fixes some of the problems with zio deadman.

This particular issue will require somewhat different handling but it _is_ something I've planned on addressing as part of the larger "zpool abandon" feature.

@dweeezil most excellent! I have a large number of unreliable HDDs in a Hadoop cluster that I would be willing to use to test a ZFS patch when it is available. I am most interested in the ability to optionally not block on "pool I/O is currently suspended"; however, I am also interested in testing the ability to abandon and destroy a zpool without having to reboot. Many thanks for working on this.

I would also like to test this feature. The deadman continue mode helps a lot, but sometimes I lose the connection and can't recover it, and I would like to not have to reboot.

WIP - Fix issues with zio deadman "continue" mode #8021
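
For anyone planning to test that branch once external testing is requested, a generic sketch of fetching a GitHub PR and doing a plain source build (exact configure options, and whether a matching spl build is needed, depend on the branch and distro):

git clone https://github.com/openzfs/zfs.git
cd zfs

# fetch the PR head into a local branch and switch to it
git fetch origin pull/8021/head:pr-8021
git checkout pr-8021

# standard autotools build; packaged/DKMS builds differ per distro
sh autogen.sh
./configure
make -j"$(nproc)"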

@dweeezil do you have a rough estimate on when external testing would be helpful?

Does 0.8.0 change this behavior? Or make it any easier to implement a fix?

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

I'm opposing staleness. This might be old, but it's an issue.

@GregorKopka I've tagged this issue as a defect. I've also added the "Status: Understood" tag which will prevent the bot from marking it again.
