Zfs: zpool status always returns true

Created on 19 May 2016 · 7 Comments · Source: openzfs/zfs

zpool status returns true, even on a completely faulted pool. This means that automated monitors and alerts are forced to rely on parsing text (which can change without warning) to evaluate pool health.

root@locutus:/data/test# zpool status test && echo $?
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Thu May 19 16:30:13 2016
config:

    NAME        STATE     READ WRITE CKSUM
    test        DEGRADED     0     0     0
      mirror-0  DEGRADED     0     0     0
        nbd0    UNAVAIL      0     0     0  corrupted data
        nbd1    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0

errors: No known data errors
0

In the above example, a DEGRADED pool's status still returns 0.

root@banshee:~# zpool status -x test ; echo $?
  pool: test
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: scrub in progress since Thu May 19 16:45:22 2016
    187M scanned out of 187M at 2.13M/s, 0h0m to go
    0 repaired, 99.87% done
config:

    NAME        STATE     READ WRITE CKSUM
    test        UNAVAIL      0     0     0  insufficient replicas
      mirror-0  UNAVAIL      0     0     0  insufficient replicas
        nbd0    UNAVAIL      0     0     0  corrupted data
        nbd1    UNAVAIL      0     0     0  corrupted data
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0

errors: 1539 data errors, use '-v' for a list
0

And here we see even a completely UNAVAIL pool _still_ returning 0 from a status check.

I would really, really like to see zpool status returning a parseable exit code. An additional option for text output in a stable format designed for machine parsing (as well as predictable exit codes) would be even better.

jimsalterjrs

Most helpful comment

Try: zpool get health

richardelling on 20 May 2016

👍2
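A minimal sketch of how that suggestion might be turned into a monitor-friendly exit code. It assumes `zpool get` accepts the scripted-mode flags `-H -o value` on the installed version and that the pool reports one of the usual health values. The function names `health_to_exit` and `check_pool` and the Nagios-style mapping (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are illustrative choices, not anything zpool provides.

```shell
#!/bin/sh
# Map a ZFS health value to a Nagios-style exit code.
# The mapping is an assumption for illustration; adjust it to local policy.
health_to_exit() {
    case "$1" in
        ONLINE)   return 0 ;;                          # OK
        DEGRADED) return 1 ;;                          # WARNING: pool still functioning
        FAULTED|UNAVAIL|OFFLINE|REMOVED) return 2 ;;   # CRITICAL
        *)        return 3 ;;                          # UNKNOWN: empty or unexpected value
    esac
}

# Hypothetical wrapper: query only the health property of one pool,
# print it, and convert it to an exit code.
# If zpool itself fails (no such pool, command missing), report UNKNOWN.
check_pool() {
    health=$(zpool get -H -o value health "$1" 2>/dev/null) || return 3
    echo "$1: $health"
    health_to_exit "$health"
}

# Usage (requires a system with ZFS):
#   check_pool test; echo $?
```

Because only a single property value is read, this kind of check should be less sensitive to changes in the human-readable `zpool status` text, though the set of health strings is still something the script has to trust.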

All 7 comments

Out of curiosity I tried to reproduce the same problem on illumos: its implementation of zpool status seems to behave the same way ZoL does:

[root@smartos]# zpool status -x dozer && echo $?
  pool: dozer
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 20 10:38:59 2016
config:

        NAME                      STATE     READ WRITE CKSUM
        dozer                     UNAVAIL      0     0     0  insufficient replicas
          mirror-0                UNAVAIL      0     0     0  insufficient replicas
            9499654404053574007   UNAVAIL      0     0     0  was /dev/dsk/c4t0d0s0
            c3t0d0                REMOVED      0     0     0
          mirror-1                DEGRADED     0     0     0
            c2t0d0                ONLINE       0     0     0
            15981958288646757095  FAULTED      0     0     0  was /dev/dsk/c5t0d0s0

errors: No known data errors
0

"This means that automated monitors and alerts are forced to rely on parsing text (which can change without warning) to evaluate pool health."

The latest FreeNAS release does exactly that: https://github.com/freenas/freenas/blob/FN-9.10-RELEASE/gui/middleware/notifier.py#L5660

I don't know if we could just implement this and call it a day; maybe it should be discussed with the other OpenZFS members.

loli10K on 20 May 2016

This is actually by design. The exit code indicates that the command returned without error, not that the pool is healthy. You're going to need to parse the output to determine the status; this should get easier when the JSON support in #3938 is finalized. We could consider adding another command line option to change this behavior, or a new sub-command. But I'd rather not change the expected, long-standing default behavior.

behlendorf on 20 May 2016
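Until a machine-readable format is available, one option that avoids parsing the full status text might be the scripted mode of `zpool list`. The sketch below assumes `zpool list -H -o name,health` prints one tab-separated `name<TAB>health` line per pool; `unhealthy_pools` is a hypothetical helper name, and treating anything other than ONLINE as a problem is an illustrative policy.

```shell
#!/bin/sh
# Read "name<TAB>health" lines on stdin and print every pool whose
# health is not ONLINE. Exit status: 0 if all pools are ONLINE, 1 otherwise.
# Lines with fewer than two fields are ignored.
unhealthy_pools() {
    awk -F '\t' '
        NF >= 2 && $2 != "ONLINE" { print $1 ": " $2; bad = 1 }
        END { exit bad + 0 }
    '
}

# Usage (requires a system with ZFS):
#   zpool list -H -o name,health | unhealthy_pools || echo "pool problem"
```

Note that in a plain `sh` pipeline the exit status of `zpool list` itself is lost, so a failure of the command (as opposed to an unhealthy pool) would need to be checked separately.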

The problem I have with this is that the output format changes without warning. In fact this has already happened once - last year I woke up to 100 bogus critical pool health warnings from my nagios network because of a change in the text of zpool status after an automatic upgrade. :-\
