zpool status exits with status 0 (true), even on a completely faulted pool. This means that automated monitors and alerts are forced to rely on parsing text (which can change without warning) to evaluate pool health.
root@locutus:/data/test# zpool status test && echo $?
pool: test
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0 in 0h0m with 0 errors on Thu May 19 16:30:13 2016
config:
    NAME        STATE     READ WRITE CKSUM
    test        DEGRADED     0     0     0
      mirror-0  DEGRADED     0     0     0
        nbd0    UNAVAIL      0     0     0  corrupted data
        nbd1    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0
errors: No known data errors
0
In the above example, a DEGRADED pool's status still returns 0.
root@banshee:~# zpool status -x test ; echo $?
pool: test
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://zfsonlinux.org/msg/ZFS-8000-HC
scan: scrub in progress since Thu May 19 16:45:22 2016
187M scanned out of 187M at 2.13M/s, 0h0m to go
0 repaired, 99.87% done
config:
    NAME        STATE     READ WRITE CKSUM
    test        UNAVAIL      0     0     0  insufficient replicas
      mirror-0  UNAVAIL      0     0     0  insufficient replicas
        nbd0    UNAVAIL      0     0     0  corrupted data
        nbd1    UNAVAIL      0     0     0  corrupted data
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0
errors: 1539 data errors, use '-v' for a list
0
And here we see even a completely UNAVAIL pool _still_ returning 0 from a status check.
I would really, really like to see zpool status returning a parseable exit code. An additional option for text output in a stable format designed for machine parsing (as well as predictable exit codes) would be even better.
Out of curiosity I tried to reproduce the same problem on illumos. Its implementation of zpool status seems to behave the same way ZoL does:
[root@smartos]# zpool status -x dozer && echo $?
pool: dozer
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://illumos.org/msg/ZFS-8000-HC
scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 20 10:38:59 2016
config:
    NAME                      STATE     READ WRITE CKSUM
    dozer                     UNAVAIL      0     0     0  insufficient replicas
      mirror-0                UNAVAIL      0     0     0  insufficient replicas
        9499654404053574007   UNAVAIL      0     0     0  was /dev/dsk/c4t0d0s0
        c3t0d0                REMOVED      0     0     0
      mirror-1                DEGRADED     0     0     0
        c2t0d0                ONLINE       0     0     0
        15981958288646757095  FAULTED      0     0     0  was /dev/dsk/c5t0d0s0
errors: No known data errors
0
This means that automated monitors and alerts are forced to rely on parsing text (which can change without warning) to evaluate pool health.
The latest FreeNAS release does exactly that: https://github.com/freenas/freenas/blob/FN-9.10-RELEASE/gui/middleware/notifier.py#L5660
I don't know if we could just implement this and call it a day; maybe it should be discussed with the other OpenZFS members.
This is actually by design. The exit code indicates that the command returned without error, not that the pool is healthy. You're going to need to parse the output to determine the status; this should get easier when the JSON support in #3938 is finalized. We could consider adding another command line option or a new sub-command to change this behavior, but I'd rather not change the expected, long-standing default behavior.
The problem I have with this is that the output format changes without
warning. In fact this has already happened once - last year I woke up to
100 bogus critical pool health warnings from my nagios network because of a
change in the text of zpool status after an automatic upgrade. :-\
Then let's add a reliable interface. That could be the JSON output, which is structured and won't change, or something else.
I would like this for the same reason. I also use nagios to monitor my servers.
Try: zpool get health
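For monitoring, the health property is easier to consume than the zpool status text. The following is a minimal sketch, assuming the installed zpool get supports the scripted-mode flags `-H` and `-o name,value` (check the usage message on your platform, since older releases may lack them). The function name `unhealthy_pools` is hypothetical and not part of ZFS.

```shell
#!/bin/sh
# Print every pool whose health is not ONLINE; return non-zero if any exist.
# Reads "name<TAB>health" lines on stdin, so it can be exercised without a real pool.
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 ": " $2; bad = 1 } END { exit bad }'
}

# Assumed usage on a live system (flags are an assumption, see above):
#   zpool get -H -o name,value health | unhealthy_pools
```

Because the function only consumes tab-separated text, the same filter could be fed from any other source that yields pool names and health values.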
If you have the pool name, this is even simpler: zpool list -H -o health pool
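Since the thread mentions nagios, here is a rough sketch of how the one-line health value could be turned into an exit code a monitor can act on. The mapping of health states to severities is an assumption made for illustration, not something ZFS defines, and the function names `health_to_exit_code` and `check_pool` are hypothetical.

```shell
#!/bin/sh
# Map a pool health string to a Nagios-style exit code:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# The severity assigned to each state is an illustrative choice.
health_to_exit_code() {
    case "$1" in
        ONLINE)                          return 0 ;;
        DEGRADED)                        return 1 ;;
        FAULTED|UNAVAIL|OFFLINE|REMOVED) return 2 ;;
        *)                               return 3 ;;
    esac
}

# Query one pool and print its health; the exit status reflects pool state.
check_pool() {
    pool=$1
    # If zpool itself fails (for example, no such pool), report UNKNOWN.
    health=$(zpool list -H -o health "$pool" 2>/dev/null) || health=UNKNOWN
    echo "ZPOOL $pool: $health"
    health_to_exit_code "$health"
}

# Assumed usage: check_pool test; echo $?
```

Keeping the mapping in its own function means the severities can be adjusted without touching the part that calls zpool.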