uname -a
CentOS Linux release 7.7.1908 (Core):
[justin@vps-compute0 ~]# uname -a
Linux vps-compute0 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
node_exporter --version
node_exporter, version 1.0.0-rc.0 (branch: HEAD, revision: ef7c05816adcb0e8923defe34e97f6afcce0a939)
build user: root@b38a2df1a38b
build date: 20200220-12:54:05
go version: go1.13.8
node_exporter --collector.mdadm --web.listen-address=:9100
P.S. I tried running with and without --collector.mdadm, however the same issue appeared.
No.
While node_exporter was running, a drive in the software RAID array failed. This affected both of my software raid devices, md126 and md127.
I expected the metrics node_md_disks or node_md_state to change from "active" = 1 to "failed/spare/inactive" = 1.
Both node_md_disks and node_md_state remained with "active" = 1, which meant I was unable to detect a disk failure from the metrics.
mdadm version: mdadm - v4.1 - 2018-10-01
Disk failure in syslog:
Mar 27 15:37:03 vps-compute0 kernel: md: super_written gets error=-5, uptodate=0
Mar 27 15:37:03 vps-compute0 kernel: md/raid1:md126: Disk failure on sdd1, disabling device.#012md/raid1:md126: Operation continuing on 1 devices.
Mar 27 15:37:07 vps-compute0 kernel: md/raid1:md127: Disk failure on sdd2, disabling device.#012md/raid1:md127: Operation continuing on 1 devices.
Unchanged node_exporter metrics before and after the failure:
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md126",state="active"} 1
node_md_state{device="md126",state="inactive"} 0
node_md_state{device="md126",state="recovering"} 0
node_md_state{device="md126",state="resync"} 0
node_md_state{device="md127",state="active"} 1
node_md_state{device="md127",state="inactive"} 0
node_md_state{device="md127",state="recovering"} 0
node_md_state{device="md127",state="resync"} 0
# HELP node_md_disks Number of active/failed/spare disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md126",state="active"} 1
node_md_disks{device="md126",state="failed"} 0
node_md_disks{device="md126",state="spare"} 0
node_md_disks{device="md127",state="active"} 1
node_md_disks{device="md127",state="failed"} 0
node_md_disks{device="md127",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md126"} 2
node_md_disks_required{device="md127"} 2
[root@vps-compute0 ~]# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sdc1[0]
1557438464 blocks super 1.2 [2/1] [U_]
bitmap: 4/12 pages [16KB], 65536KB chunk
md127 : active raid1 sdc2[0]
1047552 blocks super 1.2 [2/1] [U_]
bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>
# mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Thu Feb 13 09:58:31 2020
Raid Level : raid1
Array Size : 1047552 (1023.00 MiB 1072.69 MB)
Used Dev Size : 1047552 (1023.00 MiB 1072.69 MB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Mar 27 15:37:07 2020
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : localhost:boot
UUID : 4b1837c9:82597f5e:7e16e39b:1e2c4cc5
Events : 42
Number Major Minor RaidDevice State
0 8 34 0 active sync /dev/sdc2
- 0 0 1 removed
Please let me know if this information was sufficient, and if you require any additional info.
Thanks for your time!
Hrm. node_exporter usually tries to pass on the information from the kernel as verbatim as possible. The same seems to happen here. I guess the kernel has already removed the failed devices (according to the log message), which is why it doesn't mention them in /proc/mdstat anymore. However, this is also the file which node_exporter (more exactly: procfs) relies on.
The only indicator about the failure I can see in mdstat is the ratio of total devices (2) to active devices (1).
This should map to node_md_disks_required and node_md_disks{state="active"}. The latter one is not included in your pre-failure output. I assume it was listed as node_md_disks{device="md126",state="active"} 2 before the failure, at least this is what my healthy raid1 says and what the code says it should be. Can you confirm?
I guess you should alert on node_md_disks{state="active"} != node_md_disks_required. This should have detected the problem.
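To make the ratio argument concrete, here is a minimal Python sketch that reads the `[2/1]` pair from mdstat text and reports arrays with fewer active than required disks. It is only an illustration of the comparison the alert performs, not how node_exporter or procfs implement it. The function name `degraded_arrays` and the regexes are my own assumptions, and the sample text is the /proc/mdstat output quoted above.

```python
import re

# Matches the "[total/active]" pair on an mdstat status line such as
# "1047552 blocks super 1.2 [2/1] [U_]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]")
# Matches an array header such as "md127 : active raid1 sdc2[0]".
DEVICE_RE = re.compile(r"^(md\d+)\s*:")


def degraded_arrays(mdstat_text):
    """Return {device: (required, active)} for arrays where active < required.

    Hypothetical helper for illustration only.
    """
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = DEVICE_RE.match(line.strip())
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if current and status:
            required, active = int(status.group(1)), int(status.group(2))
            if active < required:
                result[current] = (required, active)
            current = None  # only the first status line after a header counts
    return result


SAMPLE = """\
Personalities : [raid1]
md126 : active raid1 sdc1[0]
      1557438464 blocks super 1.2 [2/1] [U_]
      bitmap: 4/12 pages [16KB], 65536KB chunk
md127 : active raid1 sdc2[0]
      1047552 blocks super 1.2 [2/1] [U_]
      bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

With the sample above, both md126 and md127 are reported as needing 2 disks while having 1 active, which is the same condition the alert expression checks.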
Maybe we could have generic types of alerts for mdadm as follows:
1) critical alert with 'node_md_disks_required - ignoring (state) (node_md_disks{state="active"}) != 0' query
2) warning alert with node_md_disks{state="failed"} > 0. This should fire even when the array is configured with spare devices.
@hoffie @Justin417 WDYT?
@beorn7 maybe we could include those in node mixins after node-exporter v1.0.0 is released?
Hey all,
Thank you very much for the suggestions! Indeed, comparing node_md_disks{state="active"} and node_md_disks_required does the trick!
Just a heads up, to get my alert working fully, I had to add ignoring(state) in my expression since the metrics didn't have matching label values:
node_md_disks{state="active"} != ignoring(state) node_md_disks_required
That did the trick for me.
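For anyone wondering why ignoring(state) is needed: as far as I understand it, a binary operator between two vectors only pairs samples whose label sets are identical, and node_md_disks carries a state label that node_md_disks_required does not. The Python sketch below is a toy model of that pairing rule, not Prometheus code. The function `match_not_equal` is a hypothetical name, and the model does not try to reproduce how Prometheus shapes the labels of the result.

```python
def match_not_equal(left, right, ignoring=()):
    """Toy model of `left != ignoring(...) right` with one-to-one matching.

    Each side is a list of (labels_dict, value) pairs. Samples pair up only
    when their label sets are equal after dropping the ignored labels.
    """
    def key(labels):
        return frozenset((k, v) for k, v in labels.items() if k not in ignoring)

    right_by_key = {key(labels): value for labels, value in right}
    out = []
    for labels, value in left:
        other = right_by_key.get(key(labels))
        # Keep the left-hand sample only if a partner exists and the values differ.
        if other is not None and value != other:
            out.append((labels, value))
    return out


# Values taken from the metrics posted in this issue.
active = [({"device": "md126", "state": "active"}, 1),
          ({"device": "md127", "state": "active"}, 1)]
required = [({"device": "md126"}, 2),
            ({"device": "md127"}, 2)]

# Without ignoring(state) nothing pairs up, so the alert can never fire.
print(match_not_equal(active, required))
# With ignoring(state) both degraded arrays are returned.
print(match_not_equal(active, required, ignoring=("state",)))
```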