Node exporter doesn't report the number of failed disks in an mdadm array, which is probably the most useful metric for this collector.
mdstat:
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md5 : active raid5 sda1[5] sdc1[0] sdb1[4](F) sdd1[1]
8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
md1 : active raid1 sde1[0] sdf1[1]
58581824 blocks super 1.2 [2/2] [UU]
unused devices: <none>
current metrics:
# HELP node_md_blocks Total number of blocks on device.
# TYPE node_md_blocks gauge
node_md_blocks{device="md1"} 5.8581824e+07
node_md_blocks{device="md5"} 8.790400512e+09
# HELP node_md_blocks_synced Number of blocks synced on device.
# TYPE node_md_blocks_synced gauge
node_md_blocks_synced{device="md1"} 5.8581824e+07
node_md_blocks_synced{device="md5"} 8.790400512e+09
# HELP node_md_disks Total number of disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md1"} 2
node_md_disks{device="md5"} 4
# HELP node_md_disks_active Number of active disks of device.
# TYPE node_md_disks_active gauge
node_md_disks_active{device="md1"} 2
node_md_disks_active{device="md5"} 3
# HELP node_md_is_active Indicator whether the md-device is active or not.
# TYPE node_md_is_active gauge
node_md_is_active{device="md1"} 1
node_md_is_active{device="md5"} 1
P.S.: I do see the node_md_disks - node_md_disks_active calculation, but I'm not sure how it should work with hot spares.
Maybe instead of node_md_disks and node_md_disks_active we should have a label for state:
node_md_disks{device="md5",state="active"} 4
node_md_disks{device="md5",state="failed"} 0
node_md_disks{device="md5",state="spare"} 1
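To make the proposal concrete, here is a minimal sketch that derives per-state disk counts from the mdstat sample quoted earlier and renders them in the proposed labelled form. It assumes members flagged `(F)` are failed, members flagged `(S)` are spares, and everything else counts as active; the helper names `disks_by_state` and `render` are hypothetical and are not part of node_exporter.

```python
import re

# Sample copied from the mdstat output quoted in this issue.
MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md5 : active raid5 sda1[5] sdc1[0] sdb1[4](F) sdd1[1]
8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
md1 : active raid1 sde1[0] sdf1[1]
58581824 blocks super 1.2 [2/2] [UU]
unused devices: <none>
"""

# A member looks like "sdb1[4]", optionally followed by a flag such as (F) or (S).
MEMBER = re.compile(r"^\S+\[\d+\](\([A-Z]\))?$")


def disks_by_state(mdstat_text):
    """Return {array: {"active": n, "failed": n, "spare": n}} (hypothetical helper)."""
    result = {}
    for line in mdstat_text.splitlines():
        name, sep, rest = line.partition(" : ")
        if not sep or not name.startswith("md"):
            continue  # not an array line (e.g. "Personalities : ...")
        counts = {"active": 0, "failed": 0, "spare": 0}
        for token in rest.split():
            match = MEMBER.match(token)
            if not match:
                continue  # skips words such as "active" and "raid5"
            flag = match.group(1)
            if flag == "(F)":
                counts["failed"] += 1
            elif flag == "(S)":
                counts["spare"] += 1
            else:
                # Assumption: any other flag is treated as an active member.
                counts["active"] += 1
        result[name] = counts
    return result


def render(counts_by_array):
    """Format the counts as node_md_disks samples with a state label."""
    lines = []
    for array in sorted(counts_by_array):
        for state in ("active", "failed", "spare"):
            value = counts_by_array[array][state]
            lines.append(f'node_md_disks{{device="{array}",state="{state}"}} {value}')
    return lines


if __name__ == "__main__":
    print("\n".join(render(disks_by_state(MDSTAT))))
```

With a state label, hot spares get their own series, so they no longer have to be inferred from a subtraction.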
May I ask if this issue is still pursued with PR https://github.com/prometheus/node_exporter/pull/492 closed?
I couldn't find the corresponding PR and can't check on a real system right now. However, the issue looks addressed, so let's close it for now.
@hryamzik (and anyone else who is searching for why node_exporter doesn't have a metric for md software raid disk states (e.g. failed, active, etc)),
I found a few PRs that add the disk states, but none of them got merged, e.g.:
https://github.com/prometheus/node_exporter/pull/648
https://github.com/prometheus/node_exporter/pull/492
Seems like there's more debate than consensus in those PRs, and they get closed over time.
In the meantime, I have updated the md_info textcollector to include these disk states in node_md_info_* metrics, e.g. node_md_info, node_md_info_FailedDevices, node_md_info_WorkingDevices, etc.
See that PR and the updated md_info textcollector here:
https://github.com/prometheus/node_exporter/pull/1204
@mpursley You're right, the PRs weren't merged. As mentioned here: https://github.com/prometheus/node_exporter/pull/648#issuecomment-341387763
The consensus was that we would like to have the functionality, but in its own module that the node-exporter just uses. That could be part of the node-exporter, but it would be great if it were externally maintained.
Yeah, makes sense. Another option people can use in the meantime is this (now merged) text_collector script (running in a cronjob as root):
https://github.com/prometheus/node_exporter/blob/master/text_collector_examples/md_info_detail.sh
Thanks @discordianfish
Hi everyone. I was planning to extract the stat-extraction complexity in mdadm_linux.go into a different repository and call functions from that repo in mdadm_linux.go to serve it with node_exporter. Does that sound good?
@You-NeverKnow Yes, sounds great, we've been moving all of the generic /proc and /sys parsing to prometheus/procfs.
I'm confused. I was under the impression that we were using the GET_ARRAY_INFO ioctl call to retrieve RAID array statuses, according to this comment: https://github.com/prometheus/node_exporter/pull/648#issuecomment-323578564.
Would you rather use the mdstat parser in procfs instead?
We could go the syscall method route, or the parsing route. I haven't looked at the mdadm stuff recently, but afaik there's some stuff you can only get with parsing. Also, we need to make sure any syscalls are available as non-root. We don't allow code in the node_exporter that requires root level access for safety.
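As an illustration of the parsing route, here is a minimal sketch that reads a per-array `degraded` attribute from sysfs. It assumes the kernel exposes `/sys/block/<array>/md/degraded` as described in the kernel's md documentation; whether that file is readable without root should be checked on the target system. The function name `md_degraded_counts` is hypothetical.

```python
import os


def md_degraded_counts(sys_block="/sys/block"):
    """Return {array: degraded_count} for every block device exposing md/degraded.

    Hypothetical helper. Devices without the attribute (non-md devices, or
    arrays that do not expose it) are skipped, and so are unreadable files.
    """
    counts = {}
    for entry in sorted(os.listdir(sys_block)):
        path = os.path.join(sys_block, entry, "md", "degraded")
        try:
            with open(path) as handle:
                counts[entry] = int(handle.read().strip())
        except (FileNotFoundError, NotADirectoryError):
            continue  # not an md array, or the attribute is not present
        except PermissionError:
            continue  # unreadable for the current user; no root is assumed
    return counts


if __name__ == "__main__":
    for array, missing in md_degraded_counts().items():
        print(f"{array}: degraded by {missing} device(s)")
```

The `sys_block` parameter exists only so the function can be pointed at a fake directory tree for testing.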
Just came across this very useful feature request. Anything changed since last year?
For users:
The link was broken; it's now:
script: https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
doc: https://github.com/prometheus/node_exporter#textfile-collector
> Just came across this very useful feature request. Anything changed since last year?
Yes, I would say so. #1403 was merged, which included the refactoring into procfs and the addition of the state label. I think the merge of that PR was also supposed to close this issue? @SuperQ
The change is part of v1.0.0-rc.0.
Technically, it should be possible to see disks in status "failed" now. However, this is only true as long as the kernel has not removed the failed disks from the array (see #1655).
Awesome, I switched to v1.0.0-rc.0 and now get the metrics I wanted.
e.g.:
node_md_disks{device="md0",state="active"} 2
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0
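For anyone wanting to act on these metrics, here is a minimal sketch that scans exposition text like the sample above and lists arrays that look degraded. It treats an array as degraded when its failed count is above zero or when fewer disks are active than required; the second check is an assumption meant to cover failed disks that the kernel has already removed from the array. The names `parse_samples` and `degraded_arrays` are hypothetical.

```python
import re

# Subset of the sample output quoted above.
SAMPLE = """\
node_md_disks{device="md0",state="active"} 2
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
"""

SAMPLE_LINE = re.compile(r"^(\w+)\{([^}]*)\}\s+(\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match.group(2)))
            yield match.group(1), labels, float(match.group(3))


def degraded_arrays(text):
    """Return a sorted list of devices with failed or missing disks."""
    active, failed, required = {}, {}, {}
    for name, labels, value in parse_samples(text):
        device = labels.get("device")
        if name == "node_md_disks_required":
            required[device] = value
        elif name == "node_md_disks" and labels.get("state") == "active":
            active[device] = value
        elif name == "node_md_disks" and labels.get("state") == "failed":
            failed[device] = value
    return sorted(
        device
        for device in required
        if failed.get(device, 0) > 0 or active.get(device, 0) < required[device]
    )


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

In a real setup the same comparison would more likely live in an alerting rule than in a script; the code only spells out the logic.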
I would also consider the ticket closed 😄 Thanks for the fast help!
Great and thanks for confirming. Closing.