Node_exporter: Feature request: mdadm disk fail metric

Created on 22 Jun 2016 · 14 Comments · Source: prometheus/node_exporter

Node exporter doesn't report the number of failed disks in mdadm arrays, which is probably the most useful metric for this collector.

mdstat:

Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md5 : active raid5 sda1[5] sdc1[0] sdb1[4](F) sdd1[1]
      8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

md1 : active raid1 sde1[0] sdf1[1]
      58581824 blocks super 1.2 [2/2] [UU]

unused devices: <none>

current metrics:

# HELP node_md_blocks Total number of blocks on device.
# TYPE node_md_blocks gauge
node_md_blocks{device="md1"} 5.8581824e+07
node_md_blocks{device="md5"} 8.790400512e+09
# HELP node_md_blocks_synced Number of blocks synced on device.
# TYPE node_md_blocks_synced gauge
node_md_blocks_synced{device="md1"} 5.8581824e+07
node_md_blocks_synced{device="md5"} 8.790400512e+09
# HELP node_md_disks Total number of disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md1"} 2
node_md_disks{device="md5"} 4
# HELP node_md_disks_active Number of active disks of device.
# TYPE node_md_disks_active gauge
node_md_disks_active{device="md1"} 2
node_md_disks_active{device="md5"} 3
# HELP node_md_is_active Indicator whether the md-device is active or not.
# TYPE node_md_is_active gauge
node_md_is_active{device="md1"} 1
node_md_is_active{device="md5"} 1

P.S.: I do see that node_md_disks - node_md_disks_active can be calculated, but I'm not sure how it would work with hot spares.
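The hot-spare concern can be illustrated with a small sketch. It assumes the usual /proc/mdstat convention in which a faulty member carries an `(F)` suffix and a spare carries `(S)`, so the two can be counted separately instead of being inferred from a subtraction. The function name `count_member_states` and the `other` bucket are hypothetical, and this is not how node_exporter itself parses the file.

```python
import re

# Matches one member entry such as "sdb1[4](F)" or "sde1[0]".
MEMBER_RE = re.compile(r"(?P<name>\S+?)\[(?P<slot>\d+)\](?P<flags>(?:\([A-Z]\))*)")

# Array lines taken from the mdstat output quoted in this issue.
SAMPLE = """\
md5 : active raid5 sda1[5] sdc1[0] sdb1[4](F) sdd1[1]
      8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

md1 : active raid1 sde1[0] sdf1[1]
      58581824 blocks super 1.2 [2/2] [UU]
"""


def count_member_states(mdstat_text):
    """Return {array: {"failed": n, "spare": n, "other": n}} from mdstat text."""
    counts = {}
    for line in mdstat_text.splitlines():
        # Array lines look like "md5 : active raid5 sda1[5] ...".
        head, sep, rest = line.partition(" : ")
        if not sep or not head.startswith("md"):
            continue
        states = {"failed": 0, "spare": 0, "other": 0}
        for match in MEMBER_RE.finditer(rest):
            flags = match.group("flags")
            if "(F)" in flags:
                states["failed"] += 1
            elif "(S)" in flags:
                states["spare"] += 1
            else:
                # Members without an (F) or (S) marker.
                states["other"] += 1
        counts[head.strip()] = states
    return counts


if __name__ == "__main__":
    print(count_member_states(SAMPLE))
```

For the mdstat output above, the sketch counts one failed member for md5 and none for md1.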

Labels: accepted, enhancement

All 14 comments

Maybe instead of node_md_disks and node_md_disks_active we should have a single metric with a state label:

node_md_disks{device="md5",state="active"} 4
node_md_disks{device="md5",state="failed"} 0
node_md_disks{device="md5",state="spare"} 1
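As a rough illustration of the proposal, the sketch below renders per-state counts as Prometheus text exposition lines. The function name `render_md_disks` and the HELP text are assumptions made for this example, not the exporter's actual output. It also assumes device names need no label escaping.

```python
def render_md_disks(device, counts):
    """Format per-state disk counts as Prometheus text exposition lines.

    `counts` maps a state name (e.g. "active", "failed", "spare") to a
    disk count. The HELP text below is a placeholder for illustration.
    """
    lines = [
        "# HELP node_md_disks Number of disks of device, by state.",
        "# TYPE node_md_disks gauge",
    ]
    # Sort states so the output is stable between runs.
    for state in sorted(counts):
        lines.append(
            'node_md_disks{device="%s",state="%s"} %d'
            % (device, state, counts[state])
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_md_disks("md5", {"active": 4, "failed": 0, "spare": 1}), end="")
```

With one series per state, the total can be obtained by summing over the state label, and spares no longer distort a "total minus active" subtraction.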

May I ask whether this issue is still being pursued now that PR https://github.com/prometheus/node_exporter/pull/492 is closed?

I didn't manage to find the corresponding PR and can't check on a real system right now. However, the issue looks addressed, so let's close it for now.

@hryamzik (and anyone else searching for why node_exporter doesn't have a metric for md software RAID disk states such as failed or active),

I found a few PRs that add the disk states, but none of them were merged, e.g.:
https://github.com/prometheus/node_exporter/pull/648
https://github.com/prometheus/node_exporter/pull/492

It seems there's more debate than consensus in those PRs, and they get closed over time.
In the meantime, I have updated the md_info textcollector to include these disk states in node_md_info_* metrics, e.g. node_md_info, node_md_info_FailedDevices, node_md_info_WorkingDevices, etc.

See that PR and the updated md_info textcollector here:
https://github.com/prometheus/node_exporter/pull/1204

@mpursley You're right, the PRs weren't merged. As mentioned here: https://github.com/prometheus/node_exporter/pull/648#issuecomment-341387763
The consensus was that we would like to have the functionality, but in its own module that node_exporter just uses. That module could be part of node_exporter, but it would be great if it were externally maintained.

Yeah, makes sense. Another option people can use in the meantime is this (now merged) text_collector script, run from a cron job as root:

https://github.com/prometheus/node_exporter/blob/master/text_collector_examples/md_info_detail.sh

Thanks @discordianfish

Hi everyone. I was planning to extract the stat-extraction complexity in mdadm_linux.go into a separate repository and call functions from that repo in mdadm_linux.go to serve the metrics with node_exporter. Does that sound good?

@You-NeverKnow Yes, sounds great, we've been moving all of the generic /proc and /sys parsing to prometheus/procfs.

I'm confused. I was under the impression that we were using the GET_ARRAY_INFO ioctl call to retrieve RAID array statuses, according to this comment: https://github.com/prometheus/node_exporter/pull/648#issuecomment-323578564.

Would you rather use the mdstat parser in procfs instead?

#648 was never merged, so we're still just doing proc file parsing.

We could go the syscall route or the parsing route. I haven't looked at the mdadm stuff recently, but afaik there's some information you can only get by parsing. Also, we need to make sure any syscalls are available to non-root users. For safety, we don't allow code in node_exporter that requires root-level access.

Just came across this very useful feature request. Anything changed since last year?

For users:
The link was broken; it's now:
script: https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
doc: https://github.com/prometheus/node_exporter#textfile-collector
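For anyone writing their own script for the textfile collector, here is a minimal sketch of the write step. It assumes the collector reads `*.prom` files from a configured directory, so the output is written to a temporary file first and then renamed into place to avoid a partially written file being scraped. The function name `write_prom_file` and the file name used in the example are hypothetical.

```python
import os
import tempfile


def write_prom_file(directory, name, text):
    """Atomically write `text` to <directory>/<name>.prom."""
    # The temporary file does not end in ".prom", so it is not picked up
    # while it is still being written.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # mkstemp creates the file readable by its owner only; relax the mode
        # so a non-root node_exporter process can read it.
        os.chmod(tmp_path, 0o644)
        # Rename within the same directory replaces the target in one step.
        os.replace(tmp_path, os.path.join(directory, name + ".prom"))
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        write_prom_file(demo_dir, "md_info", 'node_md_disks{device="md0",state="failed"} 0\n')
        print(os.listdir(demo_dir))
```

A script like this can run as root from cron and gather whatever it needs, while node_exporter itself only reads the resulting file.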

Just came across this very useful feature request. Anything changed since last year?

Yes, I would say so. #1403 was merged, which included the refactoring into procfs and the addition of the state label. I think the merge of that PR was also supposed to close this issue? @SuperQ

The change is part of v1.0.0-rc.0.
Technically, it should be possible to see disks in status "failed" now. However, this is only true as long as the kernel has not removed the failed disks from the array (see #1655).

Awesome, I switched to v1.0.0-rc.0 and now get the metrics I wanted, e.g.:

node_md_disks{device="md0",state="active"} 2
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0
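One way to use these metrics for a degraded-array check is sketched below. It flags a device when any disk is reported as failed, or when fewer disks are active than required. The second condition is meant to cover the case mentioned above where the kernel has already removed the failed disk from the array (see #1655). The function names and the simplified line parser are assumptions for this example and handle only the simple `name{labels} value` lines shown here.

```python
import re

# Simplified matcher for lines like: name{label="value",...} 1
SAMPLE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group("labels")))
            yield match.group("name"), labels, float(match.group("value"))


def degraded_devices(text):
    """Return sorted md devices with failed disks or too few active disks."""
    active, failed, required = {}, {}, {}
    for name, labels, value in parse_samples(text):
        device = labels.get("device")
        if name == "node_md_disks" and labels.get("state") == "active":
            active[device] = value
        elif name == "node_md_disks" and labels.get("state") == "failed":
            failed[device] = value
        elif name == "node_md_disks_required":
            required[device] = value
    return sorted(
        device
        for device in required
        if failed.get(device, 0) > 0 or active.get(device, 0) < required[device]
    )
```

For the md0 output above (2 active, 0 failed, 2 required) the check returns an empty list.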

I would also consider the ticket closed 😄 Thanks for the fast help!

Great and thanks for confirming. Closing.
