Node_exporter: Feature request: mdadm disk fail metric

Created on 22 Jun 2016 · 14 Comments · Source: prometheus/node_exporter

Node exporter doesn't report the number of failed disks in mdadm arrays, which is probably the most useful metric for this collector.

mdstat:

Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md5 : active raid5 sda1[5] sdc1[0] sdb1[4](F) sdd1[1]
      8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

md1 : active raid1 sde1[0] sdf1[1]
      58581824 blocks super 1.2 [2/2] [UU]

unused devices: <none>

current metrics:

# HELP node_md_blocks Total number of blocks on device.
# TYPE node_md_blocks gauge
node_md_blocks{device="md1"} 5.8581824e+07
node_md_blocks{device="md5"} 8.790400512e+09
# HELP node_md_blocks_synced Number of blocks synced on device.
# TYPE node_md_blocks_synced gauge
node_md_blocks_synced{device="md1"} 5.8581824e+07
node_md_blocks_synced{device="md5"} 8.790400512e+09
# HELP node_md_disks Total number of disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md1"} 2
node_md_disks{device="md5"} 4
# HELP node_md_disks_active Number of active disks of device.
# TYPE node_md_disks_active gauge
node_md_disks_active{device="md1"} 2
node_md_disks_active{device="md5"} 3
# HELP node_md_is_active Indicator whether the md-device is active or not.
# TYPE node_md_is_active gauge
node_md_is_active{device="md1"} 1
node_md_is_active{device="md5"} 1

P.S.: I do see the node_md_disks - node_md_disks_active calculation, but I'm not sure how it should work with hot spares.
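
To illustrate the hot-spare question: in /proc/mdstat, failed members carry an (F) marker and spares an (S) marker, so the states can be counted directly instead of subtracting two gauges. The following is a minimal sketch in Python, not node_exporter's actual parser. The function name `count_disk_states` is hypothetical, and treating every member without an (F) or (S) marker as active is a simplifying assumption.

```python
import re

# The mdstat sample from this issue.
MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md5 : active raid5 sda1[5] sdc1[0] sdb1[4](F) sdd1[1]
      8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

md1 : active raid1 sde1[0] sdf1[1]
      58581824 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

# A member looks like "sdb1[4]", optionally followed by markers such as "(F)".
MEMBER_RE = re.compile(r"^\S+?\[\d+\]((?:\([A-Z]\))*)$")


def count_disk_states(mdstat_text):
    """Return {array: {"active": n, "failed": n, "spare": n}} (hypothetical helper)."""
    result = {}
    for line in mdstat_text.splitlines():
        fields = line.split()
        # Array lines have the form "<name> : <state> [<level>] <members...>".
        if len(fields) < 3 or fields[1] != ":" or not fields[0].startswith("md"):
            continue
        counts = {"active": 0, "failed": 0, "spare": 0}
        for token in fields[2:]:
            match = MEMBER_RE.match(token)
            if match is None:
                continue  # state or RAID level word, not a member
            markers = match.group(1)
            if "(F)" in markers:
                counts["failed"] += 1
            elif "(S)" in markers:
                counts["spare"] += 1
            else:
                # Assumption: any other member is counted as active.
                counts["active"] += 1
        result[fields[0]] = counts
    return result
```

With the sample above, md5 comes out as 3 active, 1 failed, 0 spare, and md1 as 2 active with nothing failed or spare.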

accepted enhancement

hryamzik

Most helpful comment

Maybe instead of node_md_disks and node_md_disks_active we should have a label value for state

node_md_disks{device="md5",state="active"} 4
node_md_disks{device="md5",state="failed"} 0
node_md_disks{device="md5",state="spare"} 1

SuperQ on 23 Jun 2016

👍17
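
As a rough illustration of the proposal above, per-state counts could be rendered as one gauge with a state label along these lines. This is a sketch in Python only. The function name `render_md_disks` and the HELP text are hypothetical and not taken from node_exporter.

```python
def render_md_disks(counts):
    """Render {device: {state: count}} as text-format samples (hypothetical helper)."""
    lines = [
        # Assumption: this HELP text is illustrative, not node_exporter's wording.
        "# HELP node_md_disks Number of disks of device, by state.",
        "# TYPE node_md_disks gauge",
    ]
    for device in sorted(counts):
        for state in ("active", "failed", "spare"):
            value = counts[device].get(state, 0)
            lines.append(
                f'node_md_disks{{device="{device}",state="{state}"}} {value}'
            )
    return "\n".join(lines) + "\n"
```

Calling `render_md_disks({"md5": {"active": 4, "failed": 0, "spare": 1}})` yields the three sample lines shown in the comment. One series per state keeps hot spares out of the failed count, and the total is still available by summing over the state label.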

All 14 comments

May I ask if this issue is still pursued with PR https://github.com/prometheus/node_exporter/pull/492 closed?

frittentheke on 25 Apr 2018

👍1

I didn't manage to find the corresponding PR and can't check on a real system right now. However, the issue looks addressed, so let's close it for now.

hryamzik on 25 Apr 2018

@hryamzik (and anyone else searching for why node_exporter doesn't have a metric for md software RAID disk states, e.g. failed, active, etc.),

I found a few PRs that add the disk states, but none of them got merged, e.g.
https://github.com/prometheus/node_exporter/pull/648
https://github.com/prometheus/node_exporter/pull/492

Seems like there's more debate than consensus in those PRs, and they get closed over time...
In the meantime, I have updated the md_info textcollector to include these disk states in node_md_info_* metrics, e.g. node_md_info, node_md_info_FailedDevices, node_md_info_WorkingDevices, etc.

See that PR and the updated md_info textcollector here:
https://github.com/prometheus/node_exporter/pull/1204

mpursley on 26 Jan 2019

@mpursley You're right, the PRs weren't merged. As mentioned here: https://github.com/prometheus/node_exporter/pull/648#issuecomment-341387763
The consensus was that we would like to have the functionality, but in its own module that the node-exporter just uses. That module could be part of the node-exporter, but it would be great if it were externally maintained.

discordianfish on 9 Feb 2019

👀1

Yeah, makes sense. Another option people can use in the meantime is this (now merged) text_collector script, run from a cron job as root:

https://github.com/prometheus/node_exporter/blob/master/text_collector_examples/md_info_detail.sh

Thanks @discordianfish

mpursley on 15 Feb 2019

👍1
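
For readers unfamiliar with the textfile approach mentioned above: a script writes metrics into a `.prom` file inside the directory given to node_exporter via `--collector.textfile.directory`, and the exporter serves whatever it finds there. A minimal Python sketch of the writing side follows. It is not the linked shell script, and the helper name `write_prom_file` is hypothetical. The file is written under a temporary name and then renamed so the exporter never reads a half-written file.

```python
import os
import tempfile


def write_prom_file(directory, filename, body):
    """Atomically write `body` to directory/filename (hypothetical helper)."""
    # The temp file does not end in ".prom", so the collector ignores it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(body)
        # mkstemp creates the file with mode 0600; the exporter usually runs
        # as a different, unprivileged user, so make the file world-readable.
        os.chmod(tmp_path, 0o644)
        # Rename within the same directory, replacing any previous file.
        os.replace(tmp_path, os.path.join(directory, filename))
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A cron job running as root could gather the disk states and call, for example, `write_prom_file(textfile_dir, "md.prom", body)`, where `textfile_dir` and the file name are placeholders for whatever the local setup uses. node_exporter itself only reads the file, so it does not need root.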

Hi everyone. I was planning to extract the stat-extraction complexity in mdadm_linux.go into a separate repository and call functions from that repo in mdadm_linux.go to serve the stats through node_exporter. Does that sound good?

You-NeverKnow on 7 Jun 2019

@You-NeverKnow Yes, sounds great, we've been moving all of the generic /proc and /sys parsing to prometheus/procfs.

SuperQ on 10 Jun 2019

I'm confused. I was under the impression that we were using the GET_ARRAY_INFO ioctl call to retrieve RAID array statuses, according to this comment: https://github.com/prometheus/node_exporter/pull/648#issuecomment-323578564.

Would you rather use the mdstat parser in procfs instead?

You-NeverKnow on 10 Jun 2019

#648 was never merged, so we're still just doing proc file parsing.

We could go the syscall route or the parsing route. I haven't looked at the mdadm stuff recently, but afaik there's some information you can only get by parsing. Also, we need to make sure any syscalls are available to non-root users. For safety, we don't allow code in the node_exporter that requires root-level access.

SuperQ on 10 Jun 2019

Just came across this very useful feature request. Anything changed since last year?

For users:
The script link above is broken. The current links are:
script: https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
doc: https://github.com/prometheus/node_exporter#textfile-collector

judos on 29 Mar 2020

> Just came across this very useful feature request. Anything changed since last year?

Yes, I would say so. #1403 was merged, which included the refactoring into procfs and the addition of the state label. I think the merge of that PR was also supposed to close this issue? @SuperQ

The change is part of v1.0.0-rc.0.
Technically, it should be possible to see disks in status "failed" now. However, this is only true as long as the kernel has not removed the failed disks from the array (see #1655).

hoffie on 29 Mar 2020

👍1

Awesome, I switched to v1.0.0-rc.0 and I get the metrics I wanted,
e.g.:

node_md_disks{device="md0",state="active"} 2
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0

I would also consider the ticket closed 😄 Thanks for the fast help!

judos on 29 Mar 2020
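
Given the caveat above that a failed disk may disappear from the "failed" count once the kernel removes it from the array, a check on these metrics probably should not rely on state="failed" alone. The Python sketch below flags an array when it reports failed disks or has fewer active disks than node_md_disks_required. The function name `degraded_arrays` is hypothetical, and the comparison rule is an assumption about what counts as degraded, not something node_exporter defines.

```python
import re

# Matches samples of node_md_disks and node_md_disks_required only.
SAMPLE_RE = re.compile(r"^(node_md_disks(?:_required)?)\{([^}]*)\}\s+(\S+)$")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def degraded_arrays(exposition_text):
    """Return a sorted list of devices that look degraded (hypothetical helper)."""
    required, active, failed = {}, {}, {}
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # comments, other metrics, blank lines
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels))
        device = labels.get("device")
        count = int(float(value))
        if name == "node_md_disks_required":
            required[device] = count
        elif labels.get("state") == "active":
            active[device] = count
        elif labels.get("state") == "failed":
            failed[device] = count
    # Assumption: only devices that report a required count are evaluated.
    return sorted(
        dev
        for dev in required
        if failed.get(dev, 0) > 0 or active.get(dev, 0) < required[dev]
    )
```

Fed the md0 output above, the function returns an empty list. If md0 reported 1 active disk, 0 failed, and 2 required, it would return ["md0"] even though no disk is listed as failed.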

Great and thanks for confirming. Closing.

discordianfish on 17 Apr 2020
