Node_exporter: 1.0.0-rc.0 CentOS 7 mdadm exporter not reporting failed drive

Created on 27 Mar 2020  ·  3 Comments  ·  Source: prometheus/node_exporter

Host operating system: output of uname -a

CentOS Linux release 7.7.1908 (Core):

[justin@vps-compute0 ~]# uname -a
Linux vps-compute0 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.0.0-rc.0 (branch: HEAD, revision: ef7c05816adcb0e8923defe34e97f6afcce0a939)
  build user:       root@b38a2df1a38b
  build date:       20200220-12:54:05
  go version:       go1.13.8

node_exporter command line flags

node_exporter --collector.mdadm --web.listen-address=:9100

P.S. I tried running with and without --collector.mdadm, however the same issue appeared.

Are you running node_exporter in Docker?


No.

What did you do that produced an error?

While node_exporter was running, a drive in the software RAID array failed. This affected both of my software RAID devices, md126 and md127.

What did you expect to see?

I expected the metrics node_md_disks or node_md_state to change from "active" = 1 to "failed", "spare", or "inactive" = 1.

What did you see instead?

Both node_md_disks and node_md_state remained with "active" = 1, which meant I was unable to detect a disk failure from the metrics.

Extra debug:

mdadm version: mdadm - v4.1 - 2018-10-01
Disk failure in syslog:

Mar 27 15:37:03 vps-compute0 kernel: md: super_written gets error=-5, uptodate=0
Mar 27 15:37:03 vps-compute0 kernel: md/raid1:md126: Disk failure on sdd1, disabling device.#012md/raid1:md126: Operation continuing on 1 devices.
Mar 27 15:37:07 vps-compute0 kernel: md/raid1:md127: Disk failure on sdd2, disabling device.#012md/raid1:md127: Operation continuing on 1 devices.

Unchanged node_exporter metrics before and after the failure:

# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md126",state="active"} 1
node_md_state{device="md126",state="inactive"} 0
node_md_state{device="md126",state="recovering"} 0
node_md_state{device="md126",state="resync"} 0
node_md_state{device="md127",state="active"} 1
node_md_state{device="md127",state="inactive"} 0
node_md_state{device="md127",state="recovering"} 0
node_md_state{device="md127",state="resync"} 0
# HELP node_md_disks Number of active/failed/spare disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md126",state="active"} 1
node_md_disks{device="md126",state="failed"} 0
node_md_disks{device="md126",state="spare"} 0
node_md_disks{device="md127",state="active"} 1
node_md_disks{device="md127",state="failed"} 0
node_md_disks{device="md127",state="spare"} 0






# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md126"} 2
node_md_disks_required{device="md127"} 2






[root@vps-compute0 ~]# cat /proc/mdstat 
Personalities : [raid1] 
md126 : active raid1 sdc1[0]
      1557438464 blocks super 1.2 [2/1] [U_]
      bitmap: 4/12 pages [16KB], 65536KB chunk

md127 : active raid1 sdc2[0]
      1047552 blocks super 1.2 [2/1] [U_]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>






# mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Thu Feb 13 09:58:31 2020
        Raid Level : raid1
        Array Size : 1047552 (1023.00 MiB 1072.69 MB)
     Used Dev Size : 1047552 (1023.00 MiB 1072.69 MB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Mar 27 15:37:07 2020
             State : clean, degraded 
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : localhost:boot
              UUID : 4b1837c9:82597f5e:7e16e39b:1e2c4cc5
            Events : 42

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdc2
       -       0        0        1      removed

Please let me know if this information was sufficient, and if you require any additional info.

Thanks for your time!

All 3 comments

Hrm. node_exporter usually tries to pass on the information from the kernel as verbatim as possible. The same seems to happen here. I guess the kernel has already removed the failed devices (according to the log message), which is why it doesn't mention them in /proc/mdstat anymore. However, this is also the file which node_exporter (more exactly: procfs) relies on.

The only indicator about the failure I can see in mdstat is the ratio of total devices (2) to active devices (1).

This should map to node_md_disks_required and node_md_disks{state="active"}. The latter is not included in your pre-failure output. I assume it was listed as node_md_disks{device="md126",state="active"} 2 before the failure; at least, this is what my healthy raid1 says and what the code says it should be. Can you confirm?

I guess you should alert on node_md_disks{state="active"} != node_md_disks_required. This should have detected the problem.
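
To make the "ratio of total devices to active devices" point concrete, here is a minimal Python sketch that reads the [required/active] pair out of the /proc/mdstat text posted above. It is only an illustration: the function name degraded_arrays and the regular expressions are hypothetical, and this is not how node_exporter or procfs actually parse the file.

```python
import re

# Sample copied from the /proc/mdstat output in the report above.
MDSTAT = """\
Personalities : [raid1]
md126 : active raid1 sdc1[0]
      1557438464 blocks super 1.2 [2/1] [U_]
      bitmap: 4/12 pages [16KB], 65536KB chunk

md127 : active raid1 sdc2[0]
      1047552 blocks super 1.2 [2/1] [U_]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
"""

# "md126 : active raid1 ..." starts a new array entry.
DEVICE_RE = re.compile(r"^(md\d+) : (\w+)")
# "[2/1]" means 2 disks required, 1 currently active.
COUNTS_RE = re.compile(r"\[(\d+)/(\d+)\]")


def degraded_arrays(mdstat_text):
    """Return {device: (required, active)} for arrays with active < required."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        device = DEVICE_RE.match(line)
        if device:
            current = device.group(1)
            continue
        counts = COUNTS_RE.search(line)
        if current and counts:
            required, active = int(counts.group(1)), int(counts.group(2))
            if active < required:
                result[current] = (required, active)
            current = None  # only the first status line belongs to the array
    return result


print(degraded_arrays(MDSTAT))
```

Both arrays in the report show [2/1], so both come back as degraded even though no device is listed as failed.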

Maybe we could have generic types of alerts for mdadm as follows:
1) A critical alert with the query 'node_md_disks_required - ignoring (state) (node_md_disks{state="active"}) != 0'.
2) A warning alert with the query 'node_md_disks{state="failed"} > 0'. This should fire even when the array is configured with spare devices.

@hoffie @Justin417 WDYT?


@beorn7 maybe we could include those in node mixins after node-exporter v1.0.0 is released?

Hey all,

Thank you very much for the suggestions! Indeed, comparing node_md_disks{state="active"} and node_md_disks_required does the trick!

Just a heads up, to get my alert working fully, I had to add ignoring(state) in my expression since the metrics didn't have matching label values:

node_md_disks{state="active"} != ignoring(state) node_md_disks_required

That did the trick for me.
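
For anyone wondering why ignoring(state) is needed: the two series only pair up when their label sets are equal, and node_md_disks carries a state label that node_md_disks_required lacks. The Python sketch below is a deliberately simplified model of that one-to-one matching, not Prometheus code. The function match_not_equal and the data layout are made up for illustration, the sample values come from the metrics in this report, and the exact label set Prometheus puts on the result is not modelled.

```python
def match_not_equal(left, right, ignoring=()):
    """Simplified model of `left != ignoring(...) right` with one-to-one matching.

    Each side is a list of (labels, value) pairs. A left sample is returned
    when a right sample has the same labels (after dropping the ignored ones)
    and a different value.
    """
    def key(labels):
        return frozenset((k, v) for k, v in labels.items() if k not in ignoring)

    right_by_key = {key(labels): value for labels, value in right}
    matches = []
    for labels, value in left:
        k = key(labels)
        if k in right_by_key and right_by_key[k] != value:
            matches.append((labels, value))
    return matches


# node_md_disks{state="active"} after the failure, as shown in the report.
active = [
    ({"device": "md126", "state": "active"}, 1),
    ({"device": "md127", "state": "active"}, 1),
]
# node_md_disks_required, which has no "state" label.
required = [
    ({"device": "md126"}, 2),
    ({"device": "md127"}, 2),
]

# Without ignoring(state) no label sets are equal, so nothing is compared.
print(match_not_equal(active, required))
# With ignoring(state) both arrays pair up and differ, so the alert would fire.
print(match_not_equal(active, required, ignoring=("state",)))
```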
