Collectd: RFE: smart plugin to read NVMe disks' health logs

Created on 18 Apr 2018 · 11 Comments · Source: collectd/collectd

Currently the smart plugin uses libatasmart to read disk health status. NVMe is becoming more popular, and it would be useful to have collectd report the health of NVMe drives too.

Feature request

All 11 comments

FYI, I have started working on this.
Right now I am including the NVMe headers from the Linux kernel and sending the appropriate ioctls.
Here is some test output from it. This was performed on a VM with an emulated NVMe disk, which is enough for checking the ioctl path:

[2018-04-20 17:33:37] smart plugin: checking SMART status of /dev/nvme0n1.
[2018-04-20 17:33:37] available_spare           : 0%
[2018-04-20 17:33:37] available_spare_threshold : 20%
[2018-04-20 17:33:37] percentage_used           : 0%
[2018-04-20 17:33:37] data_units_read           : 196397588482
[2018-04-20 17:33:37] data_units_written        : 205791232
[2018-04-20 17:33:37] host_read_commands        : 19172
[2018-04-20 17:33:37] host_write_commands       : 213
[2018-04-20 17:33:37] controller_busy_time      : 0
[2018-04-20 17:33:37] power_cycles              : 0
[2018-04-20 17:33:37] power_on_hours            : 72
[2018-04-20 17:33:37] unsafe_shutdowns          : 0
[2018-04-20 17:33:37] media_errors              : 0
[2018-04-20 17:33:37] num_err_log_entries       : 0
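
For anyone following along, here is a minimal sketch (not the plugin code) of how the 512-byte SMART / Health Information log page returned by such an ioctl could be decoded into the fields printed above. It assumes the field offsets from the NVMe specification, as mirrored by struct nvme_smart_log in the Linux kernel headers. The function name parse_smart_log and the dictionary keys are made up for illustration.

```python
import struct

SMART_LOG_SIZE = 512  # size of the SMART / Health Information log page


def _u128(buf, offset):
    """Read a 128-bit little-endian counter starting at `offset`."""
    return int.from_bytes(buf[offset:offset + 16], "little")


def parse_smart_log(buf):
    """Decode a raw SMART log page into a dict (hypothetical helper)."""
    if len(buf) < SMART_LOG_SIZE:
        raise ValueError("SMART log page must be 512 bytes")
    # Byte 0: critical warning, bytes 1-2: composite temperature in Kelvin,
    # bytes 3-5: spare, spare threshold, percentage used.
    crit, temp_k, spare, spare_thresh, used = struct.unpack_from("<BHBBB", buf, 0)
    # Bytes 200-215: eight 16-bit temperature sensors in Kelvin (0 = not present).
    sensors_k = struct.unpack_from("<8H", buf, 200)
    return {
        "critical_warning": crit,
        "temperature_c": temp_k - 273,
        "available_spare": spare,
        "available_spare_threshold": spare_thresh,
        "percentage_used": used,
        "data_units_read": _u128(buf, 32),
        "data_units_written": _u128(buf, 48),
        "host_read_commands": _u128(buf, 64),
        "host_write_commands": _u128(buf, 80),
        "controller_busy_time": _u128(buf, 96),
        "power_cycles": _u128(buf, 112),
        "power_on_hours": _u128(buf, 128),
        "unsafe_shutdowns": _u128(buf, 144),
        "media_errors": _u128(buf, 160),
        "num_err_log_entries": _u128(buf, 176),
        "temperature_sensors_c": [k - 273 if k else None for k in sensors_k],
    }
```

The counters are 128 bits wide in the log page, so a C implementation would have to decide how to narrow them to whatever value type collectd dispatches.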

Questions to the community about the implementation, before I go further:

  • will having raw ioctls in the smart plugin be accepted?
  • if not, do we want to extend libatasmart with NVMe support? It appears not to have been maintained for a few years, so I doubt patches would be accepted there. Oh, and with NVMe support a better name would be just libsmart :)
  • one last option would be to get rid of libatasmart and have ioctls in the smart plugin for both SATA and NVMe

smartmontools has NVMe SMART support, but it is not available as a library (and I guess it won't be; see https://www.smartmontools.org/ticket/501).

Hi!

Thanks for your work on Collectd.

Can you please create a PR for this? Then we can see your work in progress, and that will allow us to make suggestions and so on.

A quick search turns up https://github.com/hgst/libnvme.
Maybe that project can be useful in this work, or vice versa?

Here’s some test output from it – this was performed on a VM with emulated NVMe disk

Can you please provide a short guide to your setup?

@rpv-tomsk following up on this. Will post updates soon. Thanks for your review.

any progress?

@NebulaNeko This item is in our backlog, as we are focusing on enabling a few other plugins at the moment.
It would help our team if anyone could share how this feature would benefit them. Any information on how you would use it in your deployments is welcome.

The following metrics are being considered for this plugin.
Let us know if there are any specific metrics that you don't see below:

Composite Temperature
Available Spare
Available Spare Threshold
Percentage Used
Data Units Read
Data Units Written
Host Read Commands
Host Write Commands
Controller Busy Time
Power Cycles
Power On Hours
Unsafe Shutdowns
Media and Data Integrity Errors
Number of Error Information Log Entries

That looks good.

One of the NVMe disks at my disposal (Samsung SSD 960 EVO) reports data from two temperature sensors:

# nvme --smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 39 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 7%
data_units_read                     : 3,024,557
data_units_written                  : 87,710,216
host_read_commands                  : 158,877,287
host_write_commands                 : 1,939,820,563
controller_busy_time                : 25,433
power_cycles                        : 12
power_on_hours                      : 8,718
unsafe_shutdowns                    : 5
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 39 C    <-----------
Temperature Sensor 2                : 48 C    <-----------
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0

It would be nice to be able to collect both of them.
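
As a quick way to compare readings against whatever the plugin ends up reporting, the text output above can be turned into numbers with a few lines of Python. This is only a sketch for output shaped like the sample in this comment. The function name parse_smart_log_text is made up, units (C, %) are dropped, and other nvme-cli versions may format the output differently.

```python
import re


def parse_smart_log_text(text):
    """Map 'key : value' lines to integers, ignoring lines without a number."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no "key : value" structure on this line
        match = re.match(r"\s*(\d[\d,]*)", value)
        if match is None:
            continue  # e.g. the "Smart Log for NVME device:..." header
        # Strip thousands separators such as "87,710,216".
        metrics[key.strip()] = int(match.group(1).replace(",", ""))
    return metrics


sample = """\
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 39 C
data_units_written                  : 87,710,216
Temperature Sensor 1                : 39 C
Temperature Sensor 2                : 48 C
"""

readings = parse_smart_log_text(sample)
print(readings["Temperature Sensor 1"], readings["Temperature Sensor 2"])
```

With both sensors parsed, each one can be tracked as its own value instead of only the composite temperature.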

IMO, critical_warning at least would also be good to watch.
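
For context on why critical_warning is worth watching: it is a bit field, so any non-zero value means the drive is flagging a condition. A small sketch of decoding it follows, assuming the bit meanings given in the NVMe specification for bits 0 to 4. The descriptions are paraphrased and the function name is hypothetical.

```python
# Bit positions of the Critical Warning byte (paraphrased descriptions).
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value):
    """Return the descriptions of all warning bits set in `value`."""
    return [name for bit, name in CRITICAL_WARNING_BITS.items()
            if value & (1 << bit)]


print(decode_critical_warning(0x05))
```

An alerting rule could then simply trigger whenever the reported value is not 0.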

Hi guys, we have resumed the work previously started by @mfijalko. A PR will be created within the next few days. We extended the smart plugin to gather metrics from NVMe drives using ioctl.

The plugin will gather at least these generic metrics:

  • Critical Warning
  • Composite Temperature
  • Available Spare
  • Available Spare Threshold
  • Percentage Used
  • Endurance Group Critical Warning Summary
  • Data Units Read
  • Data Units Written
  • Host Read Commands
  • Host Write Commands
  • Controller Busy Time
  • Power Cycles
  • Power On Hours
  • Unsafe Shutdowns
  • Media and Data Integrity Errors
  • Number of Error Information Log Entries
  • Warning Composite Temperature Time
  • Critical Composite Temperature Time
  • Temperature Sensor 1
  • Temperature Sensor 2
  • Temperature Sensor 3
  • Temperature Sensor 4
  • Temperature Sensor 5
  • Temperature Sensor 6
  • Temperature Sensor 7
  • Temperature Sensor 8
  • Thermal Management Temperature 1 Transition Count
  • Thermal Management Temperature 2 Transition Count
  • Total Time For Thermal Management Temperature 1
  • Total Time For Thermal Management Temperature 2

And at least these Intel-specific metrics:

  • Program Fail Count
  • Erase Fail Count
  • Wear Leveling Count
  • End to End Error Detection Count
  • CRC Error Count
  • Timed Workload, Media Wear
  • Timed Workload, Host Reads
  • Timed Workload, Timer
  • Thermal Throttle Status
  • Retry Buffer Overflow Counter
  • PLL Lock Loss Count
  • NAND Bytes Written
  • Host Bytes Written

@shastah as you can see, more temperature sensors will be available.

What do you think?

any progress?
