Collectd: RFE: smart plugin to read NVMe disks' health logs

Created on 18 Apr 2018 · 11 Comments · Source: collectd/collectd

Currently the smart plugin uses libatasmart to read disk health status. NVMe disks are becoming more popular, and it would be useful to have collectd report their health too.

Feature request

shastah

Most helpful comment

Hi guys, we have resumed the work previously started by @mfijalko. A PR will be created within the next few days. We extended the smart plugin to gather metrics from NVMe drives using ioctl.

The plugin will gather at least these generic metrics:

Critical Warning
Composite Temperature
Available Spare
Available Spare Threshold
Percentage Used
Endurance Group Critical Warning Summary
Data Units Read
Data Units Written
Host Read Commands
Host Write Commands
Controller Busy Time
Power Cycles
Power On Hours
Unsafe Shutdowns
Media and Data Integrity Errors
Number of Error Information Log Entries
Warning Composite Temperature Time
Critical Composite Temperature Time
Temperature Sensor 1
Temperature Sensor 2
Temperature Sensor 3
Temperature Sensor 4
Temperature Sensor 5
Temperature Sensor 6
Temperature Sensor 7
Temperature Sensor 8
Thermal Management Temperature 1 Transition Count
Thermal Management Temperature 2 Transition Count
Total Time For Thermal Management Temperature 1
Total Time For Thermal Management Temperature 2
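
As a rough illustration of where some of these generic metrics come from: they are fields of the 512-byte SMART / Health Information log page (Log Identifier 02h) defined by the NVMe specification. The sketch below is not the plugin's code. The enum names and both helper functions are hypothetical and only show how a few fields could be pulled out of a raw buffer. The wide counters are 128-bit little-endian values, and the sketch assumes the low 64 bits are enough in practice.

```c
#include <stdint.h>

/* Byte offsets inside the 512-byte SMART / Health Information log page.
 * The offsets follow the NVMe specification; the names are illustrative. */
enum {
  NVME_SMART_AVAIL_SPARE = 3,        /* percent, 1 byte */
  NVME_SMART_SPARE_THRESH = 4,       /* percent, 1 byte */
  NVME_SMART_PERCENT_USED = 5,       /* percent, 1 byte */
  NVME_SMART_DATA_UNITS_READ = 32,   /* 128-bit counter */
  NVME_SMART_DATA_UNITS_WRITTEN = 48,/* 128-bit counter */
  NVME_SMART_POWER_ON_HOURS = 128,   /* 128-bit counter */
};

/* Hypothetical helper: return the low 64 bits of a 128-bit little-endian
 * counter starting at `offset`. The upper 64 bits are ignored. */
static uint64_t nvme_counter_lo64(const uint8_t *log, unsigned offset) {
  uint64_t v = 0;
  for (int i = 7; i >= 0; i--)
    v = (v << 8) | log[offset + i];
  return v;
}

/* Hypothetical helper: one "data unit" is 1000 units of 512 bytes. */
static double nvme_data_units_to_bytes(uint64_t units) {
  return (double)units * 1000.0 * 512.0;
}
```

Single-byte fields such as Available Spare can be read directly, for example `log[NVME_SMART_AVAIL_SPARE]`.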

And at least these Intel-specific metrics:

Program Fail Count
Erase Fail Count
Wear Leveling Count
End to End Error Detection Count
CRC Error Count
Timed Workload, Media Wear
Timed Workload, Host Reads
Timed Workload, Timer
Thermal Throttle Status
Retry Buffer Overflow Counter
PLL Lock Loss Count
NAND Bytes Written
Host Bytes Written

@shastah as you can see, more Temperature Sensors will be available.

What do you think?

p-zak on 19 Jun 2020

👍5

All 11 comments

FYI, I have started working on this.
Right now I am including the NVMe headers from the Linux kernel and sending the appropriate ioctls.
Here’s some test output from it – this was performed on a VM with emulated NVMe disk – that’s enough for checking the ioctl path:

[2018-04-20 17:33:37] smart plugin: checking SMART status of /dev/nvme0n1.
[2018-04-20 17:33:37] available_spare           : 0%
[2018-04-20 17:33:37] available_spare_threshold : 20%
[2018-04-20 17:33:37] percentage_used           : 0%
[2018-04-20 17:33:37] data_units_read           : 196397588482
[2018-04-20 17:33:37] data_units_written        : 205791232
[2018-04-20 17:33:37] host_read_commands        : 19172
[2018-04-20 17:33:37] host_write_commands       : 213
[2018-04-20 17:33:37] controller_busy_time      : 0
[2018-04-20 17:33:37] power_cycles              : 0
[2018-04-20 17:33:37] power_on_hours            : 72
[2018-04-20 17:33:37] unsafe_shutdowns          : 0
[2018-04-20 17:33:37] media_errors              : 0
[2018-04-20 17:33:37] num_err_log_entries       : 0
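
For readers unfamiliar with the ioctl path mentioned above, here is a minimal sketch of how the SMART / Health log page could be fetched on Linux. It is not the code from this work in progress. It assumes the kernel's `linux/nvme_ioctl.h` header is installed, and the `SMART_LOG_*` macros and both function names are made up for illustration. Opening the device usually requires root.

```c
#include <fcntl.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define SMART_LOG_ID 0x02u     /* SMART / Health Information log page */
#define SMART_LOG_LEN 512u     /* size of that log page in bytes */
#define OPC_GET_LOG_PAGE 0x02u /* NVMe admin opcode: Get Log Page */

/* Illustrative helper: prepare a Get Log Page admin command that reads
 * the whole SMART log into `buf` (which must hold SMART_LOG_LEN bytes). */
static void smart_log_cmd_init(struct nvme_admin_cmd *cmd, void *buf) {
  memset(cmd, 0, sizeof(*cmd));
  cmd->opcode = OPC_GET_LOG_PAGE;
  cmd->nsid = 0xffffffffu; /* broadcast namespace ID */
  cmd->addr = (uint64_t)(uintptr_t)buf;
  cmd->data_len = SMART_LOG_LEN;
  /* cdw10: log ID in the low byte, zero-based dword count in bits 16+ */
  cmd->cdw10 = SMART_LOG_ID | (((SMART_LOG_LEN / 4u) - 1u) << 16);
}

/* Illustrative helper: returns 0 on success, -1 on any failure
 * (open error, ioctl error, or a non-zero NVMe status). */
static int smart_log_read(const char *dev, uint8_t buf[SMART_LOG_LEN]) {
  struct nvme_admin_cmd cmd;
  int fd = open(dev, O_RDONLY);
  if (fd < 0)
    return -1;
  smart_log_cmd_init(&cmd, buf);
  int status = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
  close(fd);
  return status == 0 ? 0 : -1;
}
```

A caller would pass a device path such as `/dev/nvme0n1` and then decode the returned 512 bytes.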

Questions for the community about the implementation before I go further:

Will having raw ioctls in the smart plugin be accepted?
If not, do we want to extend libatasmart with NVMe support? It does not appear to have been maintained for a few years, so I doubt patches would be accepted there. Oh, and with NVMe support a better name would be just libsmart :)
One last path would be to get rid of libatasmart and have ioctls in the smart plugin for both SATA and NVMe.

smartmontools has NVMe SMART support, but it is not available as a library (and I guess it won't be - see https://www.smartmontools.org/ticket/501).

mfijalko on 25 Apr 2018

Hi!

Thanks for your work on Collectd.

Can you please create a PR for this? Then we can see your work in progress, and it will let us make suggestions and so on.

A quick search finds https://github.com/hgst/libnvme.
Maybe that project can be useful in this work, or vice versa?

rpv-tomsk on 27 Apr 2018

Here’s some test output from it – this was performed on a VM with emulated NVMe disk

Can you please provide a short guide to your setup?

rpv-tomsk on 27 Apr 2018

@rpv-tomsk following up on this. Will post updates soon. Thanks for your review.

sunkuranganath on 18 Mar 2019

any progress?

seiuneko on 1 May 2019

@NebulaNeko This item is in our backlog, as we are focusing on enabling a few other plugins at the moment.
It would help our team if anyone could share how this feature would help them. Any info on how you would use it in your deployments would help.

sunkuranganath on 1 May 2019

The following metrics are being considered for this plugin.
Let us know if there are any specific metrics that you don't see below:

Composite Temperature
Available Spare
Available Spare Threshold
Percentage Used
Data Units Read
Data Units Written
Host Read Commands
Host Write Commands
Controller Busy Time
Power Cycles
Power On Hours
Unsafe Shutdowns
Media and Data Integrity Errors
Number of Error Information Log Entries

sunkuranganath on 13 May 2019

That looks good.

One of the NVMe disks at my disposal (Samsung SSD 960 EVO) reports data from two temperature sensors:

# nvme --smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 39 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 7%
data_units_read                     : 3,024,557
data_units_written                  : 87,710,216
host_read_commands                  : 158,877,287
host_write_commands                 : 1,939,820,563
controller_busy_time                : 25,433
power_cycles                        : 12
power_on_hours                      : 8,718
unsafe_shutdowns                    : 5
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 39 C    <-----------
Temperature Sensor 2                : 48 C    <-----------
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0

It would be nice to be able to collect both of them.
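
For illustration, the per-sensor readings shown above live in the same SMART / Health log page as the composite temperature: eight 16-bit little-endian fields starting at byte 200, each in Kelvin, where 0 means the sensor is not implemented. The sketch below is an assumption about how they could be decoded, not plugin code. The macro and function names are hypothetical, and the conversion subtracts 273, as the whole-degree values in the output above suggest.

```c
#include <stdint.h>

#define TEMP_SENSOR_BASE 200 /* byte offset of Temperature Sensor 1 */
#define TEMP_SENSOR_COUNT 8  /* the log page has room for 8 sensors */

/* Illustrative helper: decode Temperature Sensor `n` (1..8) from a raw
 * 512-byte SMART log. Returns 1 and stores degrees Celsius in *celsius
 * if the sensor is implemented, 0 otherwise. */
static int nvme_temp_sensor_celsius(const uint8_t *log, int n, int *celsius) {
  if (n < 1 || n > TEMP_SENSOR_COUNT)
    return 0;
  unsigned off = TEMP_SENSOR_BASE + 2u * (unsigned)(n - 1);
  unsigned kelvin = log[off] | ((unsigned)log[off + 1] << 8);
  if (kelvin == 0) /* a value of 0 means "sensor not implemented" */
    return 0;
  *celsius = (int)kelvin - 273;
  return 1;
}
```

Looping over `n` from 1 to 8 and skipping sensors that return 0 would yield exactly the two readings on a drive like the one above.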

shastah on 14 May 2019

imo at least critical_warning would also be good to watch
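
To make the suggestion concrete: critical_warning is byte 0 of the SMART / Health log page and is a bit field, so watching it means checking individual bits rather than a single number. The sketch below is only an assumption about how a consumer might report it. The table and function name are hypothetical, the bit meanings are paraphrased from the NVMe specification, and only bits 0 to 4 are covered.

```c
#include <stdint.h>
#include <stdio.h>

/* Paraphrased meanings of critical warning bits 0..4. */
static const char *const cw_bit_names[] = {
    "available spare below threshold",      /* bit 0 */
    "temperature outside threshold",        /* bit 1 */
    "NVM subsystem reliability degraded",   /* bit 2 */
    "media placed in read-only mode",       /* bit 3 */
    "volatile memory backup device failed", /* bit 4 */
};

/* Illustrative helper: print one line per set warning bit and return
 * how many of the known bits (0..4) are set. */
static int cw_report(const char *dev, uint8_t critical_warning) {
  int count = 0;
  unsigned nbits = sizeof(cw_bit_names) / sizeof(cw_bit_names[0]);
  for (unsigned bit = 0; bit < nbits; bit++) {
    if (critical_warning & (1u << bit)) {
      printf("%s: critical warning: %s\n", dev, cw_bit_names[bit]);
      count++;
    }
  }
  return count;
}
```

A healthy drive reports 0 here, as in the nvme-cli output earlier in the thread, so any non-zero value is worth alerting on.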

kkepka on 14 May 2019

Hi guys, we have resumed the work previously started by @mfijalko. A PR will be created within the next few days. We extended the smart plugin to gather metrics from NVMe drives using ioctl.

The plugin will gather at least these generic metrics: