Currently the smart plugin uses libatasmart to read disk health status. NVMe drives are getting more popular, and it would be useful to have collectd report their health too.
FYI, I have started working on this.
Right now I am including the NVMe headers from the Linux kernel and sending the appropriate ioctls.
Here is some test output from it. It was produced on a VM with an emulated NVMe disk, which is enough for checking the ioctl path:
[2018-04-20 17:33:37] smart plugin: checking SMART status of /dev/nvme0n1.
[2018-04-20 17:33:37] available_spare : 0%
[2018-04-20 17:33:37] available_spare_threshold : 20%
[2018-04-20 17:33:37] percentage_used : 0%
[2018-04-20 17:33:37] data_units_read : 196397588482
[2018-04-20 17:33:37] data_units_written : 205791232
[2018-04-20 17:33:37] host_read_commands : 19172
[2018-04-20 17:33:37] host_write_commands : 213
[2018-04-20 17:33:37] controller_busy_time : 0
[2018-04-20 17:33:37] power_cycles : 0
[2018-04-20 17:33:37] power_on_hours : 72
[2018-04-20 17:33:37] unsafe_shutdowns : 0
[2018-04-20 17:33:37] media_errors : 0
[2018-04-20 17:33:37] num_err_log_entries : 0
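For readers following along, here is a minimal sketch of what the ioctl path described above could look like. It is not the code from the work in progress: the helper names `smart_build_cmd` and `smart_read_nvme_log` are hypothetical, and the opcode, log ID and CDW10 layout are taken from the NVMe Get Log Page command as I understand it, so check them against the spec revision you target.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

#define NVME_ADMIN_GET_LOG_PAGE 0x02 /* admin opcode: Get Log Page */
#define NVME_LOG_SMART 0x02          /* log ID: SMART / Health Information */
#define NVME_SMART_LOG_LEN 512       /* size of that log page in bytes */

/* Hypothetical helper: fill in a Get Log Page command for the SMART log. */
static void smart_build_cmd(struct nvme_admin_cmd *cmd, void *buf) {
  assert(cmd != NULL && buf != NULL);
  memset(cmd, 0, sizeof(*cmd));
  cmd->opcode = NVME_ADMIN_GET_LOG_PAGE;
  cmd->nsid = 0xffffffff; /* controller-wide (global) log */
  cmd->addr = (uint64_t)(uintptr_t)buf;
  cmd->data_len = NVME_SMART_LOG_LEN;
  /* CDW10: low byte is the log ID; the zero-based dword count starts at bit 16 */
  cmd->cdw10 = NVME_LOG_SMART | (((NVME_SMART_LOG_LEN / 4) - 1) << 16);
}

/* Hypothetical helper: read the raw log page from dev_path.
 * Returns 0 on success, -1 if the device cannot be opened or the
 * ioctl reports an error (negative errno or a non-zero NVMe status). */
static int smart_read_nvme_log(const char *dev_path,
                               uint8_t buf[NVME_SMART_LOG_LEN]) {
  struct nvme_admin_cmd cmd;
  int fd = open(dev_path, O_RDONLY);
  if (fd < 0)
    return -1;
  smart_build_cmd(&cmd, buf);
  int status = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
  close(fd);
  return (status == 0) ? 0 : -1;
}
```

Calling `smart_read_nvme_log("/dev/nvme0n1", buf)` would normally need root privileges; the raw buffer still has to be decoded field by field before it can be dispatched as metrics.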
Questions for the community about the implementation, before I go further:
smartmontools supports NVMe SMART data, but it is not available as a library (and I guess it won't be; see https://www.smartmontools.org/ticket/501).
Hi!
Thanks for your work on Collectd.
Could you please create a PR for this? We could then see your work in progress, and it would let us make suggestions and so on.
A quick search turns up https://github.com/hgst/libnvme.
Maybe that project can be useful in this work, or vice versa?
Here’s some test output from it – this was performed on a VM with emulated NVMe disk
Could you please provide a short guide to your setup?
@rpv-tomsk following up on this. Will post updates soon. Thanks for your review.
any progress?
@NebulaNeko This item is in our backlog, as we are focusing on enabling a few other plugins at the moment.
What would help our team is if anyone could share how this would help them. Any info on how you would use this in your deployments would help.
The following metrics are being considered for this plugin.
Let us know if there are any specific metrics that you don't see below:
Composite Temperature
Available Spare
Available Spare Threshold
Percentage Used
Data Units Read
Data Units Written
Host Read Commands
Host Write Commands
Controller Busy Time
Power Cycles
Power On Hours
Unsafe Shutdowns
Media and Data Integrity Errors
Number of Error Information Log Entries
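As a rough illustration of how the counters in this list could be pulled out of the raw 512-byte SMART / Health Information log page, here is a sketch. The byte offsets are my reading of the NVMe log page layout and should be verified against the spec; the `OFF_*` names and `nvme_le128_to_double` are hypothetical, and converting to `double` is only an assumption about how the values would be dispatched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NVME_SMART_LOG_LEN 512

/* Assumed byte offsets inside the SMART / Health Information log page. */
enum {
  OFF_AVAILABLE_SPARE = 3,     /* 1 byte, percent */
  OFF_SPARE_THRESHOLD = 4,     /* 1 byte, percent */
  OFF_PERCENTAGE_USED = 5,     /* 1 byte, percent (may exceed 100) */
  OFF_DATA_UNITS_READ = 32,    /* 16-byte little-endian counters from here */
  OFF_DATA_UNITS_WRITTEN = 48,
  OFF_HOST_READ_COMMANDS = 64,
  OFF_HOST_WRITE_COMMANDS = 80,
  OFF_CONTROLLER_BUSY_TIME = 96,
  OFF_POWER_CYCLES = 112,
  OFF_POWER_ON_HOURS = 128,
  OFF_UNSAFE_SHUTDOWNS = 144,
  OFF_MEDIA_ERRORS = 160,
  OFF_NUM_ERR_LOG_ENTRIES = 176
};

/* Convert a 16-byte little-endian counter to a double.
 * Values above 2^53 lose precision, which is a trade-off of this sketch. */
static double nvme_le128_to_double(const uint8_t *p) {
  assert(p != NULL);
  double result = 0.0;
  for (int i = 15; i >= 0; i--)
    result = result * 256.0 + (double)p[i];
  return result;
}
```

With a buffer `page` filled from the device, `nvme_le128_to_double(page + OFF_POWER_ON_HOURS)` would yield the Power On Hours value, and `page[OFF_AVAILABLE_SPARE]` the Available Spare percentage.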
That looks good.
One of the NVMe disks at my disposal (Samsung SSD 960 EVO) reports data from two temperature sensors:
# nvme --smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 39 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 7%
data_units_read : 3,024,557
data_units_written : 87,710,216
host_read_commands : 158,877,287
host_write_commands : 1,939,820,563
controller_busy_time : 25,433
power_cycles : 12
power_on_hours : 8,718
unsafe_shutdowns : 5
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 39 C <-----------
Temperature Sensor 2 : 48 C <-----------
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
It would be nice to be able to collect both of them.
IMO critical_warning, at least, would also be good to watch.
Hi guys, we have resumed the work previously started by @mfijalko. A PR will be created within the next few days. We extended the smart plugin to gather metrics from NVMe drives using ioctl.
The plugin will gather at least these generic metrics:
And at least these Intel-specific metrics:
@shastah as you see, there will be more Temperature Sensors available.
What do you think?
any progress?