As far as I can tell, the only way to programmatically check the status of a driver on a Nomad client is to query the /v1/node/:node_id API endpoint and inspect the response. In situations where a driver fails but the cluster still has capacity to place the workload on another node, the driver failure could easily go unnoticed.
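For reference, here's a minimal sketch of that polling approach in Go. It assumes a Nomad 0.8+ agent on localhost and that the node payload exposes a `Drivers` map with `Detected`/`Healthy`/`HealthDescription` fields; verify the field names against your Nomad version.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// driverInfo mirrors the driver health fields I believe the node
// payload exposes in Nomad 0.8+ (check your version's API docs).
type driverInfo struct {
	Detected          bool
	Healthy           bool
	HealthDescription string
}

type nodeResponse struct {
	Name    string
	Drivers map[string]driverInfo
}

func main() {
	nodeID := os.Args[1] // pass the client node's ID
	resp, err := http.Get("http://127.0.0.1:4646/v1/node/" + nodeID)
	if err != nil {
		log.Fatalf("querying node endpoint: %v", err)
	}
	defer resp.Body.Close()

	var node nodeResponse
	if err := json.NewDecoder(resp.Body).Decode(&node); err != nil {
		log.Fatalf("decoding node payload: %v", err)
	}

	// Flag any detected driver that reports itself unhealthy.
	for name, d := range node.Drivers {
		if d.Detected && !d.Healthy {
			fmt.Printf("node %s: driver %q unhealthy: %s\n",
				node.Name, name, d.HealthDescription)
		}
	}
}
```

This works, but it means every operator has to build and schedule their own poller per node, which is the gap this issue is about.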
It would be helpful if there were an easier way to monitor the health of a Nomad client node's drivers, which could in turn be integrated into an alerting system. One idea: register the detected drivers in Consul as health checks under the Nomad client's catalog entry. Each health check would be updated as the driver's health changes, allowing for easier operation and better observability of cluster issues.
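To make the proposal concrete, here's a rough sketch of what that Consul integration could look like using the official Consul Go client: one TTL check per detected driver, attached to the Nomad client's existing service registration. The service ID `nomad-client` and the check ID scheme are hypothetical, not anything Nomad does today.

```go
package main

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	agent := client.Agent()

	// One TTL check per detected driver, attached to the Nomad
	// client's service registration (IDs here are hypothetical).
	checkID := "nomad-client-driver-docker"
	reg := &consul.AgentCheckRegistration{
		ID:        checkID,
		Name:      "Nomad driver: docker",
		ServiceID: "nomad-client", // assumes the client service is registered
		AgentServiceCheck: consul.AgentServiceCheck{
			TTL: "30s",
		},
	}
	if err := agent.CheckRegister(reg); err != nil {
		log.Fatalf("registering check: %v", err)
	}

	// Whenever driver fingerprinting observes a health change,
	// the client would push the new status to Consul.
	if err := agent.UpdateTTL(checkID, "docker driver healthy", consul.HealthPassing); err != nil {
		log.Fatalf("updating TTL: %v", err)
	}
}
```

With something like this in place, any tooling that already alerts on Consul check state would pick up driver failures for free.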
cc @stevenscg
Good call. It would potentially be interesting to emit metrics based on driver/plugin health for folks who run alerting through them, too.
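As a rough illustration of that metrics idea (not an existing Nomad metric), a 0/1 gauge per driver via go-metrics, which Nomad already uses internally, would let alerting pipelines watch for transitions to 0:

```go
package main

import (
	"log"
	"time"

	metrics "github.com/armon/go-metrics"
)

// emitDriverHealth publishes a 0/1 gauge per driver. The metric name
// "nomad.client.driver.healthy" is hypothetical.
func emitDriverHealth(driver string, healthy bool) {
	val := float32(0)
	if healthy {
		val = 1
	}
	metrics.SetGaugeWithLabels(
		[]string{"client", "driver", "healthy"},
		val,
		[]metrics.Label{{Name: "driver", Value: driver}},
	)
}

func main() {
	// In-memory sink just for the example; a real agent would wire
	// up its configured statsd/Prometheus sink instead.
	sink := metrics.NewInmemSink(10*time.Second, time.Minute)
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("nomad"), sink); err != nil {
		log.Fatal(err)
	}

	emitDriverHealth("docker", false) // alert would fire on the 0 gauge
}
```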