Longhorn: [FEATURE] Prometheus support

Created on 13 Apr 2020  ·  16 Comments  ·  Source: longhorn/longhorn

Is your feature request related to a problem? Please describe.
I would like to monitor the longhorn stuff using Prometheus.

Describe the solution you'd like
Expose metrics and provide a ServiceMonitor in the Helm chart.

Describe alternatives you've considered
The UI already has a bunch of graphs and health information, but you cannot get alerted in case of problems.

area/manager enhancement highlight priority/1 require/LEP require/doc require/manual-test-plan

All 16 comments

@yasker For the moment, how do you recommend that we monitor volume usage so the volumes don't fill up?

Also very interested in how to do this.
There are always ways, such as querying the API via the blackbox exporter, but that's kind of ugly.

Happy to give it a try as soon as it is available in an RC.

Also looking forward to this!
With Rook, the volume usage showed up within Kubernetes as kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes. The latter, by the way, already seems to be reported.
I guess that depends on how the storage driver integrates with Kubernetes.
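
As a rough illustration of the ratio mentioned above, here is a minimal Python sketch that parses Prometheus text-format samples and computes used / capacity per PVC. It assumes the samples carry the `namespace` and `persistentvolumeclaim` labels; the helper names `parse_samples` and `volume_usage_ratio` are made up for this example and are not part of Longhorn or Kubernetes.

```python
import re

# Matches one sample line of the Prometheus text exposition format, e.g.
#   metric_name{label="value"} 123.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def volume_usage_ratio(text):
    """Return {(namespace, pvc): used_bytes / capacity_bytes}.

    Assumption: both metrics carry `namespace` and `persistentvolumeclaim`
    labels. Volumes with missing or zero capacity are skipped.
    """
    used, capacity = {}, {}
    for name, labels, value in parse_samples(text):
        key = (labels.get("namespace"), labels.get("persistentvolumeclaim"))
        if name == "kubelet_volume_stats_used_bytes":
            used[key] = value
        elif name == "kubelet_volume_stats_capacity_bytes":
            capacity[key] = value
    return {key: used[key] / capacity[key] for key in used if capacity.get(key)}
```

In practice the same ratio would more likely be written directly as a PromQL expression; the sketch only shows what that expression computes.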

Very interested in access to kubelet_volume_stats_available_bytes to allow Strimzi dashboards to work without modification

Pre-merged Checklist

  • [x] Does the PR include the explanation for the fix or the feature?

  • [x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
    The PR is at longhorn/longhorn-manager#679, https://github.com/longhorn/longhorn-manager/pull/684, https://github.com/longhorn/longhorn-manager/pull/687, https://github.com/longhorn/longhorn-manager/pull/689

  • [x] Is the reproduce steps/test steps documented?

  • [x] Which areas/issues this PR might have potential impacts on?
    Area: Longhorn upgrade (longhorn manager needs an additional API group, metrics.k8s.io)

  • [x] If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

  • [x] If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at

  • [x] if labeled: require/doc Has the necessary document PR submitted or merged?
    The Doc issue/PR is at https://github.com/longhorn/website/pull/195

  • [x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
    The automation skeleton PR is at
    The automation test case PR is at

  • [x] if labeled: require/automation-engine Has the engine integration test been merged?
    The engine automation PR is at

  • [x] if labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at https://github.com/longhorn/longhorn-tests/pull/416

In the coming v1.1.0 release, the code for this feature will be divided into 5 parts corresponding to 5 PRs:

  • Part 1 (the PR at https://github.com/longhorn/longhorn-manager/pull/679) contains metrics:

    1. longhorn_volume_capacity_bytes
    2. longhorn_volume_actual_size_bytes
    3. longhorn_node_status
  • Part 2 (the PR at https://github.com/longhorn/longhorn-manager/pull/684) contains metrics:

    1. longhorn_instance_manager_cpu_requests_millicpu
    2. longhorn_instance_manager_cpu_usage_millicpu
    3. longhorn_instance_manager_memory_requests_bytes
    4. longhorn_instance_manager_memory_usage_bytes
    5. longhorn_manager_cpu_usage_millicpu
    6. longhorn_manager_memory_usage_bytes
  • Part 3 (the PR is at https://github.com/longhorn/longhorn-manager/pull/687) contains the following metrics:

    1. longhorn_node_count_total
    2. longhorn_volume_state
    3. longhorn_volume_robustness
  • Part 4 (the PR is at https://github.com/longhorn/longhorn-manager/pull/689) contains the following metrics:

    1. longhorn_node_cpu_capacity_millicpu
    2. longhorn_node_cpu_usage_millicpu
    3. longhorn_node_memory_capacity_bytes
    4. longhorn_node_memory_usage_bytes
  • Part 5 is tracked at the issue https://github.com/longhorn/longhorn/issues/1832. It contains the following metrics:

    1. longhorn_disk_capacity_bytes
    2. longhorn_disk_usage_bytes
    3. longhorn_node_capacity_bytes
    4. longhorn_node_usage_bytes

Because we are restructuring Longhorn's backup mechanism and we want the cleanest possible implementation for the Prometheus metrics, we have decided to push back the following metrics to the v1.2.0 release:

  • longhorn_backup_stats_number_failed_backups
  • longhorn_backup_stats_number_succeed_backups
  • longhorn_backup_stats_backup_status (status for this backup (0=InProgress,1=Done,2=Failed))
  • longhorn_volume_iops
  • longhorn_volume_readthroughput
  • longhorn_volume_writethroughput
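
To make the status encoding above concrete, here is a small Python sketch of how a consumer might decode the numeric value of longhorn_backup_stats_backup_status. The metric was deferred at this point, so this only mirrors the 0/1/2 mapping quoted in the list; the names `BackupStatus` and `describe_backup_status` are hypothetical.

```python
from enum import IntEnum


class BackupStatus(IntEnum):
    """Numeric encoding quoted above: 0=InProgress, 1=Done, 2=Failed."""
    IN_PROGRESS = 0
    DONE = 1
    FAILED = 2


def describe_backup_status(value):
    """Map a sample value (Prometheus reports floats) to a status name.

    Raises ValueError for values outside the documented 0..2 range.
    """
    return BackupStatus(int(value)).name
```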

Basic verifying/testing steps for QA:

  1. Create a Prometheus-Alertmanager-Grafana system using the Prometheus Operator.
  2. Navigate to the Prometheus server's web UI. Verify that Prometheus successfully discovers all longhorn-manager targets.
  3. Verify that Longhorn correctly exposes the above metrics.
  4. Deploy workloads that use Longhorn volumes into the cluster. Verify that there is no abnormal data, e.g. a volume capacity of 0 or CPU usage over 4000 millicpu.
  5. Attach a volume. Detach the volume. Verify that the volume's information is reported by at most 1 longhorn-manager at any time.
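
The checks in steps 3 to 5 could be partly scripted. Below is a minimal Python sketch that works on already-scraped samples given as (name, labels, value) tuples. It rests on assumptions: each volume sample is assumed to carry a `volume` label and the Prometheus `instance` label identifying the longhorn-manager target it was scraped from, and the function names are invented for this example. It is not part of the Longhorn test suite.

```python
from collections import defaultdict

# The 16 metric names from parts 1-4 listed in this thread.
EXPECTED_METRICS = {
    "longhorn_volume_capacity_bytes",
    "longhorn_volume_actual_size_bytes",
    "longhorn_node_status",
    "longhorn_instance_manager_cpu_requests_millicpu",
    "longhorn_instance_manager_cpu_usage_millicpu",
    "longhorn_instance_manager_memory_requests_bytes",
    "longhorn_instance_manager_memory_usage_bytes",
    "longhorn_manager_cpu_usage_millicpu",
    "longhorn_manager_memory_usage_bytes",
    "longhorn_node_count_total",
    "longhorn_volume_state",
    "longhorn_volume_robustness",
    "longhorn_node_cpu_capacity_millicpu",
    "longhorn_node_cpu_usage_millicpu",
    "longhorn_node_memory_capacity_bytes",
    "longhorn_node_memory_usage_bytes",
}


def missing_metrics(samples):
    """Step 3: return expected metric names that never appear."""
    seen = {name for name, _labels, _value in samples}
    return sorted(EXPECTED_METRICS - seen)


def abnormal_samples(samples, max_millicpu=4000):
    """Step 4: flag zero volume capacity and implausibly high CPU usage."""
    abnormal = []
    for name, labels, value in samples:
        if name == "longhorn_volume_capacity_bytes" and value == 0:
            abnormal.append((name, labels, value))
        elif name.endswith("_cpu_usage_millicpu") and value > max_millicpu:
            abnormal.append((name, labels, value))
    return abnormal


def volumes_with_multiple_reporters(samples):
    """Step 5: return volumes reported by more than one manager target.

    Assumption: `volume` and `instance` labels are present on each sample.
    """
    reporters = defaultdict(set)
    for name, labels, _value in samples:
        if name == "longhorn_volume_capacity_bytes":
            reporters[labels["volume"]].add(labels["instance"])
    return {vol: sorted(inst) for vol, inst in reporters.items() if len(inst) > 1}
```

All three functions returning empty results would correspond to a pass for those steps; the 4000 millicpu threshold is taken from step 4 and can be adjusted per cluster.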

Steps for testing with the Rancher monitoring system, Prometheus Alertmanager, and Grafana are in the longhorn-tests PR https://github.com/longhorn/longhorn-tests/pull/416/files and the documentation PR: longhorn/website#195

Hi @PhanLe1010 Are there any plans to name these metrics based on a standard we see across other CSI drivers like rook-ceph, openebs, nfs-client, heketi and others?

e.g...

kubelet_volume_stats_available_bytes
kubelet_volume_stats_used_bytes
kubelet_volume_stats_inodes
kubelet_volume_stats_inodes_free
kubelet_volume_stats_inodes_used

Using these metric names would integrate with the built in prometheus-operator alerts and grafana dashboards.

Edit: This looks like it's up for part 4, would be cool if you considered this. Thanks!

@onedr0p Those are named kubelet_volume_stats_*, are those coming from Kubelet? Can you give us the reference documents for those metrics names?

At first glance, it seems that the metrics are related to the internal metrics of a filesystem inside the Longhorn raw block device. Is it correct? Right now, we are building metrics for the Longhorn block devices only. Longhorn doesn't manage the filesystem inside the block devices so getting filesystem information will require a bit more work. We are planning to expose filesystem metrics in the future but not in the coming v1.1.0 release, unfortunately.

By the built-in prometheus-operator alerts and Grafana dashboards, do you mean the Prometheus server, Alertmanager, and Grafana that already exist inside the cluster?

@PhanLe1010 From what I can tell from searching the internet, CSI drivers need to implement these metrics, although I'm having a hard time finding much documentation on it. There's this issue and the linked issues/PRs, which might contain more information. I know openebs and rook-ceph implement this, so it might be worth checking with them too. If anyone has more information on this, that would be helpful. If not, I'll continue to dig around.

The prometheus-operator/kube-prometheus helm chart comes installed with default grafana dashboards and alertmanager rules which require no interaction from the user to install and configure. Having the longhorn metrics named and labeled like I mentioned above would make it so I do not need to maintain another dashboard or alerting rules.

Edit: Here is what I found browsing thru the Kubernetes source https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_metrics.go#L65

Edit 2: Here looks to be the discussion of the documentation https://github.com/container-storage-interface/spec/issues/253 which led me to https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetvolumestats

Edit 3: Here is a PR enabling these stats https://github.com/digitalocean/csi-digitalocean/pull/197 and another one https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/524

I hope those additional links clear it up. Thanks!

@PhanLe1010 Do you think it would be better to open a new issue with my findings to have Longhorn support kubelet_volume_stats_*? I do think this is what the community wanted out of this issue.

The additional metrics you are providing here are very helpful, don't get me wrong.

The prometheus-operator/kube-prometheus-stack Helm chart already includes dashboards (https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/templates/grafana/dashboards/persistentvolumesusage.yaml) and alerts (https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/templates/prometheus/rules/kubernetes-storage.yaml) that use the kubelet_volume_stats_* metrics.
This chart including the alerts and dashboards is also used in monitoring v2 of Rancher 2.5 (https://github.com/rancher/charts/blob/dev-v2.5-source/packages/rancher-monitoring/package.yaml).

So it would be nice, if Longhorn also provided these metrics.
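
For readers unfamiliar with what such alerts evaluate, here is a minimal Python sketch of a "volume filling up" condition built on kubelet_volume_stats_available_bytes and kubelet_volume_stats_capacity_bytes. The 3% default threshold and the function name `pvc_filling_up` are assumptions for illustration; the actual rules shipped in the chart may use different thresholds and extra conditions.

```python
def pvc_filling_up(available_bytes, capacity_bytes, min_free_ratio=0.03):
    """Return True when the free fraction of a volume drops below the threshold.

    available_bytes and capacity_bytes stand for the values of
    kubelet_volume_stats_available_bytes and
    kubelet_volume_stats_capacity_bytes for one PVC.
    The default threshold is an illustrative assumption.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return available_bytes / capacity_bytes < min_free_ratio
```

For example, `pvc_filling_up(20, 1000)` is true because only 2% of the volume is free, while `pvc_filling_up(500, 1000)` is false.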

@onedr0p @bashofmann
Thank you so much for your suggestions and super helpful reference documents!

Yes, we will definitely support and provide these metrics in the future:

kubelet_volume_stats_available_bytes
kubelet_volume_stats_used_bytes
kubelet_volume_stats_inodes
kubelet_volume_stats_inodes_free
kubelet_volume_stats_inodes_used

As you already know, these metrics are provided by Kubelet (which in turn queries the Longhorn CSI driver plugin), and they measure filesystem-related information inside a Longhorn block device. The current longhorn_volume_* metrics are exposed via the Longhorn manager pods and measure information specific to the Longhorn block device. Therefore, we will add the kubelet_volume_stats_* metrics but not rename the existing longhorn_volume_* metrics to the kubelet_volume_stats_* format.

@onedr0p You are very welcome to create a new issue with your findings to track the progress of the kubelet_volume_stats_* metrics implementation! Thank you again for your contribution!

Update: Longhorn will support those metrics in the next release (v1.1.0). The feature is tracked at longhorn/longhorn#1821

Verified with longhorn-master 09/30/2020

Validation - Pass

  1. Followed the doc to set up Prometheus and Grafana.
  2. Tested with rancher app monitoring and deploying serviceMonitor.
  3. The alert with email and slack is verified too.

The below metrics are available and working fine:

  1. longhorn_volume_capacity_bytes
  2. longhorn_volume_actual_size_bytes
  3. longhorn_node_status
  4. longhorn_instance_manager_cpu_requests_millicpu
  5. longhorn_instance_manager_cpu_usage_millicpu
  6. longhorn_instance_manager_memory_requests_bytes
  7. longhorn_instance_manager_memory_usage_bytes
  8. longhorn_manager_cpu_usage_millicpu
  9. longhorn_manager_memory_usage_bytes
  10. longhorn_node_count_total
  11. longhorn_volume_state
  12. longhorn_volume_robustness
  13. longhorn_node_cpu_capacity_millicpu
  14. longhorn_node_cpu_usage_millicpu
  15. longhorn_node_memory_capacity_bytes
  16. longhorn_node_memory_usage_bytes