Longhorn: [FEATURE] Prometheus support

Created on 13 Apr 2020 · 16Comments · Source: longhorn/longhorn

Is your feature request related to a problem? Please describe.
I would like to monitor the longhorn stuff using Prometheus.

Describe the solution you'd like
Expose metrics and provide a servicemonitor in the Helm Chart.

Describe alternatives you've considered
The UI already have a bunch of graphs and health information but you cannot get alerted in case of problems.

area/manager enhancement highlight priority/1 require/LEP require/doc require/manual-test-plan

runningman84

👍18 🚀1

All 16 comments

@yasker For the moment, how do you recommend that we monitor volume usage so the volumes don't get full?

timmy59100 on 26 May 2020

@yasker For the moment, how do you recommend that we monitor volume usage so the volumes don't get full?

Also very interested in how to do this.
There are always ways, such as querying the API via the blackbox exporter, but that's kind of ugly.

tabnul on 5 Jun 2020

Happy to give it a try as soon as it is available in an RC.

abuisine on 9 Jun 2020

Also looking forward to this!
With rook, the volume usage showed up within Kubernetes as kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes. The latter, by the way, already seems to be reported.
Which, I guess, depends on the way the storage driver integrates with Kubernetes.
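
The ratio above can be computed from scraped kubelet metrics. Below is a minimal Python sketch, assuming the metrics arrive in the Prometheus text format with namespace and persistentvolumeclaim labels. The sample values, the PVC names, and the 0.8 threshold are made up for illustration.

```python
import re

# Hypothetical sample in the Prometheus text format; values are invented.
SAMPLE = """\
kubelet_volume_stats_capacity_bytes{namespace="default",persistentvolumeclaim="data-0"} 1.0e+10
kubelet_volume_stats_used_bytes{namespace="default",persistentvolumeclaim="data-0"} 8.5e+09
kubelet_volume_stats_capacity_bytes{namespace="default",persistentvolumeclaim="data-1"} 2.0e+10
kubelet_volume_stats_used_bytes{namespace="default",persistentvolumeclaim="data-1"} 1.0e+09
"""

LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return {metric_name: {sorted_label_tuple: value}} for labeled samples."""
    samples = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and unlabeled samples
        name, labels, value = match.groups()
        key = tuple(sorted(LABEL.findall(labels)))
        samples.setdefault(name, {})[key] = float(value)
    return samples


def usage_ratios(samples):
    """Map each PVC name to used_bytes / capacity_bytes."""
    used = samples.get("kubelet_volume_stats_used_bytes", {})
    capacity = samples.get("kubelet_volume_stats_capacity_bytes", {})
    ratios = {}
    for key, used_bytes in used.items():
        capacity_bytes = capacity.get(key)
        if capacity_bytes:  # ignore missing or zero capacity
            ratios[dict(key)["persistentvolumeclaim"]] = used_bytes / capacity_bytes
    return ratios


def nearly_full(ratios, threshold=0.8):
    """Return the PVC names whose usage ratio is at or above the threshold."""
    return sorted(pvc for pvc, ratio in ratios.items() if ratio >= threshold)


ratios = usage_ratios(parse(SAMPLE))
print(nearly_full(ratios))
```

In a real setup the same division would normally be written as a Prometheus alerting rule; the Python version only spells out the arithmetic.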

strowi on 4 Aug 2020

Very interested in access to kubelet_volume_stats_available_bytes to allow Strimzi dashboards to work without modification

tomdoherty on 5 Aug 2020

👍1

Pre-merged Checklist

[x] Does the PR include the explanation for the fix or the feature?
[x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
The PR is at longhorn/longhorn-manager#679, https://github.com/longhorn/longhorn-manager/pull/684, https://github.com/longhorn/longhorn-manager/pull/687, https://github.com/longhorn/longhorn-manager/pull/689
[x] Is the reproduce steps/test steps documented?
[x] Which areas/issues this PR might have potential impacts on?
Area: Longhorn upgrade (longhorn manager needs an additional API group, metrics.k8s.io)
[x] If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at
[x] If labeled: area/ui Has the UI issue filed or ready to be merged?
The UI issue/PR is at
[x] if labeled: require/doc Has the necessary document PR submitted or merged?
The Doc issue/PR is at https://github.com/longhorn/website/pull/195
[x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
The automation skeleton PR is at
The automation test case PR is at
[x] if labeled: require/automation-engine Has the engine integration test been merged?
The engine automation PR is at
[x] if labeled: require/manual-test-plan Has the manual test plan been documented?
The updated manual test plan is at https://github.com/longhorn/longhorn-tests/pull/416

longhorn-io-github-bot on 15 Sep 2020

In the coming v1.1.0 release, the code for this feature will be divided into 5 parts, corresponding to 5 PRs:

Part 1 (the PR at https://github.com/longhorn/longhorn-manager/pull/679) contains metrics:
1. longhorn_volume_capacity_bytes
2. longhorn_volume_actual_size_bytes
3. longhorn_node_status
Part 2 (the PR at https://github.com/longhorn/longhorn-manager/pull/684) contains metrics:
1. longhorn_instance_manager_cpu_requests_millicpu
2. longhorn_instance_manager_cpu_usage_millicpu
3. longhorn_instance_manager_memory_requests_bytes
4. longhorn_instance_manager_memory_usage_bytes
5. longhorn_manager_cpu_usage_millicpu
6. longhorn_manager_memory_usage_bytes
Part 3 (the PR is at https://github.com/longhorn/longhorn-manager/pull/687) contains the following metrics:
1. longhorn_node_count_total
2. longhorn_volume_state
3. longhorn_volume_robustness
Part 4 (the PR is at https://github.com/longhorn/longhorn-manager/pull/689) contains the following metrics:
1. longhorn_node_cpu_capacity_millicpu
2. longhorn_node_cpu_usage_millicpu
3. longhorn_node_memory_capacity_bytes
4. longhorn_node_memory_usage_bytes
Part 5 is tracked at the issue https://github.com/longhorn/longhorn/issues/1832. It contains the following metrics:
1. longhorn_disk_capacity_bytes
2. longhorn_disk_usage_bytes
3. longhorn_node_capacity_bytes
4. longhorn_node_usage_bytes
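
For readers unfamiliar with how such gauges look on the wire, here is a minimal Python sketch that renders one of the metrics above in the Prometheus text exposition format. This is not how longhorn-manager (written in Go) implements it. The label names volume and node, the help text, and the sample value are assumptions for illustration only.

```python
def escape_label_value(value):
    # The text format requires backslash, double quote, and newline to be escaped.
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render a gauge as text; samples is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical volume and node names.
text = render_gauge(
    "longhorn_volume_capacity_bytes",
    "Configured size of the volume in bytes",
    [({"volume": "pvc-example", "node": "node-1"}, 1073741824)],
)
print(text)
```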

Because we are restructuring Longhorn's backup mechanism and we want to have the cleanest implementation for Prometheus metrics, we have decided to push back the following metrics to the v1.2.0 release:

longhorn_backup_stats_number_failed_backups
longhorn_backup_stats_number_succeed_backups
longhorn_backup_stats_backup_status (status for this backup (0=InProgress,1=Done,2=Failed))
longhorn_volume_iops
longhorn_volume_readthroughput
longhorn_volume_writethroughput
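
The numeric encoding of longhorn_backup_stats_backup_status listed above could be decoded on the consumer side roughly as follows. This is a small Python sketch; the class and function names are hypothetical, and only the 0/1/2 mapping comes from the list above.

```python
from enum import IntEnum


class BackupStatus(IntEnum):
    # Mapping taken from the metric description: 0=InProgress, 1=Done, 2=Failed.
    IN_PROGRESS = 0
    DONE = 1
    FAILED = 2


def describe_backup_status(sample_value):
    """Translate a scraped sample value (a float in Prometheus) into a status name."""
    return BackupStatus(int(sample_value)).name


print(describe_backup_status(2.0))
```

Any value outside 0 to 2 raises ValueError, which makes an unexpected encoding visible instead of silently mislabeling it.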

PhanLe1010 on 16 Sep 2020

👍2

Basic verifying/testing steps for QA:

Create a Prometheus-Alertmanager-Grafana system using the Prometheus Operator.
Navigate to the Prometheus server's web UI. Verify that Prometheus successfully discovers all longhorn manager targets.
Verify that Longhorn correctly exposes the above metrics.
Deploy workloads that use Longhorn volumes into the cluster. Verify that there is no abnormal data, e.g. volume capacity is 0, CPU usage is over 4000 millicpu, etc.

Attach a volume. Detach the volume. Verify that the volume's information is reported by at most 1 longhorn-manager at any time.
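
The last two checks can be partly automated. The following Python sketch assumes the scrapes from each longhorn-manager target have already been parsed into (metric_name, labels, value) tuples. The pod names, the volume label, and the sample values are hypothetical; the thresholds come from the steps above.

```python
from collections import defaultdict


def volumes_reported_by_multiple_managers(scrapes):
    """scrapes maps a manager name to a list of (metric_name, labels, value)."""
    reporters = defaultdict(set)
    for manager, samples in scrapes.items():
        for name, labels, _value in samples:
            if name.startswith("longhorn_volume_") and "volume" in labels:
                reporters[labels["volume"]].add(manager)
    # Keep only volumes that more than one manager reports at the same time.
    return {vol: sorted(mgrs) for vol, mgrs in reporters.items() if len(mgrs) > 1}


def abnormal_samples(scrapes, max_millicpu=4000):
    """Flag zero volume capacity and CPU usage above max_millicpu."""
    flagged = []
    for manager, samples in scrapes.items():
        for name, _labels, value in samples:
            if name == "longhorn_volume_capacity_bytes" and value == 0:
                flagged.append((manager, name, value))
            elif name.endswith("_cpu_usage_millicpu") and value > max_millicpu:
                flagged.append((manager, name, value))
    return flagged


# Hypothetical scrape results from two manager pods.
scrapes = {
    "longhorn-manager-a": [
        ("longhorn_volume_capacity_bytes", {"volume": "vol-1"}, 1073741824),
        ("longhorn_manager_cpu_usage_millicpu", {}, 120),
    ],
    "longhorn-manager-b": [
        ("longhorn_volume_capacity_bytes", {"volume": "vol-1"}, 1073741824),
        ("longhorn_volume_capacity_bytes", {"volume": "vol-2"}, 0),
        ("longhorn_instance_manager_cpu_usage_millicpu", {}, 4500),
    ],
}
print(volumes_reported_by_multiple_managers(scrapes))
print(abnormal_samples(scrapes))
```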

PhanLe1010 on 16 Sep 2020

Steps for testing with the Rancher monitoring system, Prometheus Alertmanager, and Grafana are in the longhorn-tests PR https://github.com/longhorn/longhorn-tests/pull/416/files and the document PR: longhorn/website#195

PhanLe1010 on 16 Sep 2020

👍1

Hi @PhanLe1010 Are there any plans to name these metrics based on a standard we see across other CSI drivers like rook-ceph, openebs, nfs-client, heketi and others?

e.g...

kubelet_volume_stats_available_bytes
kubelet_volume_stats_used_bytes
kubelet_volume_stats_inodes
kubelet_volume_stats_inodes_free
kubelet_volume_stats_inodes_used

Using these metric names would integrate with the built in prometheus-operator alerts and grafana dashboards.

Edit: This looks like it's up for part 4, would be cool if you considered this. Thanks!

onedr0p on 23 Sep 2020

👍5

@onedr0p Those are named kubelet_volume_stats_*, are those coming from Kubelet? Can you give us the reference documents for those metrics names?

At first glance, it seems that these metrics describe the filesystem inside the Longhorn raw block device. Is that correct? Right now, we are building metrics for the Longhorn block devices only. Longhorn doesn't manage the filesystem inside the block devices, so getting filesystem information will require a bit more work. We are planning to expose filesystem metrics in the future but not in the coming v1.1.0 release, unfortunately.

By "the built-in prometheus-operator alerts and grafana dashboards", do you mean the existing Prometheus server, AlertManager, and Grafana inside the cluster?

PhanLe1010 on 23 Sep 2020

@PhanLe1010 from what I can tell from searching the internet is that the CSI drivers need to implement these metrics. Although I'm having a hard time finding much documentation on it. There's this issue and the linked issues/PRs which might contain more information. I know openebs and rook-ceph implement this, so it might be worth checking with them too. If anyone has any more information on this that would be helpful. If not I'll continue to dig around.

The prometheus-operator/kube-prometheus helm chart comes installed with default grafana dashboards and alertmanager rules which require no interaction from the user to install and configure. Having the longhorn metrics named and labeled like I mentioned above would make it so I do not need to maintain another dashboard or alerting rules.

Edit: Here is what I found browsing thru the Kubernetes source https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_metrics.go#L65

Edit 2: Here looks to be the discussion of the documentation https://github.com/container-storage-interface/spec/issues/253 which led me to https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetvolumestats

Edit 3: Here is a PR with enabling these stats https://github.com/digitalocean/csi-digitalocean/pull/197 and another one https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/524

I hope those additional links clear it up. Thanks!

onedr0p on 23 Sep 2020

👍1

@PhanLe1010 Do you think it would be better to open a new issue with my findings to have longhorn support kubelet_volume_stats_*? I do think this is what the community wanted out of this issue.

The additional metrics you are providing here are very helpful, don't get me wrong.

onedr0p on 23 Sep 2020

👍1

The prometheus-operator/kube-prometheus-stack Helm chart already includes dashboards (https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/templates/grafana/dashboards/persistentvolumesusage.yaml) and alerts (https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/templates/prometheus/rules/kubernetes-storage.yaml) that use the kubelet_volume_stats_* metrics.
This chart including the alerts and dashboards is also used in monitoring v2 of Rancher 2.5 (https://github.com/rancher/charts/blob/dev-v2.5-source/packages/rancher-monitoring/package.yaml).

So it would be nice, if Longhorn also provided these metrics.

bashofmann on 24 Sep 2020

👍3

@onedr0p @bashofmann
Thank you so much for your suggestions and super helpful reference documents!

Yes, we will definitely support and provide these kubelet_volume_stats_* metrics in the future:

kubelet_volume_stats_available_bytes
kubelet_volume_stats_used_bytes
kubelet_volume_stats_inodes
kubelet_volume_stats_inodes_free
kubelet_volume_stats_inodes_used

As you already know, these metrics will be provided by Kubelet (which in turn queries the Longhorn CSI driver plugin), and they measure filesystem-related information inside a Longhorn block device. The current longhorn_volume_* metrics are exposed via Longhorn manager pods and measure information specific to the Longhorn block device. Therefore, we will add the kubelet_volume_stats_* metrics but not rename the existing longhorn_volume_* metrics to the kubelet_volume_stats_* format.
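
To illustrate what "filesystem-related information" means here, the Python sketch below derives the same kind of numbers from a mounted filesystem with os.statvfs (Unix only). It is not the Longhorn CSI plugin's implementation, which is written in Go; the function name is hypothetical, and the mapping of statvfs fields to metric names is an assumption about how a CSI node plugin would typically answer NodeGetVolumeStats.

```python
import os


def volume_stats(mount_path):
    """Collect byte and inode usage for the filesystem mounted at mount_path."""
    st = os.statvfs(mount_path)
    block_size = st.f_frsize  # fragment size, the unit for the block counts
    return {
        "kubelet_volume_stats_capacity_bytes": st.f_blocks * block_size,
        # f_bavail counts blocks available to unprivileged users.
        "kubelet_volume_stats_available_bytes": st.f_bavail * block_size,
        "kubelet_volume_stats_used_bytes": (st.f_blocks - st.f_bfree) * block_size,
        "kubelet_volume_stats_inodes": st.f_files,
        "kubelet_volume_stats_inodes_free": st.f_ffree,
        "kubelet_volume_stats_inodes_used": st.f_files - st.f_ffree,
    }


# "/" stands in for the path where a volume's filesystem would be mounted.
stats = volume_stats("/")
for name, value in stats.items():
    print(name, value)
```

None of these numbers exist for a raw block device, which is why they have to come from the mounted filesystem rather than from the longhorn_volume_* metrics.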

@onedr0p You are very welcome to create a new issue with your findings to track the progress of the kubelet_volume_stats_* metrics implementation! Thank you again for your contribution!

Update: Longhorn will support those metrics in the next release (v1.1.0). The feature is tracked at longhorn/longhorn#1821

PhanLe1010 on 24 Sep 2020

👍3

Verified with longhorn-master 09/30/2020

Validation - Pass