Couchbase Plugin Enhancement to include key metrics to monitor couchbase bucket performance.
A few of the key metrics missing from the default Couchbase plugin are
(Full list of requested metrics for Couchbase plugin - _edited Oct 27, 2020 by @sjwang90_):
vb_active_resident_items_ratio
ep_queue_size
ep_cache_miss_rate
ep_tmp_oom_errors
ep_dcp_xdcr_items_remaining
couch_docs_fragmentation
query-requests_1000ms
gets per second
sets per second
deletes per second
active resident ratio
inbound xdcr ops/sec
outbound xdcr mutations
docs fragmentation %
connections
cluster node count
couch_docs_actual_disk_size
ep_cache_miss_rate
couch_docs_fragmentation
ep_bg_fetched
vb_active_resident_items_ratio
ep_queue_size
vb_num_eject_replicas
ep_warmup_value_count
vb_active_curr_items
ep_cache_miss_rate
couch_docs_fragmentation
ep_queue_size
vb_active_resident_items_ratio
curr_connections
curr_items_tot
ep_bg_fetched
ep_diskqueue_drain
ep_diskqueue_fill
vb_replica_eject
ep_oom_errors
ep_queue_size
ep_tmp_oom_errors
[etc...]
References:
https://blog.couchbase.com/monitoring-couchbase-cluster/
Only a handful of metrics are being monitored currently.
https://github.com/influxdata/telegraf/tree/master/plugins/inputs/couchbase
Fields:
quota_percent_used (unit: percent, example: 68.85424936294555)
ops_per_sec (unit: count, example: 5686.789686789687)
disk_fetches (unit: count, example: 0.0)
item_count (unit: count, example: 943239752.0)
disk_used (unit: bytes, example: 409178772321.0)
data_used (unit: bytes, example: 212179309111.0)
mem_used (unit: bytes, example: 202156957464.0)
Additional metrics are needed to monitor Couchbase effectively.
There are heavy users of Couchbase, and their number is growing rapidly. It would be very helpful if we could improve on this.
@DharanDP thanks for reporting. We have a lot of items to tackle, so we might not get to this for a while. It seems like it would be pretty easy to add them here: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/couchbase/couchbase.go#L92
Interested in taking a crack at it and submitting a PR?
I'll give this a try.
Update 1: This is more complex than it looks. Couchbase spreads metrics over time: the API returns 60 data points per metric rather than one, one point per second over a minute. Depending on the metric, we need to average the points in some cases and sum them in others. Although the API response interval can be configured with a zoom parameter, this gets tricky because Telegraf's global interval configuration can differ from the zoom configuration. I'm dropping this for now.
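To make the averaging-versus-summing problem concrete, here is a minimal sketch in Go of collapsing a window of per-second samples into a single value per metric. It is not the plugin's code. The names Aggregation, aggregationFor, Collapse and CollapseAll are hypothetical, and the choice of which metric is averaged and which is summed is an assumption made only for illustration. It also does not address the mismatch between the zoom window and Telegraf's interval.

```go
package main

import "fmt"

// Aggregation names how a window of samples collapses into one value.
type Aggregation int

const (
	Average Aggregation = iota // e.g. gauges and ratios
	Sum                        // e.g. event counts
)

// aggregationFor is a hypothetical, illustrative mapping. Which real
// Couchbase stats should be averaged or summed is an assumption here.
var aggregationFor = map[string]Aggregation{
	"ep_queue_size":     Average,
	"ep_tmp_oom_errors": Sum,
}

// Collapse reduces one metric's samples to a single value.
// It reports false when there are no samples to aggregate.
func Collapse(samples []float64, agg Aggregation) (float64, bool) {
	if len(samples) == 0 {
		return 0, false
	}
	total := 0.0
	for _, s := range samples {
		total += s
	}
	if agg == Average {
		return total / float64(len(samples)), true
	}
	return total, true
}

// CollapseAll aggregates every metric that has a known aggregation rule
// and silently skips metrics that are unknown or have no samples.
func CollapseAll(stats map[string][]float64) map[string]float64 {
	out := make(map[string]float64, len(stats))
	for name, samples := range stats {
		agg, known := aggregationFor[name]
		if !known {
			continue
		}
		if v, ok := Collapse(samples, agg); ok {
			out[name] = v
		}
	}
	return out
}

func main() {
	// Three samples per metric stand in for the 60 the API would return.
	stats := map[string][]float64{
		"ep_queue_size":     {10, 20, 30},
		"ep_tmp_oom_errors": {0, 1, 2},
	}
	fmt.Println(CollapseAll(stats))
}
```

Keeping the rule in a per-metric table means each new field needs an explicit decision on how it is aggregated, which is the part that makes adding these metrics less trivial than it first appears.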
Extend fields for Couchbase Input.
Based on the blogpost by Couchbase for monitoring nodes/clusters: https://blog.couchbase.com/monitoring-couchbase-cluster/
I would like to ask if we could extend the fields to cover what Couchbase recommends monitoring.
Only a few fields are available currently:
memory_total
quota_percent_used
ops_per_sec
disk_fetches
item_count
disk_used
data_used
mem_used
Recommended to monitor by Couchbase
current:
memory_free
memory_total
quota_percent_used
ops_per_sec
disk_fetches
item_count
disk_used
data_used
mem_used
+ Extensions:
couch_docs_actual_disk_size
ep_cache_miss_rate
couch_docs_fragmentation
ep_bg_fetched
vb_active_resident_items_ratio
ep_queue_size
vb_num_eject_replicas
ep_warmup_value_count
vb_active_curr_items
ep_cache_miss_rate
couch_docs_fragmentation
ep_queue_size
vb_active_resident_items_ratio
curr_connections
curr_items_tot
ep_bg_fetched
ep_diskqueue_drain
ep_diskqueue_fill
vb_replica_eject
ep_oom_errors
ep_queue_size
ep_tmp_oom_errors
[etc...]
Could someone tell me if we can hope/expect any progress here in the near future?
This isn't very high on my list of issues, but I could help if someone from the community is willing to do the work.
I'm interested in helping with telegraf and would like to take a look at this.
It looks like the plugin is currently using an older unofficial library for couchbase. I haven't been able to find anything showing that the older library supports these additional stats.
There is a newer official library that might be needed to get the additional stats.
Thanks for the help @nwneisen. Long ago I tried to switch to gocb, in issue #2418, but ran into some issues with supporting our current set of metrics. The upstream issue is still not closed, but maybe it is actually fixed and just not marked? Would be much appreciated if you could check.
Thanks for the info @danielnelson. I'll take a look.
@danielnelson The issue is still unresolved.
@danielnelson Any progress here?
FYI @ssoroka
Looks like the upstream issue was resolved. If @nwneisen or anyone else is interested in working on #2418 first, then getting these additional metrics into the plugin can be done hopefully pretty seamlessly.