Given a large enough number of Logstash Pipelines, I have run into a situation where the Logstash Panel does not appear under the Deployment/Cluster overview because Elasticsearch is rejecting the Logstash search due to too many buckets.
I saw this while running v7.5.2 with:
For anyone running into this situation, there are at least three workarounds:
- Increase the search.max_buckets soft limit in the cluster settings of the Monitoring cluster that contains the monitoring indices. This can be done dynamically via _cluster/settings, and it defaults to 10000 buckets. Do this with caution, because the soft limit exists to limit memory usage in Elasticsearch (see the example request below).
- Build a URL with the cluster_uuid in it (using either approach above) and navigate to it directly.

@igoristic Is this something you can look into?
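A minimal sketch of the first workaround (the 20000 here is only an example value; raise the limit no further than the Monitoring cluster's memory can tolerate):

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 20000
  }
}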
@chrisronline @igoristic What if you changed the groupBy terms agg to a composite agg and then paginated through to collect the results? That would fix the max_buckets issue. You would end up with multiple round trips to ES (from the server), but it would also be more resilient for larger clusters (make it slow :D). Correct me if I'm wrong, but if you fixed this in getSeries() it would also fix it in other places too.
And by paginating through the results, I mean I would keep the behavior of getSeries where it returns everything; I would just make the underlying implementation collect everything with the composite agg.
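Roughly what that pagination looks like at the query level (just a sketch; the index pattern, agg name, interval, and the after value are placeholders): run the composite query once, read aggregations.check.after_key from the response, and repeat the request with that value in after until no after_key is returned.

GET .monitoring-logstash-7-*/_search
{
  "size": 0,
  "aggs": {
    "check": {
      "composite": {
        "size": 1000,
        "after": { "timestamp": 1582258220000 },
        "sources": [
          {
            "timestamp": {
              "date_histogram": {
                "field": "logstash_stats.timestamp",
                "fixed_interval": "30s"
              }
            }
          }
        ]
      }
    }
  }
}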
We were accidentally getting all the pipelines on the Overview page just to see if there is a single bucket (to decide whether to show Logstash stats). And, since this method had a bug that fetched all nodesCountMetric pipelines for every throughputMetric pipeline, we were essentially doing O(N²) * 2.
@pickypg I stress tested this with 100 generator pipelines, which did not cause any max buckets errors, and the JVM spikes seem to be significantly lower. But I would like to know how it behaves in your environment.
Thanks to @simianhacker's suggestion, I investigated using composite queries, which sped up the query by about 15%. The composite query for node count looks something like this:
GET *:.monitoring-logstash-6-*,*:.monitoring-logstash-7-*,*:monitoring-logstash-7-*,*:monitoring-logstash-8-*,.monitoring-logstash-6-*,.monitoring-logstash-7-*,monitoring-logstash-7-*,monitoring-logstash-8-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "cluster_uuid": "So2SpBkMT-yvN311fn8q3A"
          }
        },
        {
          "range": {
            "logstash_stats.timestamp": {
              "format": "epoch_millis",
              "gte": 1582256420161,
              "lte": 1582260020161
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "check": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "timestamp": {
              "date_histogram": {
                "field": "logstash_stats.timestamp",
                "fixed_interval": "30s"
              }
            }
          }
        ]
      },
      "aggs": {
        "pipelines_nested": {
          "nested": {
            "path": "logstash_stats.pipelines"
          },
          "aggs": {
            "by_pipeline_id": {
              "terms": {
                "field": "logstash_stats.pipelines.id",
                "include": ["random_00", "random_01", "random_02", "random_03", "random_04"],
                "size": 1000
              },
              "aggs": {
                "to_root": {
                  "reverse_nested": {},
                  "aggs": {
                    "node_count": {
                      "cardinality": {
                        "field": "logstash_stats.logstash.uuid"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
However, implementing this in get_series.js was getting a bit too complex, since the histogram/series logic is tightly coupled with other metrics that use the helpers in classes.js. I think we should still revisit the composite approach, though, which is also mentioned here: https://github.com/elastic/kibana/issues/36358
I also played around with auto_date_histogram, but apparently it too can throw a max buckets error if the bucket budget isn't allocated properly. The calculation is something like bucket_size * agg_size, and in reality it will be more, since terms collects up to shard_size per shard (which defaults to size * 1.5 + 10).
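To put rough numbers on that formula: a one-hour window at a 30s interval gives 120 date buckets, and a terms agg sized for the 100 generator pipelines underneath makes that 120 * 100 = 12,000 buckets; since each shard can collect up to shard_size = 100 * 1.5 + 10 = 160 terms, the worst case is closer to 120 * 160 = 19,200. Either way it is past the default search.max_buckets of 10000.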