Kibana: [Stack Monitoring] Logstash Overview Panel missing due to max_buckets

Created on 31 Jan 2020 · 3 comments · Source: elastic/kibana

Given a large enough number of Logstash pipelines, I have run into a situation where the Logstash panel does not appear under the Deployment/Cluster overview because Elasticsearch rejects the Logstash search due to too many buckets.

I saw this while running v7.5.2 with:

  • 8 Logstash nodes
  • 97 Logstash pipelines

Workaround

For anyone running into this situation, there are at least three workarounds:

  1. Increase the time interval, which reduces the amount of data collected by the background query.
  2. Increase the search.max_buckets soft limit in the cluster settings of the Monitoring cluster that contains the monitoring indices. This can be changed dynamically via the _cluster/settings API and defaults to 10000 buckets (see the example request after this list). Do this with caution, because the soft limit exists to limit memory usage in Elasticsearch.
  3. Find the URL with the cluster_uuid in it (using either approach above) and navigate to it directly.
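For reference, raising the soft limit from workaround 2 on the Monitoring cluster looks something like the request below (the 20000 value is only an example; pick a limit appropriate for your heap):

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 20000
  }
}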
Labels: Stack Monitoring, bug

All 3 comments

@igoristic Is this something you can look into?

@chrisronline @igoristic What if you changed the groupBy terms agg to a composite agg and then paginated through to collect the results? That would fix the max_buckets issue. You would end up with multiple round trips to ES (from the server), but it would also be more resilient for larger clusters (make it slow :D). Correct me if I'm wrong, but if you fixed this in getSeries() it would also fix it in other places too.

And by paginating through the results, I mean keeping the behavior of getSeries where it returns everything; I would just make the underlying implementation collect everything with the composite agg.
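A rough sketch of what that pagination could look like against the monitoring indices (the index pattern, 30s interval, and 1000-bucket page size are only illustrative, not what getSeries currently does):

GET .monitoring-logstash-7-*/_search
{
  "size": 0,
  "aggs": {
    "check": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "timestamp": {
              "date_histogram": {
                "field": "logstash_stats.timestamp",
                "fixed_interval": "30s"
              }
            }
          }
        ]
      }
    }
  }
}

Each response carries an after_key; passing it back in the next request's "after" clause fetches the next page, and the server keeps looping until no after_key is returned, so no single response ever has to hold more than one page of composite buckets (the after value below stands in for the after_key from the previous response):

GET .monitoring-logstash-7-*/_search
{
  "size": 0,
  "aggs": {
    "check": {
      "composite": {
        "size": 1000,
        "after": { "timestamp": 1582256450161 },
        "sources": [
          {
            "timestamp": {
              "date_histogram": {
                "field": "logstash_stats.timestamp",
                "fixed_interval": "30s"
              }
            }
          }
        ]
      }
    }
  }
}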

We were accidentally getting all the pipelines on the Overview page just to check whether there is a single bucket (to decide whether to show Logstash stats). And, since this method had a bug that fetched all nodesCountMetric pipelines for every throughputMetric pipeline, we were essentially doing O(N²) * 2 work.

@pickypg I stress tested this with 100 generator pipelines, which did not cause any max buckets errors, and the JVM spikes seem to be significantly lower. But I would like to know how it behaves in your environment.

Thanks to @simianhacker's suggestion I did investigate using composite queries, which sped up the query by about 15%. The composite query for node count looks something like this:

GET *:.monitoring-logstash-6-*,*:.monitoring-logstash-7-*,*:monitoring-logstash-7-*,*:monitoring-logstash-8-*,.monitoring-logstash-6-*,.monitoring-logstash-7-*,monitoring-logstash-7-*,monitoring-logstash-8-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "cluster_uuid": "So2SpBkMT-yvN311fn8q3A"
          }
        },
        {
          "range": {
            "logstash_stats.timestamp": {
              "format": "epoch_millis",
              "gte": 1582256420161,
              "lte": 1582260020161
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "check": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "timestamp": {
              "date_histogram": {
                "field": "logstash_stats.timestamp",
                "fixed_interval": "30s"
              }
            }
          }
        ]
      },
      "aggs": {
        "pipelines_nested": {
          "nested": {
            "path": "logstash_stats.pipelines"
          },
          "aggs": {
            "by_pipeline_id": {
              "terms": {
                "field": "logstash_stats.pipelines.id",
                "include": ["random_00", "random_01", "random_02", "random_03", "random_04"],
                "size": 1000
              },
              "aggs": {
                "to_root": {
                  "reverse_nested": {},
                  "aggs": {
                    "node_count": {
                      "cardinality": {
                        "field": "logstash_stats.logstash.uuid"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

However, implementing this in get_series.js was getting a bit too complex, since the histogram/series logic is tightly coupled with other metrics that use the helpers in classes.js. I still think we should revisit the composite approach, though, which is also mentioned here: https://github.com/elastic/kibana/issues/36358

I also played around with auto_date_histogram, but apparently it too can throw a max buckets error if the bucket budget isn't allocated properly. The calculation is roughly bucket_size * agg_size, and in practice it will be more, since terms collects up to shard_size buckets (which is about 1.5 * size).
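For illustration, an auto_date_histogram variant of the check might look like the sketch below (the 200-bucket target and index pattern are just examples); the nested terms agg underneath is still what multiplies the bucket count, which is why the soft limit can still be hit:

GET .monitoring-logstash-7-*/_search
{
  "size": 0,
  "aggs": {
    "check": {
      "auto_date_histogram": {
        "field": "logstash_stats.timestamp",
        "buckets": 200
      },
      "aggs": {
        "pipelines_nested": {
          "nested": {
            "path": "logstash_stats.pipelines"
          },
          "aggs": {
            "by_pipeline_id": {
              "terms": {
                "field": "logstash_stats.pipelines.id",
                "size": 1000
              }
            }
          }
        }
      }
    }
  }
}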
