Kibana version: 7.0.1
Elasticsearch version: 7.0.1
`get_nodes` creates `date_histogram` aggregations nested under a `terms` aggregation that partitions by node. A user reported that this causes issues due to Elasticsearch's `search.max_buckets` limit, which defaults to 10,000 since 7.0; see elastic/elasticsearch#42001.
The number of buckets is on the order of `num_nodes * range_width / interval`. Given the default interval of 10s and the default range of 1h, this means that the aggregation would fail if the monitored cluster has 28 nodes or more.
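For reference, a quick back-of-the-envelope sketch of that math, using the defaults mentioned above (illustrative only):

```ts
// Rough check of the bucket count described in this issue.
const intervalSeconds = 10;            // default date_histogram interval
const rangeSeconds = 60 * 60;          // default 1h time filter
const maxBuckets = 10_000;             // search.max_buckets default since 7.0

const bucketsPerNode = rangeSeconds / intervalSeconds;       // 360
const maxNodes = Math.floor(maxBuckets / bucketsPerNode);    // 27 -> fails at 28 nodes or more

console.log({ bucketsPerNode, maxNodes });
```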
I'm happy to discuss options to address this issue.
Pinging @elastic/stack-monitoring
A few points:
1) Given that we can pinpoint a hard limit on cluster size at which monitoring does not seem to function correctly, is it worth updating our documentation to make users aware of this limit? (cc: @lcawl and @yaronp68)
2) @jpountz I'd be very interested to hear more about options you see as being viable for addressing this issue. Would you mind outlining a few of them here?
Some ideas:
- Collect nodes first, and then run the `date_histogram` agg for a subset of the (e.g. 10) nodes at once.
- Run the aggregation via a `composite` aggregation in order to be able to paginate through results (see the sketch after this list).
- Scale the `date_histogram` interval based on the width of the date filter. For instance with a width of 1h, could we go with an interval of 1 minute instead of 10s? This would raise the number of nodes that you need to make it fail to about 166 if I'm not mistaken.
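To make the `composite` idea a bit more concrete, here is a rough sketch of what a paginated request body could look like. The field names (`source_node.uuid`, `timestamp`) and the page size are assumptions for illustration, not the actual monitoring query:

```ts
// Hypothetical sketch: page through (node, time bucket) pairs with a composite
// aggregation instead of nesting date_histogram under terms.
const body = {
  size: 0,
  aggs: {
    nodes_over_time: {
      composite: {
        size: 1000, // buckets per page, kept well under search.max_buckets
        sources: [
          { node: { terms: { field: 'source_node.uuid' } } },
          { time: { date_histogram: { field: 'timestamp', interval: '10s' } } },
        ],
        // "after": <last bucket key from the previous page> on subsequent requests
      },
      aggs: {
        // per-bucket metric sub-aggregations would go here
      },
    },
  },
};
```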
We're using the same map (from time filter range -> date_histogram interval) as the rest of Kibana for consistency. If we decide to change this map to bump up the date_histogram intervals, we'll probably want to do the same with the rest of Kibana as well, else the user will get an inconsistent UX with timeseries charts and the time picker across Kibana.
> Collect nodes first, and then run the `date_histogram` agg for a subset of the (e.g. 10) nodes at once.
++ to this approach. This is essentially the "server side pagination" that we've been talking about amongst the team.
> Run the aggregation via a `composite` aggregation in order to be able to paginate through results.
++ to this as well. See https://github.com/elastic/kibana/issues/36358.
> Collect nodes first, and then run the `date_histogram` agg for a subset of the (e.g. 10) nodes at once.
Some clusters have many tens or even hundreds of nodes, so maybe the default should be the 10 slowest indexing nodes, with the ability for the user to select a different filter. Typically the user doesn't benefit from seeing all nodes in a large cluster.
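As a rough illustration of that idea (the field names and the metric used for ordering are hypothetical, not the actual monitoring mapping):

```ts
// Hypothetical sketch: only chart the top 10 nodes by an indexing-load metric
// by default, and let the user switch the ordering/filter in the UI.
const topNodesAgg = {
  nodes: {
    terms: {
      field: 'source_node.uuid',
      size: 10, // cap the number of node series rendered by default
      order: { index_time_total: 'desc' },
    },
    aggs: {
      index_time_total: { max: { field: 'node_stats.indices.indexing.index_time_in_millis' } },
      over_time: {
        date_histogram: { field: 'timestamp', interval: '10s' },
        // per-bucket metrics here
      },
    },
  },
};
```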
More users are running into this issue as they upgrade to 7.x. Limiting `search.max_buckets` has become a breaking change for large clusters with monitoring enabled. Can we get a fix, or get it documented as a known issue or limitation?
@inqueue To try to address this, I've moved https://github.com/elastic/kibana/issues/36358 forward in our roadmap.
@pickypg and I chatted about a discuss post and concluded that we can make a quick improvement here.
Currently, we are searching more buckets than we previously thought. It's actually not `num_nodes * range_width / interval`, but `num_nodes * 6 * range_width / interval`.
This is because the query used to generate the nodes in the nodes listing page actually performs multiple `date_histogram` aggregations. There is no reason we need to do this - we can add a `date_histogram` sub-aggregation directly under the `terms` agg, then remove all the existing `date_histogram` aggregations.
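A rough sketch of that restructuring, assuming illustrative field names rather than the actual `get_nodes` query:

```ts
// Before: several sibling date_histogram aggs per node (roughly 6x the buckets).
// After (sketch): a single date_histogram directly under the terms agg, with the
// metrics moved into sub-aggregations of the shared histogram.
const nodesAgg = {
  nodes: {
    terms: { field: 'source_node.uuid' },
    aggs: {
      by_time: {
        date_histogram: { field: 'timestamp', interval: '10s' },
        aggs: {
          cpu: { max: { field: 'node_stats.process.cpu.percent' } },
          load: { max: { field: 'node_stats.os.cpu.load_average.1m' } },
          // ...other metrics that previously each had their own date_histogram
        },
      },
    },
  },
};
```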
This is a quick win and, while it isn't the long-term fix, it will help a good chunk of users in the short term.
Thoughts?
FYI: the band-aid fix described above has been merged and will be available to folks soon.
Quick update here.
This PR should be a long-term, scalable solution for this problem. I'll close out this issue once that is merged.