Kibana version: 7.0.1
Elasticsearch version: 7.0.1
`get_nodes` creates `date_histogram` aggregations nested under a `terms` aggregation that partitions by node. A user reported that this causes issues due to Elasticsearch's `search.max_buckets` limit, which defaults to 10,000 since 7.0; see elastic/elasticsearch#42001.
The number of buckets is on the order of `num_nodes * range_width / interval`. Given the default interval of 10s and the default range of 1h, this means that the aggregation would fail if the monitored cluster has 28 nodes or more.
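For reference, a quick back-of-the-envelope sketch of that math, using the defaults mentioned above (illustrative only):

```ts
// Rough check of the bucket count described in this issue.
const intervalSeconds = 10;            // default date_histogram interval
const rangeSeconds = 60 * 60;          // default 1h time filter
const maxBuckets = 10_000;             // search.max_buckets default since 7.0

const bucketsPerNode = rangeSeconds / intervalSeconds;       // 360
const maxNodes = Math.floor(maxBuckets / bucketsPerNode);    // 27 -> fails at 28 nodes or more

console.log({ bucketsPerNode, maxNodes });
```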
I'm happy to discuss options to address this issue.
Pinging @elastic/stack-monitoring
A few points:
1) Given that we can pinpoint a hard limit on cluster size at which monitoring does not seem to function correctly, is it worth updating our documentation to make users aware of this limit? (cc: @lcawl and @yaronp68)
2) @jpountz I'd be very interested to hear more about options you see as being viable for addressing this issue. Would you mind outlining a few of them here?
Some ideas:
- Collect nodes first, and then run the `date_histogram` agg for a subset of the (e.g. 10) nodes at once.
- Run the aggregation via a `composite` aggregation in order to be able to paginate through results (see the sketch after this list).
- Scale the `date_histogram` interval based on the width of the date filter. For instance with a width of 1h, could we go with an interval of 1 minute instead of 10s? This would raise the number of nodes that you need to make it fail to about 166 if I'm not mistaken.
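To make the `composite` idea a bit more concrete, here is a rough sketch of what a paginated request body could look like. The field names (`source_node.uuid`, `timestamp`) and the page size are assumptions for illustration, not the actual monitoring query:

```ts
// Hypothetical sketch: page through (node, time bucket) pairs with a composite
// aggregation instead of nesting date_histogram under terms.
const body = {
  size: 0,
  aggs: {
    nodes_over_time: {
      composite: {
        size: 1000, // buckets per page, kept well under search.max_buckets
        sources: [
          { node: { terms: { field: 'source_node.uuid' } } },
          { time: { date_histogram: { field: 'timestamp', interval: '10s' } } },
        ],
        // "after": <last bucket key from the previous page> on subsequent requests
      },
      aggs: {
        // per-bucket metric sub-aggregations would go here
      },
    },
  },
};
```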
We're using the same map (from time filter range -> date_histogram interval) as the rest of Kibana for consistency. If we decide to change this map to bump up the date_histogram intervals, we'll probably want to do the same with the rest of Kibana as well, else the user will get an inconsistent UX with timeseries charts and the time picker across Kibana.
> Collect nodes first, and then run the `date_histogram` agg for a subset of the (e.g. 10) nodes at once.
++ to this approach. This is essentially the "server side pagination" that we've been talking about amongst the team.
> Run the aggregation via a `composite` aggregation in order to be able to paginate through results.
++ to this as well. See https://github.com/elastic/kibana/issues/36358.
> Collect nodes first, and then run the `date_histogram` agg for a subset of the (e.g. 10) nodes at once.
Some clusters have many tens or even hundreds of nodes, so maybe the default should be the 10 slowest indexing nodes, with the ability for the user to select a different filter. Typically the user doesn't benefit from seeing all nodes in a large cluster.
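As a rough illustration of that idea (the field names and the metric used for ordering are hypothetical, not the actual monitoring mapping):

```ts
// Hypothetical sketch: only chart the top 10 nodes by an indexing-load metric
// by default, and let the user switch the ordering/filter in the UI.
const topNodesAgg = {
  nodes: {
    terms: {
      field: 'source_node.uuid',
      size: 10, // cap the number of node series rendered by default
      order: { index_time_total: 'desc' },
    },
    aggs: {
      index_time_total: { max: { field: 'node_stats.indices.indexing.index_time_in_millis' } },
      over_time: {
        date_histogram: { field: 'timestamp', interval: '10s' },
        // per-bucket metrics here
      },
    },
  },
};
```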
More users are running into this issue as they upgrade to 7.x. Limiting `search.max_buckets` has become a breaking change for large clusters with monitoring enabled. Can we get a fix, or get it documented as a known issue or limitation?
@inqueue To try to address this, I've moved https://github.com/elastic/kibana/issues/36358 forward in our roadmap.
@pickypg and I chatted about a discuss post and concluded that we can make a quick improvement here.
Currently, we are searching more buckets than we previously thought. It's actually not `num_nodes * range_width / interval`, but `num_nodes * 6 * range_width / interval`.
This is because the query used to generate the nodes in the nodes listing page actually performs multiple `date_histogram` aggregations. There is no reason we need to do this - we can add a `date_histogram` sub-aggregation directly under the `terms` agg, then remove all the existing `date_histogram` aggregations.
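A rough sketch of that restructuring, assuming illustrative field names rather than the actual `get_nodes` query:

```ts
// Before: several sibling date_histogram aggs per node (roughly 6x the buckets).
// After (sketch): a single date_histogram directly under the terms agg, with the
// metrics moved into sub-aggregations of the shared histogram.
const nodesAgg = {
  nodes: {
    terms: { field: 'source_node.uuid' },
    aggs: {
      by_time: {
        date_histogram: { field: 'timestamp', interval: '10s' },
        aggs: {
          cpu: { max: { field: 'node_stats.process.cpu.percent' } },
          load: { max: { field: 'node_stats.os.cpu.load_average.1m' } },
          // ...other metrics that previously each had their own date_histogram
        },
      },
    },
  },
};
```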
This is a quick win and, while it isn't the long-term fix, it will help a good chunk of users in the short term.
Thoughts?
FYI: the band-aid fix described above has been merged and will be available to folks soon.
Quick update here.
This PR should be a long-term, scalable solution for this problem. I'll close out this issue once that is merged.