Kibana: KQL autocomplete value suggestions can take a long time

Created on 18 Sep 2019 · 26 comments · Source: elastic/kibana

Currently, KQL autocomplete value suggestions can be slow to appear, especially when there is a lot of data to query for possible values. This makes autocomplete really slow and frustrating to use.
One possible reason for the slowness (there may be others) is that autocomplete doesn't take into account the time range the user is looking at, and suggests all possible values instead.

Perhaps we could filter the suggested values by the time range the user is looking at.
@Bargs this has been raised as an issue in SIEM and APM in the past. It would be good to think about how we can improve it in a way that keeps the implementation consistent across Kibana, rather than having each solution build its own.
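Filtering by time range can be sketched as a Python function that builds an Elasticsearch search body for value suggestions, restricted to the selected time range. This is a minimal sketch under several assumptions, not Kibana's actual code:

- suggestions come from a terms aggregation on a keyword field, with an `include` regex matching the typed prefix;
- the time field is `@timestamp`;
- the function names, the aggregation name `suggestions`, the default range and the escaping rules are hypothetical. The escaping follows Lucene's reserved regular-expression characters.

The `phn` field and prefix come from an example later in this thread.

```python
from typing import Any, Dict

# Characters reserved in Lucene regular expressions (assumption: full set incl. optional operators).
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text: str) -> str:
    """Backslash-escape characters that are reserved in Lucene regular expressions."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in text)


def suggestion_body(field: str, prefix: str, gte: str = "now-15m", lte: str = "now",
                    time_field: str = "@timestamp", size: int = 10) -> Dict[str, Any]:
    """Hypothetical builder: suggest values of `field` starting with `prefix`,
    counting only documents inside the user's selected time range."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            "bool": {
                # Restrict the aggregation to the time range shown in the time picker.
                "filter": [{"range": {time_field: {"gte": gte, "lte": lte}}}]
            }
        },
        "aggs": {
            "suggestions": {
                "terms": {
                    "field": field,
                    "include": escape_lucene_regex(prefix) + ".*",  # prefix match
                    "size": size,
                }
            }
        },
    }


body = suggestion_body("phn", "dukecdedge01.rd", gte="now-1h")
print(body["aggs"]["suggestions"]["terms"]["include"])  # dukecdedge01\.rd.*
```

With a 1-hour range, the aggregation only looks at documents from that hour instead of every matching document in the index.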

@Bargs @TinaHeiligers

KQL AppServices feedback_needed

All 26 comments

Pinging @elastic/kibana-app

@AlonaNadler by default, value suggestions should never take more than a second to appear because we have a timeout. However, there is a setting in kibana.yml for configuring this timeout. When you experienced the slowness, do you know whether this timeout had been raised above its default?

+1 on limiting values to the time range. We are experiencing a huge slowdown in queries and high CPU usage on our cluster (hot+cold, 50TB of data) because suggestions take a long time to fetch values from it. It also queries for every single character you type in the filter box: 20 searches sent in serial that need to complete before the final results come back. This drives all my warm nodes to 100% CPU. I'd like to be able to turn off keystroke-by-keystroke suggestions while keeping suggestions on. That currently doesn't seem possible.
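One possible mitigation for a search per character is to debounce the input: only send a request once the user has paused typing for a quiet period. The sketch below simulates trailing-edge debouncing on a recorded list of keystroke timestamps, which keeps it testable. The function name and the 300 ms window are assumptions; a real UI would implement this with a timer.

```python
from typing import List, Tuple


def debounced_queries(keystrokes: List[Tuple[int, str]], wait_ms: int) -> List[str]:
    """Return the inputs that a trailing-edge debounce would actually send.

    `keystrokes` is a time-ordered list of (timestamp_ms, current_input) pairs.
    An input is sent only if no further keystroke arrives within `wait_ms`.
    """
    sent = []
    for i, (t, text) in enumerate(keystrokes):
        is_last = i + 1 == len(keystrokes)
        if is_last or keystrokes[i + 1][0] - t >= wait_ms:
            sent.append(text)
    return sent


# Typing "phn: duke" one character every 80 ms, then stopping.
typed = "phn: duke"
steady = [(i * 80, typed[: i + 1]) for i in range(len(typed))]
print(len(steady), debounced_queries(steady, wait_ms=300))  # 9 ['phn: duke']
```

Without debouncing, every one of the 9 intermediate inputs would trigger a search; with it, only the input present when typing pauses does.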

All in-flight queries should also be terminated once the user has finished entering what they need (or has saved the filter).
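Terminating stale queries can be sketched as "latest request wins": each new input cancels whatever suggestion request is still in flight. The sketch uses Python's asyncio task cancellation as a stand-in for aborting an HTTP request. `LatestOnly` and `fake_search` are hypothetical names. Whether cancelling the client request also stops the search on the Elasticsearch side is outside this sketch.

```python
import asyncio
from typing import Awaitable, Optional, TypeVar

T = TypeVar("T")


class LatestOnly:
    """Run awaitables so that starting a new one cancels the previous one."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None

    async def run(self, coro: Awaitable[T]) -> Optional[T]:
        previous = self._task
        if previous is not None and not previous.done():
            previous.cancel()  # the user typed again: abandon the stale request
        task = asyncio.ensure_future(coro)
        self._task = task
        try:
            return await task
        except asyncio.CancelledError:
            if self._task is not task:
                return None  # superseded by a newer request
            raise  # our own caller was cancelled; propagate it


async def fake_search(text: str, delay: float, log: list) -> str:
    await asyncio.sleep(delay)  # stands in for the cluster doing work
    log.append(text)
    return f"suggestions for {text}"


async def demo():
    log: list = []
    latest = LatestOnly()
    first = asyncio.ensure_future(latest.run(fake_search("du", 0.2, log)))
    await asyncio.sleep(0.01)  # let the first request start
    second = await latest.run(fake_search("duke", 0.05, log))
    return await first, second, log


print(asyncio.run(demo()))  # (None, 'suggestions for duke', ['duke'])
```

The request for "du" is cancelled before it finishes, so only the search for the final input completes.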

This is a great feature, although it doesn't play well with big clusters used for time series data.

See the Firefox network console when I type phn: dukecdedge01.rd.at.cox.net:

[Screenshot: Firefox network console]

We are experiencing the same issue. I will copy and paste my comment from the Elastic discussion board thread here: https://discuss.elastic.co/t/kql-related-performance-issue/199420

We have just updated our ELK installation from version 6.7.1 to 7.3.2 and are experiencing the same issue. After a couple of days looking at everything from segment count and GC settings to disk I/O on the hosts, I managed to trace our high CPU usage and high response times to Kibana's autocompletion of filter values with KQL. When using Lucene syntax, or setting filterEditor:suggestValues to Off as suggested above, everything is much more responsive!
I have attached a screenshot from Chrome DevTools showing a waterfall diagram of all requests created when searching from the Discover view using KQL with filter value suggestions enabled. Once the number of in-flight requests hits the browser's maximum, subsequent requests are stalled until a previous request completes. This can cause the actual search request to time out after 30 seconds.

[Screenshot: kibana-7.3-filter-editor-suggest-values-enabled-xhr-waterfall]

Adding another case here.

It took a long time (weeks) to diagnose the cause of a slow response: KQL autocomplete was adding more than 30 s to response times in this case.

Any updates on when the suggestion feature will use the selected time range instead of scanning the full index?

https://github.com/elastic/kibana/pull/48450 went into 7.6 and should resolve this issue

Pinging @elastic/kibana-app-arch (Team:AppArch)

The fix provided some help by making sure 30 requests don't reach the cluster while a user types.

The time range has not been addressed, though. Turning on filter value suggestions brings our cluster to a crawl for hours, since it tries to hit all our indices (cold and frozen).

Thanks for the feedback @smalenfant.
@elastic/kibana-app-arch I'm adding this to our short-term list; this is an important friction point to solve.

Any updates on the progress of this issue?

Still slow in Kibana 7.7.

We have a bug in 7.x that causes the terminate_after option to be ignored on search requests that use a size of 0. I think that explains why value suggestions are slower in 7.x (the bug was introduced in 7.0).
Although I agree with the comments made here, a value suggester that needs to hit all shards on every keystroke and retrieve 100k docs per shard will likely be slow on large deployments, even with the fix. We should look at a more scalable solution and evaluate the cost of having this feature enabled by default on every deployment.
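The cost of a missing cap can be estimated with simple arithmetic:

- `terminate_after` limits how many documents each shard collects, so when it is honoured a suggestion request collects at most `terminate_after` documents per shard.
- When it is ignored, as with the size-0 bug, every matching document on every shard is collected.

The shard sizes below are hypothetical; the 100,000 cap matches the "100k docs per shard" figure in the comment.

```python
from typing import Sequence


def docs_collected(shard_doc_counts: Sequence[int], terminate_after: int,
                   honoured: bool = True) -> int:
    """Upper bound on documents collected by one suggestion request.

    With terminate_after honoured, each shard stops after `terminate_after`
    documents; if it is ignored, each shard collects every matching document.
    """
    if honoured:
        return sum(min(n, terminate_after) for n in shard_doc_counts)
    return sum(shard_doc_counts)


# Hypothetical index: 10 shards with 2 million matching documents each.
shards = [2_000_000] * 10
per_request_capped = docs_collected(shards, 100_000)
per_request_uncapped = docs_collected(shards, 100_000, honoured=False)
print(per_request_capped, per_request_uncapped)  # 1000000 20000000

# Even with the cap, one request per keystroke multiplies the work:
# 20 keystrokes means 20 such requests.
print(20 * per_request_capped)  # 20000000
```

This is why fixing the bug alone may not be enough on large deployments: the per-request cap still scales with the number of shards and keystrokes.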

@lukasolson made an interesting suggestion: use async search to fetch results progressively.
We'd give the results an initial 1 s timeout and continue fetching them as long as the user hasn't typed something new or chosen an option.
We also talked about applying some kind of sorting to make sure "hot" data is queried first.
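The progressive approach could look roughly like the loop below:

1. Submit an async search that waits up to 1 s for initial results.
2. Show whatever partial results come back.
3. Keep polling until the search finishes or the user's input changes; in the latter case, delete the search.

The HTTP calls are injected as plain callables standing in for `POST /<index>/_async_search?wait_for_completion_timeout=1s`, `GET /_async_search/<id>` and `DELETE /_async_search/<id>`. The helper's name, its parameters and the poll limit are hypothetical. Sorting on `@timestamp` descending in the submitted body would be one way to reach "hot" data first.

```python
from typing import Any, Callable, Dict

Response = Dict[str, Any]


def progressive_suggestions(
    submit: Callable[[], Response],          # POST /<index>/_async_search?wait_for_completion_timeout=1s
    poll: Callable[[str], Response],         # GET /_async_search/<id>
    delete: Callable[[str], None],           # DELETE /_async_search/<id>
    input_changed: Callable[[], bool],       # True once the user types again or picks an option
    on_update: Callable[[Response], None],   # render (possibly partial) suggestions
    max_polls: int = 30,
) -> bool:
    """Show results progressively; return True if the search ran to completion."""
    res = submit()
    on_update(res["response"])
    polls = 0
    while res["is_running"]:
        if input_changed() or polls >= max_polls:
            delete(res["id"])  # free the search on the cluster
            return False
        res = poll(res["id"])
        on_update(res["response"])
        polls += 1
    return True


def scripted(responses):
    """Fake transport that returns canned responses in order."""
    it = iter(responses)
    return lambda *_: next(it)


running = [{"id": "abc", "is_running": True, "response": {"partial": n}} for n in (1, 2)]
done = {"id": "abc", "is_running": False, "response": {"partial": 3}}

shown, deleted = [], []
finished = progressive_suggestions(
    submit=scripted(running[:1]),
    poll=scripted(running[1:] + [done]),
    delete=deleted.append,
    input_changed=lambda: False,
    on_update=lambda r: shown.append(r["partial"]),
)
print(finished, shown, deleted)  # True [1, 2, 3] []
```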

@jimczi does this make sense?

For testing purposes, I used my large data cluster and replaced the autocomplete query with one that simply gets the latest 20 documents over a 3-year time range.

I used async search, but it takes ~10 seconds to get the first result, even for this simple query. I get similar results running this query in Dev Tools.

How can we improve this? Or is this the expected performance?

```json
{
  "size": 50,
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "docvalue_fields": [
    "@message.keyword"
  ],
  "_source": false,
  "query": {
    "bool": {
      "filter": []
    }
  }
}
```

I could see several ways to optimize the query:

  • don't sort
  • use terminate_after
  • use a time range filter
  • disable total hits tracking (or set it to the same as size)
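Applied to the query above, the four suggestions can be sketched as a function that rewrites the request body. The function name, the default time range and the `terminate_after` value are assumptions for illustration. Two trade-offs apply:

- dropping the sort means the hits are no longer guaranteed to be the latest documents;
- `terminate_after` can return partial results.

```python
import copy
from typing import Any, Dict


def optimize_query(body: Dict[str, Any], time_field: str = "@timestamp",
                   gte: str = "now-15m", lte: str = "now",
                   terminate_after: int = 10_000) -> Dict[str, Any]:
    """Return a copy of `body` with the suggested optimizations applied."""
    opt = copy.deepcopy(body)
    opt.pop("sort", None)                       # don't sort
    opt["terminate_after"] = terminate_after    # stop collecting early on each shard
    opt["track_total_hits"] = False             # or an int, e.g. opt.get("size", 10)
    filters = opt.setdefault("query", {}).setdefault("bool", {}).setdefault("filter", [])
    filters.append({"range": {time_field: {"gte": gte, "lte": lte}}})  # time range filter
    return opt


original = {
    "size": 50,
    "sort": [{"@timestamp": {"order": "desc"}}],
    "docvalue_fields": ["@message.keyword"],
    "_source": False,
    "query": {"bool": {"filter": []}},
}
optimized = optimize_query(original, gte="now-1h")
print(sorted(optimized))
```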

@weltenwort I posted this query as part of benchmarking different autocomplete query combinations :)
I definitely tried not sorting, using the time range filter, and disabling total hits tracking.
Terminate after would just yield partial results, correct?

Yes, AFAIK it would return as soon as the hit count is reached.

I don't understand why we are discussing how to optimize this query when Kibana 6 simply limited the time range and it worked great. Reverting to how autocomplete worked before would fix most of the issues. Later on, we can discuss whether it can be optimized further.

I've benchmarked the performance of our current terms aggregation autocomplete query.
I also tried fetching the latest 50 documents (to potentially combine it with the terms results to speed up the process) and played around with a significant_terms aggregation, with and without a sampler.

I tried out the following configurations:

  • With and without trackTotalHits
  • With and without a timerange applied
  • With and without sorting
  • With various shard_size configurations
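A significant_terms aggregation "with a sampler" usually means nesting it under a `sampler` aggregation. The sampler limits the nested aggregation to the top-scoring documents on each shard. The benchmark's exact request bodies aren't included in the thread, so the builder below is only a hypothetical illustration of the two SIG TERMS variants. The field name and shard size are assumptions.

```python
from typing import Any, Dict, Optional


def sig_terms_body(field: str, sampler_shard_size: Optional[int] = None) -> Dict[str, Any]:
    """Build a size-0 search body with a significant_terms aggregation on `field`,
    optionally nested under a sampler aggregation."""
    sig = {"significant_terms": {"field": field}}
    if sampler_shard_size is None:
        aggs = {"values": sig}
    else:
        # The sampler keeps only the top `sampler_shard_size` scoring documents
        # per shard and runs the nested aggregation on that sample.
        aggs = {
            "sample": {
                "sampler": {"shard_size": sampler_shard_size},
                "aggs": {"values": sig},
            }
        }
    # No hits are needed, and total hit tracking is disabled ("without totals").
    return {"size": 0, "track_total_hits": False, "aggs": aggs}


plain = sig_terms_body("message.keyword")
sampled = sig_terms_body("message.keyword", sampler_shard_size=200)
```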

The data used for these tests is a set of ~40 million logs generated into a 7.10 staging cloud instance with the default configuration.

Results

Times are in ms, taken from the took field of the Elasticsearch response.

| Records | TERMS w/ totals, w/o time range, w/o sort | TERMS w/ totals, w/ sort | LATEST w/ totals, w/ sort | TERMS w/o totals, w/ sort | LATEST w/o totals, w/ sort | TERMS w/o totals, w/o sort | SIG TERMS w/o totals, w/ sort | SIG TERMS w/o totals, w/ sort, sampler |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12M | 6175 | 1862 | 275 | 1024 | 428 | 976 | 2458 | 3702 |
| 20M | 6155 | 4688 | 3003 | 3260 | 1312 | 3257 | 6365 | 4048 |
| 40M | 6208 | 7548 | 3465 | 5774 | 350 | 6104 | 9352 | 7023 |

So it's evident from this table that:

  • Without a time range, the terms aggregation runs at its maximum runtime. Even with a time range, performance does not reach acceptable levels on an average dataset with no other queries running on the cluster.
  • Fetching last X docs is a good way to improve time to initial results
  • We shouldn't fetch totals when fetching autocomplete results
  • Sorting doesn't have a visible impact on the terms aggregation
  • Using a significant terms agg with a sampler didn't seem to make a difference (at least with a basic configuration)
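The totals and sorting conclusions can be checked directly against the table. The snippet below copies three TERMS columns from the results (times in ms, rows for 12M, 20M and 40M records). It computes how much time dropping total hits saved, and how far the unsorted terms timings deviate from the sorted ones.

```python
# Values copied from the results table (ms); rows = 12M, 20M, 40M records.
terms_totals_sort = [1862, 4688, 7548]       # TERMS with totals, with sort
terms_no_totals_sort = [1024, 3260, 5774]    # TERMS without totals, with sort
terms_no_totals_no_sort = [976, 3257, 6104]  # TERMS without totals, without sort

saved_by_dropping_totals = [a - b for a, b in zip(terms_totals_sort, terms_no_totals_sort)]
print(saved_by_dropping_totals)  # [838, 1428, 1774]

# Relative difference between unsorted and sorted terms timings.
sort_effect = [abs(u - s) / s for u, s in zip(terms_no_totals_no_sort, terms_no_totals_sort)]
print([f"{x:.1%}" for x in sort_effect])  # ['4.7%', '0.1%', '5.7%']
```

Dropping totals saves between roughly 0.8 s and 1.8 s per request. Sorting changes the terms timings by at most a few percent, in both directions.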

I'm more interested in the results when the time range is smaller, like 1 hour or 1 day. Currently, even if the selected time range is 1 hour, Kibana still queries all the unique terms in the whole cluster, which totally kills the cluster. That's why I'm saying the time range should be implemented first, and other optimizations added later.

I guess that Trello board is internal only?

@kustodian These are just an artifact of a misconfigured integration. This is still the main issue used to track and discuss.
