Kibana: The KQL autocomplete values can take a long time

Created on 18 Sep 2019 · 26 Comments · Source: elastic/kibana

Currently, KQL autocomplete value suggestions can be slow to appear, especially when a lot of data has to be queried to find the possible values. This makes autocomplete slow and frustrating to use.
One possible reason for the slowness (there might be others) is that autocomplete doesn't take into account the time range the user is looking at and suggests all possible values.

Perhaps we can filter the values based on the time range the users are looking at when querying.
@Bargs this was raised as an issue in SIEM and APM in the past. It would be good to think about how we can improve it so that the implementation stays consistent across Kibana, rather than each solution building its own.
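
To illustrate the proposal, here is a minimal sketch of a value-suggestion request that is restricted to the time range the user is viewing. It is not Kibana's actual implementation: the function names, the `suggestions` aggregation name, and the example field and range values are assumptions made for illustration. It relies only on standard Elasticsearch query DSL (a `range` filter plus a `terms` aggregation with an `include` pattern).

```python
import json

# Characters treated as reserved in Elasticsearch regular expressions.
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text):
    """Backslash-escape reserved characters so user input is matched literally."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in text)


def build_suggestion_request(field, prefix, time_field, gte, lte, size=10):
    """Build a terms-aggregation body limited to the selected time range.

    All parameter names are hypothetical; only the DSL keys are Elasticsearch's.
    """
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": [{"range": {time_field: {"gte": gte, "lte": lte}}}]
            }
        },
        "aggs": {
            "suggestions": {
                "terms": {
                    "field": field,
                    "include": escape_lucene_regex(prefix) + ".*",
                    "size": size,
                }
            }
        },
    }


if __name__ == "__main__":
    body = build_suggestion_request("host.name", "web-", "@timestamp", "now-15m", "now")
    print(json.dumps(body, indent=2))
```

With a range filter in place, shards whose data falls entirely outside the selected window have little or nothing to aggregate, which is the effect the issue asks for.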

@Bargs @TinaHeiligers

KQL AppServices feedback_needed

AlonaNadler

👍5

All 26 comments

Pinging @elastic/kibana-app

elasticmachine on 18 Sep 2019

@AlonaNadler by default the value suggestions should never take more than a second to appear because we have a timeout. However, there is a setting in kibana.yml for configuring this timeout. When you experienced the slowness, do you know if this timeout had been increased above its default?

Bargs on 19 Sep 2019

+1 on time values. We are experiencing a huge slowdown in queries and high CPU usage on our cluster because suggestions take a long time to get values from the cluster (hot+cold with 50TB of data). It also queries for every single character you type in the filter box: 20 searches are sent in series, and all of them need to complete before the final results come back. That results in 100% CPU across all my warm nodes. I'd like to be able to turn off "keystroke by keystroke" suggestions while still keeping suggestions turned on. Currently that doesn't seem possible.

All queries should also be terminated once a user has fully entered what they needed (or saved the filter).

This is a great feature, although it doesn't play well with big clusters used for time series data.
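
The behavior requested here (no request per keystroke, and superseded requests abandoned) can be sketched as a debounce with cancellation. This is only an illustration in Python with `asyncio`, not Kibana code: `SuggestionRequester`, `fetch`, and the delay value are hypothetical. Cancelling the client-side task does not by itself guarantee that the cluster stops working on a search that was already sent.

```python
import asyncio


class SuggestionRequester:
    """Debounce keystrokes and cancel suggestion requests that were superseded."""

    def __init__(self, fetch, delay=0.3):
        self._fetch = fetch  # hypothetical coroutine: fetch(text) -> list of values
        self._delay = delay  # quiet period (seconds) before a request is sent
        self._task = None

    def on_input(self, text):
        """Call on every keystroke from within a running event loop."""
        if self._task is not None and not self._task.done():
            self._task.cancel()  # drop the pending or in-flight request
        self._task = asyncio.create_task(self._run(text))
        return self._task

    async def _run(self, text):
        await asyncio.sleep(self._delay)  # cancelled here if the user keeps typing
        return await self._fetch(text)


async def demo():
    sent = []

    async def fake_fetch(text):
        sent.append(text)
        return [text + "-value"]

    requester = SuggestionRequester(fake_fetch, delay=0.05)
    task = None
    for text in ["h", "ho", "hos", "host"]:
        task = requester.on_input(text)
        await asyncio.sleep(0.01)  # the user types faster than the debounce delay
    result = await task
    return sent, result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, four keystrokes arrive within the quiet period, so only the final input reaches `fake_fetch`.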

smalenfant on 15 Oct 2019

See the Firefox network console when I type phn: dukecdedge01.rd.at.cox.net:

smalenfant on 15 Oct 2019

We are experiencing the same issue. I will copy and paste my comment from the Elastic discussion board thread here: https://discuss.elastic.co/t/kql-related-performance-issue/199420

We have just updated our ELK installation from version 6.7.1 to 7.3.2 and we are experiencing the same issue. After a couple of days looking at everything from segment count and GC settings to disk I/O on the hosts, I managed to pinpoint our high CPU usage and high response times to Kibana and auto completion of filter values together with KQL. When using Lucene syntax or setting filterEditor:suggestValues to Off as suggested above, everything is much more responsive!
I have attached a screenshot from Chrome DevTools showing a waterfall diagram of all requests created when trying to search from the Discover view using KQL in Kibana when filter value suggestions are enabled. After the number of in-flight requests hits the browser's maximum, subsequent requests are stalled until a previous request completes. This can result in the actual search query request timing out after 30 seconds.

[Screenshot: XHR waterfall in Kibana 7.3 with filter editor value suggestions enabled]

atoom on 17 Oct 2019

Adding another case here.

Took a long time (weeks) to diagnose the reason for a slow response - KQL autocomplete was adding >30s to response times in this case.

markharwood on 2 Dec 2019

Any updates on when the suggest feature will use the "time range" selected instead of the full index scan?

smalenfant on 6 Feb 2020

👍1

https://github.com/elastic/kibana/pull/48450 went into 7.6 and should resolve this issue

rayafratkina on 7 Feb 2020

👍1

Pinging @elastic/kibana-app-arch (Team:AppArch)

elasticmachine on 20 Feb 2020

The fix provided some help in making sure 30 requests don't make it to the cluster while a user types.

The time range has not been addressed. Turning on filter value suggestions brings our cluster to a crawl for hours, since it tries to hit all our indexes (cold and frozen).

smalenfant on 12 Mar 2020

Thanks for the feedback @smalenfant.
@elastic/kibana-app-arch I'm adding this to our short-term plans; it is a friction point that is important to solve.

AlonaNadler on 12 Mar 2020

Any updates on the progress of this issue?

kustodian on 15 May 2020

Still slow in Kibana 7.7.

erickjordan on 1 Jun 2020

We have a bug in 7.x that causes the terminate_after option to be ignored on search requests that use a size of 0. That explains, I think, why value suggestions are slower in 7.x (the bug was introduced in 7.0).
Although I agree with the comments made here, a value suggester that needs to hit all shards on every keystroke and retrieve 100k docs per shard will likely be slow on large deployments even with the fix. We should look at a more scalable solution and evaluate the cost of having this feature enabled by default on every deployment.

jimczi on 17 Jun 2020

@lukasolson made an interesting suggestion: to use async search to fetch search results progressively.
We'd give a 1s initial timeout for the results, and continue fetching them, as long as the user is not typing something new or chooses an option.
We also talked about applying some kind of sorting, to make sure "hot" data is queried first.

@jimczi does this make sense?
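
A rough sketch of the progressive-fetch idea, under stated assumptions: `submit`, `poll`, `cancel`, and `should_stop` are hypothetical callables standing in for an HTTP client and UI state, and the response dictionaries only use the `id` and `is_running` fields that the Elasticsearch async search API returns. The endpoints named in the docstring are the standard async search ones; nothing here reflects how Kibana ended up implementing this.

```python
import time


def progressive_suggestions(submit, poll, cancel, should_stop,
                            interval=0.5, sleep=time.sleep):
    """Yield async search states as results become more complete.

    submit()      -> dict : starts the search, for example
                            POST <index>/_async_search?wait_for_completion_timeout=1s
    poll(id)      -> dict : fetches the current state, for example GET _async_search/<id>
    cancel(id)            : abandons the search, for example DELETE _async_search/<id>
    should_stop() -> bool : True once the user typed something new or chose an option
    """
    state = submit()
    yield state  # initial (possibly partial) results after the first timeout
    while state.get("is_running") and not should_stop():
        sleep(interval)
        state = poll(state["id"])
        yield state  # progressively more complete results
    if state.get("is_running"):
        cancel(state["id"])  # stopped early: free cluster resources


if __name__ == "__main__":
    # Fake transport so the sketch runs without a cluster.
    states = iter([
        {"id": "abc", "is_running": True, "response": {"values": ["a"]}},
        {"id": "abc", "is_running": False, "response": {"values": ["a", "b"]}},
    ])
    for s in progressive_suggestions(lambda: next(states), lambda _id: next(states),
                                     lambda _id: None, lambda: False, interval=0):
        print(s["response"]["values"])
```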

lizozom on 28 Sep 2020

For testing purposes, I used my large data cluster and replaced the query used to fetch autocomplete values with one that simply gets the latest 20 documents over a 3 year time range.

I used async search, but it takes ~10 seconds until the first result even for this simple query. I'm getting similar results running this query in Dev Tools.

How can we improve this? Or is this the performance to be expected?

{
    "size": 50,
    "sort": [
      {
        "@timestamp": {
          "order": "desc"
        }
      }
    ],
    "docvalue_fields": [
      "@message.keyword"
    ],
    "_source": false,
    "query": {
      "bool": {
        "filter": []
      }
    }
}

lizozom on 4 Oct 2020

I could see several ways to optimize the query:

don't sort
use terminate_after
use a time range filter
disable total hits tracking (or set it to the same as size)
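
As a sketch of what these four changes would look like when applied to a request body like the one posted above, assuming hypothetical parameter names and example values (the function is illustrative; `terminate_after`, `track_total_hits`, and the `range` filter are standard Elasticsearch search options):

```python
import copy
import json


def optimize_latest_docs_query(body, time_field, gte, lte, terminate_after):
    """Return a copy of `body` with the four suggested optimizations applied."""
    optimized = copy.deepcopy(body)  # leave the caller's body untouched
    # 1. Don't sort.
    optimized.pop("sort", None)
    # 2. Stop collecting on each shard after this many documents.
    optimized["terminate_after"] = terminate_after
    # 3. Restrict the search to the selected time range.
    filters = (optimized.setdefault("query", {})
                        .setdefault("bool", {})
                        .setdefault("filter", []))
    filters.append({"range": {time_field: {"gte": gte, "lte": lte}}})
    # 4. Disable total hits tracking.
    optimized["track_total_hits"] = False
    return optimized


if __name__ == "__main__":
    original = {
        "size": 50,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "docvalue_fields": ["@message.keyword"],
        "_source": False,
        "query": {"bool": {"filter": []}},
    }
    print(json.dumps(
        optimize_latest_docs_query(original, "@timestamp", "now-1h", "now", 1000),
        indent=2,
    ))
```

One trade-off to keep in mind: once the sort is removed, the documents returned are no longer guaranteed to be the most recent ones, and `terminate_after` makes the results partial by design.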

weltenwort on 5 Oct 2020

@weltenwort I posted this query as part of benchmarking different autocomplete query combinations :)
I definitely tried not sorting, using the time range filter and disabling total hits tracking.
Terminate after would just yield partial results, correct?

lizozom on 14 Oct 2020

Yes, AFAIK it would return as soon as the hit count is reached.

weltenwort on 14 Oct 2020

I don't understand why we are discussing how to optimize this query when Kibana 6 just limited the time range and it worked great. We should revert to how auto-complete worked before, which would fix most of the issues. Later on, we can discuss whether it can be optimized further.

kustodian on 15 Oct 2020

I've benchmarked the performance of our current terms aggregation autocomplete query.
I also tried fetching the latest 50 documents (to potentially combine it with the terms results to speed up the process) and played around with a significant_terms aggregation, with and without a sampler.

I tried out the following configurations:

With and without trackTotalHits
With and without a timerange applied
With and without sorting
With various shard_size configurations

The data used for these tests is a ~40 million logs data set that was generated into a 7.10 staging cloud instance with default configuration.

Results

Times are taken from the took field on the Elasticsearch response in ms.

| Record # | TERMS w/totals, wo/timerange, wo/sort | TERMS w/totals, w/sort | LATEST w/totals, w/sort | TERMS wo/totals, w/sort | LATEST wo/totals, w/sort | TERMS wo/totals, wo/sort | SIG TERMS wo/totals, w/sort | SIG TERMS wo/totals, w/sort, sampler |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 12M | 6175 | 1862 | 275 | 1024 | 428 | 976 | 2458 | 3702 |
| 20M | 6155 | 4688 | 3003 | 3260 | 1312 | 3257 | 6365 | 4048 |
| 40M | 6208 | 7548 | 3465 | 5774 | 350 | 6104 | 9352 | 7023 |

So it's evident from this table that:

If timerange is not used, the terms aggregation runs at the maximal possible runtime, but even with the timerange, performance does not reach acceptable levels, on an average dataset and with no other queries running on the cluster.
Fetching last X docs is a good way to improve time to initial results
We shouldn't fetch totals when fetching autocomplete results
Sorting doesn't have a visible impact on the terms aggregation
Using a significant terms agg with a sampler didn't seem to make a difference (at least with a basic configuration)
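
To make two of these observations concrete, the short script below computes ratios from three columns of the results table (TERMS w/totals, w/sort; TERMS wo/totals, w/sort; LATEST wo/totals, w/sort). The numbers are copied from the table; the dictionary keys and function names are made up for this sketch, and single-run timings like these are only a rough indication.

```python
# "took" values in ms, copied from the results table (all with sorting applied).
TOOK_MS = {
    "12M": {"terms_with_totals": 1862, "terms_without_totals": 1024, "latest_without_totals": 428},
    "20M": {"terms_with_totals": 4688, "terms_without_totals": 3260, "latest_without_totals": 1312},
    "40M": {"terms_with_totals": 7548, "terms_without_totals": 5774, "latest_without_totals": 350},
}


def speedup(slow_ms, fast_ms):
    """How many times faster the second measurement is than the first."""
    return round(slow_ms / fast_ms, 2)


def summarize(took_ms):
    """Per data set size: gain from skipping totals, and LATEST versus TERMS."""
    summary = {}
    for records, row in took_ms.items():
        summary[records] = {
            "skip_totals": speedup(row["terms_with_totals"], row["terms_without_totals"]),
            "latest_vs_terms": speedup(row["terms_without_totals"], row["latest_without_totals"]),
        }
    return summary


if __name__ == "__main__":
    for records, ratios in summarize(TOOK_MS).items():
        print(records, ratios)
```

In these measurements, skipping totals makes the terms aggregation roughly 1.3 to 1.8 times faster, and fetching the latest documents is at least about 2.4 times faster than the terms aggregation.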

lizozom on 15 Oct 2020

I'm more interested in what the results are when the time range is smaller, like 1h or 1 day. Currently, even if the selected time range is 1h, Kibana still queries all the unique terms in the whole cluster, which totally kills the cluster. That's why I'm saying that the time range should be implemented first, and other optimizations should be added later.

kustodian on 15 Oct 2020

👍1

Autocomplete short term fix