Kibana version:
6.0.0-alpha1
Elasticsearch version:
6.0.0-alpha1
Server OS version:
Ubuntu 16.04.2 LTS
Browser version:
Safari 10.1.1 (12603.2.4)
Browser OS version:
macOS 10.12.5
Original install method (e.g. download page, yum, from source, etc.):
download
Description of the problem including expected versus actual behavior:
Discover view times out
Steps to reproduce:
Open the Discover view for a relatively small index (6.7K docs, 15.6MB data) that has a lot of fields (950 in my case) - the Discover view times out. The data is from a pcap file, extracted using tshark -T ek.
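For reference, the mapping (and hence the number of fields) can be checked by pulling it in Console:
GET packets-2017-06-01/_mapping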
Query executed (retrieved from slowlog)
GET packets-2017-06-01/_search
{
  "size": 500,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*",
            "fields": [],
            "use_dis_max": true,
            "tie_breaker": 0.0,
            "default_operator": "or",
            "auto_generate_phrase_queries": false,
            "max_determinized_states": 10000,
            "enable_position_increments": true,
            "fuzziness": "AUTO",
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "phrase_slop": 0,
            "analyze_wildcard": true,
            "escape": false,
            "split_on_whitespace": true,
            "boost": 1.0
          }
        },
        {
          "range": {
            "timestamp": {
              "from": null,
              "to": null,
              "include_lower": true,
              "include_upper": true,
              "boost": 1.0
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "version": true,
  "_source": {
    "includes": [],
    "excludes": []
  },
  "stored_fields": "*",
  "docvalue_fields": ["timestamp"],
  "script_fields": {},
  "sort": [
    {
      "timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "aggregations": {
    "2": {
      "date_histogram": {
        "field": "timestamp",
        "time_zone": "America/Los_Angeles",
        "interval": "135ms",
        "offset": 0,
        "order": {
          "_key": "asc"
        },
        "keyed": false,
        "min_doc_count": 1
      }
    }
  },
  "highlight": {
    "pre_tags": ["@kibana-highlighted-field@"],
    "post_tags": ["@/kibana-highlighted-field@"],
    "fragment_size": 2147483647,
    "fields": {
      "*": {
        "highlight_query": {
          "bool": {
            "must": [
              {
                "query_string": {
                  "query": "*",
                  "fields": [],
                  "use_dis_max": true,
                  "tie_breaker": 0.0,
                  "default_operator": "or",
                  "auto_generate_phrase_queries": false,
                  "max_determinized_states": 10000,
                  "enable_position_increments": true,
                  "fuzziness": "AUTO",
                  "fuzzy_prefix_length": 0,
                  "fuzzy_max_expansions": 50,
                  "phrase_slop": 0,
                  "analyze_wildcard": true,
                  "escape": false,
                  "split_on_whitespace": true,
                  "all_fields": true,
                  "boost": 1.0
                }
              },
              {
                "range": {
                  "timestamp": {
                    "from": 1331901000000,
                    "to": 1331901006792,
                    "include_lower": true,
                    "include_upper": true,
                    "format": "epoch_millis",
                    "boost": 1.0
                  }
                }
              }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
          }
        }
      }
    }
  }
}
Response when executed in Console (truncated)
{
  "took": 100490,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "hits": {
    "total": 6666,
    "max_score": null,
    "hits": [
Response from same query, without highlight
{
  "took": 114,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "hits": {
    "total": 6666,
    "max_score": null,
    "hits": [
100 seconds vs. 114 milliseconds is quite the difference on a dataset that comfortably fits into memory.
cc @lukasolson
Wow, I'll say. Is it just as slow if you query for something in a specific field instead of *? Is there an easy way to reproduce your data set?
If I change the highlight_query to just one field and turn off all_fields, it's faster - about 5s. Alternatively, reducing to size: 5 (from 500) gets me to 1s - it seems to scale linearly.
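For reference, this is roughly what the reduced highlight section looked like - just a sketch, with ip.src standing in for whichever single field you pick; the rest of the request is unchanged:
"highlight": {
  "pre_tags": ["@kibana-highlighted-field@"],
  "post_tags": ["@/kibana-highlighted-field@"],
  "fragment_size": 2147483647,
  "fields": {
    "*": {
      "highlight_query": {
        "query_string": {
          "query": "*",
          "fields": ["ip.src"],
          "analyze_wildcard": true
        }
      }
    }
  }
}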
Here's the first 10K lines of the dataset as a bulk-insertable file: https://s3.amazonaws.com/cwurm/public/file.aaaaaaaaaa
To insert it, I had to do some configuration before it would go in:
Since tshark generates some duplicate fields, I disabled strict duplicate checking in jvm.options:
-Des.xcontent.strict_duplicate_detection=false
Mapping:
PUT packets-2017-06-01
{
  "settings": {
    "index": {
      "mapping.total_fields.limit": 2000,
      "search.slowlog.threshold.fetch.debug": "500ms",
      "search.slowlog.threshold.fetch.info": "800ms",
      "search.slowlog.threshold.fetch.trace": "200ms",
      "search.slowlog.threshold.fetch.warn": "1s",
      "search.slowlog.threshold.query.debug": "2s",
      "search.slowlog.threshold.query.info": "5s",
      "search.slowlog.threshold.query.trace": "500ms",
      "search.slowlog.threshold.query.warn": "10s",
      "number_of_shards": 2,
      "number_of_replicas": 0,
      "refresh_interval": "5s"
    }
  },
  "mappings": {
    "pcap_file": {
      "properties": {
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}
Background
I took a large pcap file and used Wireshark's tshark -T ek to generate this file for Elasticsearch.
tshark's ES export is by no means optimal. It exports all fields as strings, and Elasticsearch's dynamic mapping will turn those into text + keyword multi-fields (most values are under 256 chars). It will export fields of all layers (ip, tcp, http, smb, etc.) of the packets into the documents - so there end up being a lot of sparse fields in the higher-level protocol layers. Each document is quite small - it represents a packet frame and will be maybe 2-3 KB max in most cases.
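For anyone reproducing this who wants to avoid the text + keyword multi-fields, a dynamic template along these lines should map the dynamically added strings as keyword only - the template name is just illustrative, and it has to be part of the index creation request:
PUT packets-2017-06-01
{
  "mappings": {
    "pcap_file": {
      "dynamic_templates": [
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      ]
    }
  }
}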
Some network protocols have a huge number of possible fields (e.g. Wireshark's docs list 733 possible fields for SMB alone: https://www.wireshark.org/docs/dfref/) - that's how I ended up with 950 fields in the index mapping.
I have a similar issue with ES/Kibana 5.4.0
I'm indexing HTTP access logs but including all request and response headers, which results in a large number of fields.
Doing a search in Kibana for '*' generates a query to ES with
{
  "query_string": {
    "analyze_wildcard": true,
    "query": "*"
  }
},
Removing that term from the query results in about a 10x speedup and no change to the results.
Having ES profile the query shows it running a ConstantScoreQuery and TermQuery against each field as a child of the "*" search.
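That breakdown comes from the profile API - something along these lines, with the index name being whatever your access-log index is called:
GET access-logs-*/_search
{
  "profile": true,
  "query": {
    "query_string": {
      "analyze_wildcard": true,
      "query": "*"
    }
  }
}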
I'm not really sure if this is better fixed in the ES query planner, Kibana's query builder or both?
My workaround is to change my saved searches in Kibana to include <field_thats_always_present>: * and disable highlighting.
@hamishforbes we have a separate issue for that problem: https://github.com/elastic/kibana/issues/12097
Most likely we'll replace the default "*" query with a match_all query behind the scenes.
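Roughly, that means the generated request would use something like this instead of the wildcard query_string (a sketch, not the final implementation):
{
  "query": {
    "match_all": {}
  }
}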
Ah! Thank you 👍 I did try looking for a more specific issue but it turns out searching for 'query *' isn't so easy!
Thanks for the detailed steps @cwurm
@lukasolson would you wanna check this out since you worked on highlighting before?
@lukasolson while you're working on this you may want to take a crack at https://github.com/elastic/kibana/issues/12097 as well. I think the combination of these two issues is really killing us. https://github.com/elastic/kibana/issues/12097 should be a simple switch from a query_string_query for * to a match_all as the default query.
PSA for any users currently affected by this: a workaround would be to set doc_table:highlight:all_fields and/or doc_table:highlight to false in Kibana's advanced settings.
I'm having an issue where I have a saved search on my dashboard. The fastest improvement was changing discover:sampleSize from 10000 to 2000 (hehe, I know it was very high, but I enjoyed having more results bucketed in the left side panel in Discover) - that reduced the duration from about 50s to 9s. Changing the default * in the search toolbar on the dashboard doesn't change the duration much (that may be because my visuals already have queries built in). Changing highlight from true to false brought the 9s down to 2-3s, which is helpful. Just thought I'd add to this thread in case it's any help.
Closing this out as we've made some improvements, both to the default search, and to highlighting, so this shouldn't be an issue any more. In my tests it seems to perform much better.