Running a query_string query with "type" = "phrase" leads to java.lang.OutOfMemoryError.
This happens when a query string produces a lot of tokens, which results in a spanOr query with a huge number of spanNear clauses.
Steps to reproduce:
Elasticsearch version (bin/elasticsearch --version): 6.4/6.3.2
PUT dos_test
{
  "mappings": {
    "company": {
      "properties": {
        "word": {
          "analyzer": "word_analyzer",
          "type": "text"
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "word_analyzer": {
            "tokenizer": "kuromoji_search_tokenizer",
            "type": "custom"
          }
        },
        "tokenizer": {
          "kuromoji_search_tokenizer": {
            "mode": "search",
            "nbest_cost": 10000,
            "type": "kuromoji_tokenizer"
          }
        }
      }
    }
  }
}
This query leads to OOM:
GET dos_test/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "ใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใใ",
            "fields": [
              "word"
            ],
            "type": "phrase",
            "default_operator": "AND"
          }
        }
      ]
    }
  }
}
Analyzing a shortened version of their query string:
GET test-index/_analyze
{
"field" : "word",
"text" : "ใใใใใใใใใใใใใใใใใใใใ"
}
results in:
{
  "tokens": [
    {
      "token": "ใ",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "ใใ",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "ใ",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    ....
The 20-character word "ใใใใใใใใใใใใใใใใใใใใ" results in 39 tokens.
Searching a shortened version of their query string:
{
  "query": {
    "query_string": {
      "fields" : ["word"],
      "query" : "ใใใใใ",
      "type" : "phrase"
    }
  }
}
results in a massive Span query:
"explanation": "spanOr([spanNear([word:ใ, word:ใ, word:ใ, word:ใ, word:ใ], 0, true), spanNear([word:ใ, word:ใ, word:ใ, word:ใใ], 0, true), spanNear([word:ใ, word:ใ, word:ใใ, word:ใ], 0, true), spanNear([word:ใ, word:ใใ, word:ใ, word:ใ], 0, true), spanNear([word:ใ, word:ใใ, word:ใใ], 0, true), spanNear([word:ใใ, word:ใ, word:ใ, word:ใ], 0, true), spanNear([word:ใใ, word:ใ, word:ใใ], 0, true), spanNear([word:ใใ, word:ใใ, word:ใ], 0, true)])"
A longer version of this query string will produce many more clauses.
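To get a feel for how quickly the clause count grows, here is a rough Python sketch. It assumes, based only on the _analyze output above and not on anything confirmed about the tokenizer's internals, that every position emits a one-character token and, where it fits, a two-character token. The helper name count_paths is hypothetical.

```python
def count_paths(n_chars: int) -> int:
    """Count the distinct token paths through an n_chars-long input.

    Assumption (not taken from Lucene): each position emits a
    1-character token and, if two characters remain, a 2-character token.
    Each path corresponds to one spanNear clause in the spanOr query.
    """
    # ways[pos] = number of ways to tokenize the suffix starting at pos
    ways = [0] * (n_chars + 1)
    ways[n_chars] = 1  # exactly one way to tokenize the empty suffix
    for pos in range(n_chars - 1, -1, -1):
        ways[pos] = ways[pos + 1]          # take the 1-character token
        if pos + 2 <= n_chars:
            ways[pos] += ways[pos + 2]     # take the 2-character token
    return ways[0]


if __name__ == "__main__":
    for n in (5, 20, 40):
        print(n, count_paths(n))
```

Under that assumption, 5 characters give 8 paths, which matches the eight spanNear clauses in the explanation above. 20 characters give 10946 paths and 40 characters give 165580141, so the growth follows the Fibonacci sequence and quickly becomes impossible to hold in memory.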
Pinging @elastic/es-search-aggs
Adding a discuss label, as I am not sure if we still want to work with Span queries after Alan introduces his match queries.
@mayya-sharipova I opened https://issues.apache.org/jira/browse/LUCENE-8479 in Lucene.
The QueryBuilder should detect these bad queries and throw TooManyClauses when the expansion of a phrase query produces too many paths. Note that the JapaneseTokenizer can create big graphs depending on the value of nbest_cost. In the provided example, setting this option to 10,000 causes issues when phrase queries are built, since we don't limit the number of expanded paths.
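A minimal sketch of the kind of guard described here, written in Python rather than Lucene's Java. This is not the LUCENE-8479 patch: the TooManyClauses class below is a local stand-in, and expand_paths, max_paths, and the limit of 1024 are assumptions chosen for illustration, as is the simplified token graph (a one-character and a two-character token at every position).

```python
class TooManyClauses(Exception):
    """Local stand-in for an exception signalling an oversized query."""


def expand_paths(text: str, max_paths: int = 1024) -> list:
    """Enumerate token paths, aborting once more than max_paths exist.

    Assumption: every position emits a 1-character token and, if two
    characters remain, a 2-character token. max_paths is a hypothetical
    limit, not a value taken from Lucene or Elasticsearch.
    """
    paths = []

    def walk(pos: int, prefix: list) -> None:
        if pos == len(text):
            if len(paths) >= max_paths:
                # Fail fast instead of building the whole expansion.
                raise TooManyClauses(
                    f"phrase expansion exceeds {max_paths} paths"
                )
            paths.append(prefix)
            return
        walk(pos + 1, prefix + [text[pos]])            # 1-character token
        if pos + 2 <= len(text):
            walk(pos + 2, prefix + [text[pos:pos + 2]])  # 2-character token

    walk(0, [])
    return paths


if __name__ == "__main__":
    print(len(expand_paths("aaaaa")))
    try:
        expand_paths("a" * 20)
    except TooManyClauses as exc:
        print("rejected:", exc)
```

The point of the design is that the check runs while paths are being generated, so a pathological input is rejected after at most max_paths paths have been materialised rather than after the heap is exhausted.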
removing discuss label, as a patch is already available in Lucene
@mayya-sharipova this seems to fail in 6.5 now with the introduced "too_many_clauses" exception, so I think this issue can be closed. Please reopen if you think there is anything left to do.