Elasticsearch: Introduce guards against too many clauses in SpanQuery

Created on 3 Sep 2018 · 6 Comments · Source: elastic/elasticsearch

Running a query_string query with "type": "phrase" leads to java.lang.OutOfMemoryError.
This happens when the query string produces a lot of tokens, which results in a spanOr query with a huge number of spanNear clauses.

Steps to reproduce:
Elasticsearch version (bin/elasticsearch --version): 6.4/6.3.2

PUT dos_test 
{
  "mappings": {
    "company": {
      "properties": {
        "word": {
          "analyzer": "word_analyzer",
          "type": "text"
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "word_analyzer": {
            "tokenizer": "kuromoji_search_tokenizer",
            "type": "custom"
          }
        },
        "tokenizer": {
          "kuromoji_search_tokenizer": {
            "mode": "search",
            "nbest_cost": 10000,
            "type": "kuromoji_tokenizer"
          }
        }
      }
    }
  }
}

This query leads to OOM:

GET dos_test/_search 
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "ああああああああああああああああああああああああああああああああああああああああ",
            "fields": [
              "word"
            ],
            "type": "phrase",
            "default_operator": "AND"
          }
        }
      ]
    }
  }
}

:Search/Search >bug

Source

mayya-sharipova

Most helpful comment

@mayya-sharipova I opened https://issues.apache.org/jira/browse/LUCENE-8479 in Lucene.
The QueryBuilder should detect these bad queries and throw TooManyClauses when the expansion of a phrase query produces too many paths. Note that the JapaneseTokenizer can create big graphs depending on the value of nbest_cost. In the provided example, setting this option to 10,000 causes problems when phrase queries are built, since we don't limit the number of expanded paths.
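To illustrate the kind of guard described here, the following is a minimal sketch in Python, not the Lucene patch itself. It assumes a token graph in which every position offers a one-character and a two-character token (the shape seen in this issue). The names `TooManyClauses`, `build_edges`, and `expand_paths` and the limit of 1024 are illustrative choices, not Lucene's API.

```python
class TooManyClauses(Exception):
    """Hypothetical stand-in named after Lucene's TooManyClauses exception."""


def build_edges(text):
    """Assumed graph: each position emits a 1-char token and, if possible, a 2-char token."""
    edges = {}
    for pos in range(len(text)):
        out = [(text[pos], pos + 1)]
        if pos + 1 < len(text):
            out.append((text[pos:pos + 2], pos + 2))
        edges[pos] = out
    return edges


def expand_paths(edges, start, end, max_paths):
    """Enumerate token paths from start to end, aborting once max_paths is exceeded."""
    paths = []
    stack = [(start, [])]  # depth-first walk keeps the pending work small
    while stack:
        pos, prefix = stack.pop()
        if pos == end:
            paths.append(prefix)
            if len(paths) > max_paths:
                # Fail fast instead of materializing every clause.
                raise TooManyClauses(f"more than {max_paths} paths")
            continue
        for token, next_pos in edges.get(pos, []):
            stack.append((next_pos, prefix + [token]))
    return paths


short = "あ" * 5
print(len(expand_paths(build_edges(short), 0, len(short), max_paths=1024)))  # 8

long_text = "あ" * 40
try:
    expand_paths(build_edges(long_text), 0, len(long_text), max_paths=1024)
except TooManyClauses as exc:
    print("rejected:", exc)
```

The point of the sketch is that the check happens during expansion, so the query is rejected after at most `max_paths + 1` paths rather than after the full expansion has exhausted the heap.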

jimczi on 3 Sep 2018

👍4 🎉1

All 6 comments

Analyzing a shortened version of their query string:

GET test-index/_analyze
{
  "field" : "word",
  "text" : "ああああああああああああああああああああ"
}

results in:

{
    "tokens": [
        {
            "token": "あ",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "ああ",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0,
            "positionLength": 2
        },
        {
            "token": "あ",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 1
        },
....

The 20-character word "ああああああああああああああああああああ" results in 39 tokens.
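The count of 39 follows from the pattern visible in the output above: one single-character token per position, plus one two-character token for each adjacent pair. A small sketch that models this pattern, assuming the analyzer emits only those two token shapes for this input (`model_tokens` is a hypothetical helper and does not call Elasticsearch or kuromoji):

```python
def model_tokens(text):
    """Model the token pattern from the _analyze output (an assumption, not kuromoji)."""
    tokens = []
    for pos in range(len(text)):
        # Single-character token at every position.
        tokens.append({"token": text[pos], "position": pos, "positionLength": 1})
        # Two-character token spanning this position and the next, when one exists.
        if pos + 1 < len(text):
            tokens.append({"token": text[pos:pos + 2], "position": pos, "positionLength": 2})
    return tokens


tokens = model_tokens("あ" * 20)
print(len(tokens))  # 20 single-character + 19 two-character tokens = 39
```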

Searching a shortened version of their query string:

{
  "query": {
    "query_string": {
      "fields" : ["word"],
      "query" : "あああああ",
      "type" : "phrase"
    }
  }
}

results in a massive Span query:

"explanation": "spanOr([spanNear([word:あ, word:あ, word:あ, word:あ, word:あ], 0, true), spanNear([word:あ, word:あ, word:あ, word:ああ], 0, true), spanNear([word:あ, word:あ, word:ああ, word:あ], 0, true), spanNear([word:あ, word:ああ, word:あ, word:あ], 0, true), spanNear([word:あ, word:ああ, word:ああ], 0, true), spanNear([word:ああ, word:あ, word:あ, word:あ], 0, true), spanNear([word:ああ, word:あ, word:ああ], 0, true), spanNear([word:ああ, word:ああ, word:あ], 0, true)])"

A longer version of this query string will produce many more clauses.
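To get a feel for the growth, the number of spanNear clauses can be estimated by counting the paths through the token graph. The sketch below assumes, as in the explanation above, that each step consumes either a one-character or a two-character token; `count_paths` is a hypothetical helper, not part of Lucene or Elasticsearch.

```python
def count_paths(num_chars):
    """Count the ways to cover num_chars characters with 1- and 2-character tokens."""
    # paths[i] is the number of distinct token sequences covering the first i characters.
    paths = [0] * (num_chars + 1)
    paths[0] = 1
    for i in range(1, num_chars + 1):
        paths[i] = paths[i - 1]       # last token is a single character
        if i >= 2:
            paths[i] += paths[i - 2]  # last token is a two-character token
    return paths[num_chars]


print(count_paths(5))   # 8, matching the eight spanNear clauses above
print(count_paths(20))  # 10946
print(count_paths(40))
```

Under this assumption the count follows the Fibonacci sequence, so each additional character multiplies the number of clauses by roughly 1.6.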

mayya-sharipova on 3 Sep 2018

Pinging @elastic/es-search-aggs

elasticmachine on 3 Sep 2018

Adding a discuss label, as I am not sure if we still want to work with Span queries after Alan introduces his match queries.

mayya-sharipova on 3 Sep 2018

jimczi on 3 Sep 2018

👍4 🎉1

Removing the discuss label, as a patch is already available in Lucene.

mayya-sharipova on 7 Sep 2018

@mayya-sharipova this seems to fail in 6.5 now with the introduced "too_many_clauses" exception, so I think this issue can be closed. Please reopen if you think there is anything left to do.