Elasticsearch: Introduce guards against too many clauses in SpanQuery

Created on 3 Sep 2018  ·  6 comments  ·  Source: elastic/elasticsearch

Running a query_string query with "type": "phrase" leads to java.lang.OutOfMemoryError.
This happens when the query string produces a lot of tokens, which results in a spanOr query with a huge number of spanNear clauses.

Steps to reproduce:
Elasticsearch version (bin/elasticsearch --version): 6.4/6.3.2

PUT dos_test 
{
  "mappings": {
    "company": {
      "properties": {
        "word": {
          "analyzer": "word_analyzer",
          "type": "text"
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "word_analyzer": {
            "tokenizer": "kuromoji_search_tokenizer",
            "type": "custom"
          }
        },
        "tokenizer": {
          "kuromoji_search_tokenizer": {
            "mode": "search",
            "nbest_cost": 10000,
            "type": "kuromoji_tokenizer"
          }
        }
      }
    }
  }
} 

This query leads to OOM:

GET dos_test/_search 
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚ใ‚",
            "fields": [
              "word"
            ],
            "type": "phrase",
            "default_operator": "AND"
          }
        }
      ]
    }
  }
}
Labels: :Search/Search, >bug

All 6 comments

Analyzing a shortened version of their query string:

GET dos_test/_analyze
{
  "field" : "word",
  "text" : "ああああああああああああああああああああ"
}

results in:

{
    "tokens": [
        {
            "token": "ใ‚",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "ใ‚ใ‚",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0,
            "positionLength": 2
        },
        {
            "token": "ใ‚",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 1
        },
....

The 20-letter word "ああああああああああああああああああああ" results in 39 tokens.
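The 39 tokens follow directly from the _analyze output above: the tokenizer emits one unigram per position plus one bigram spanning each adjacent pair, i.e. n + (n - 1) = 2n - 1 tokens for an n-character run. A minimal model of that token graph (plain Python, not the actual kuromoji tokenizer):

```python
def nbest_tokens(text: str):
    """Model of the token graph seen in the _analyze output above:
    one unigram per position plus one bigram over each adjacent pair."""
    tokens = []
    for i, ch in enumerate(text):
        tokens.append({"token": ch, "position": i, "positionLength": 1})
        if i + 1 < len(text):
            tokens.append({"token": text[i:i + 2], "position": i, "positionLength": 2})
    return tokens

print(len(nbest_tokens("あ" * 20)))  # 39, matching the _analyze result
```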

Searching a shortened version of their query string:

{
  "query": {
    "query_string": {
      "fields" : ["word"],
      "query" : "ใ‚ใ‚ใ‚ใ‚ใ‚",
      "type" : "phrase"
    }
  }
}

results in a massive Span query:

"explanation": "spanOr([spanNear([word:ใ‚, word:ใ‚, word:ใ‚, word:ใ‚, word:ใ‚], 0, true), spanNear([word:ใ‚, word:ใ‚, word:ใ‚, word:ใ‚ใ‚], 0, true), spanNear([word:ใ‚, word:ใ‚, word:ใ‚ใ‚, word:ใ‚], 0, true), spanNear([word:ใ‚, word:ใ‚ใ‚, word:ใ‚, word:ใ‚], 0, true), spanNear([word:ใ‚, word:ใ‚ใ‚, word:ใ‚ใ‚], 0, true), spanNear([word:ใ‚ใ‚, word:ใ‚, word:ใ‚, word:ใ‚], 0, true), spanNear([word:ใ‚ใ‚, word:ใ‚, word:ใ‚ใ‚], 0, true), spanNear([word:ใ‚ใ‚, word:ใ‚ใ‚, word:ใ‚], 0, true)])"

A longer version of this query string produces exponentially more clauses.
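The clause count can be made precise: each spanNear clause corresponds to one path through the token graph, i.e. one way of segmenting the run into 1- and 2-character tokens. A path either starts with a unigram (leaving n-1 characters) or a bigram (leaving n-2), so the count follows the Fibonacci recurrence and grows exponentially. A quick sketch (plain Python, helper name is illustrative):

```python
def phrase_paths(n: int) -> int:
    """Number of segmentations of an n-character run into 1- and
    2-character tokens; each segmentation becomes one spanNear clause.
    Fibonacci recurrence: paths(n) = paths(n-1) + paths(n-2)."""
    a, b = 1, 1  # paths(0), paths(1)
    for _ in range(n - 1):
        a, b = b, a + b
    return b

print(phrase_paths(5))   # 8 -- the eight spanNear clauses shown above
print(phrase_paths(20))  # 10946
print(phrase_paths(40))  # 165580141
```

For the 5-character query above this gives exactly the 8 clauses in the explanation; at 20 characters it is already ~11k clauses, which is why longer inputs exhaust the heap.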

Pinging @elastic/es-search-aggs

Adding a discuss label, as I am not sure if we still want to work with Span queries after Alan introduces his match queries.

@mayya-sharipova I opened https://issues.apache.org/jira/browse/LUCENE-8479 in Lucene.
The QueryBuilder should detect these bad queries and throw TooManyClauses when the expansion of a phrase query produces too many paths. Note that the JapaneseTokenizer can create big graphs depending on the value of nbest_cost; in the provided example, setting this option to 10,000 causes issues when phrase queries are built, since we don't limit the number of expanded paths.
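The idea behind the guard in LUCENE-8479 can be approximated as: count the expansion paths while walking the token graph and fail fast once a limit is exceeded, before any span clauses are materialized. A hedged sketch in plain Python (the real fix lives in Lucene's QueryBuilder; the names and the 1024 limit, Lucene's default max clause count, are illustrative here):

```python
MAX_CLAUSE_COUNT = 1024  # Lucene's default BooleanQuery max clause count

class TooManyClauses(Exception):
    """Raised when phrase expansion would produce too many clauses."""

def count_paths_guarded(n: int, limit: int = MAX_CLAUSE_COUNT) -> int:
    """Count 1-/2-char segmentations of an n-char run, raising as soon
    as the running count exceeds `limit` -- before building any clauses."""
    a, b = 1, 1  # paths(0), paths(1)
    for _ in range(n - 1):
        a, b = b, a + b
        if b > limit:
            raise TooManyClauses(f"phrase expansion exceeds {limit} paths")
    return b
```

With this guard, a 20-character run raises TooManyClauses immediately instead of allocating a spanOr with ~11k spanNear clauses, so the request fails with an error rather than an OOM.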

Removing the discuss label, as a patch is already available in Lucene.

@mayya-sharipova this now fails in 6.5 with the introduced "too_many_clauses" exception, so I think this issue can be closed. Please reopen if you think there is anything left to do.

