Elasticsearch: MatchQuery and GraphTokenStreamFiniteStrings can generate too many token streams and cause OOM while parsing

Created on 8 Mar 2017 · 2 Comments · Source: elastic/elasticsearch

I'm still investigating the issue, but under certain conditions a simple match query can send the node into a GC spiral and kill it.
(Note: this seems to affect 5.2.x but not 5.1.x.)

The test case is here: https://gist.github.com/nomoa/96506d97dc582e79e6ff5c5511a3d702

It triggers the problem by creating an index with a shingle analyzer.
It issues a query with 84 unigrams against a 3-gram shingle analyzer with a token limit set to 60: the query fails on the max boolean clause count. While debugging this, I found that it generates more than 39000 token streams.
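
To get a feel for why the number of token streams explodes, here is a minimal Python sketch that counts the distinct paths through a shingle graph. It is a simplified model, not Lucene's GraphTokenStreamFiniteStrings: it assumes every shingle size from 1 to `max_shingle_size` is available at every position, and the function name `count_shingle_paths` is made up for illustration. It does not try to reproduce the 39000 figure above, which also depends on the token limit.

```python
def count_shingle_paths(num_unigrams: int, max_shingle_size: int = 3) -> int:
    """Count the ways to cover num_unigrams positions with shingles of size 1..max_shingle_size.

    Simplified model (assumption): each covering corresponds to one path
    through the token graph, and therefore to one finite token stream.
    """
    # paths[i] = number of distinct paths from position i to the end of the query
    paths = [0] * (num_unigrams + 1)
    paths[num_unigrams] = 1  # exactly one way to finish once every position is covered
    for i in range(num_unigrams - 1, -1, -1):
        for size in range(1, max_shingle_size + 1):
            if i + size <= num_unigrams:
                paths[i] += paths[i + size]
    return paths[0]


if __name__ == "__main__":
    # The count grows exponentially with the number of unigrams.
    for n in (5, 10, 20, 84):
        print(n, count_shingle_paths(n, max_shingle_size=3))
```

Under this model each extra unigram multiplies the path count by a roughly constant factor, so a query of a few dozen words is already far beyond what a boolean query can hold.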

The last query simply runs without a token limit; it then fails with an OOM (Xmx set to 512m).
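
For readers who do not want to open the gist, the kind of index setup described above could look roughly like the following. This is a sketch built from the description only, not a copy of the gist: the filter and analyzer names (`shingle_3`, `limit_60`, `shingle_limited`, `shingle_unlimited`), the `standard` tokenizer, and the `lowercase` filter are assumptions. The `shingle` and `limit` token filter types and their parameters are the standard Elasticsearch ones.

```python
import json


def build_index_settings(max_shingle_size: int = 3, token_limit: int = 60) -> dict:
    """Build index settings with one limited and one unlimited shingle analyzer.

    All names below are hypothetical; only the filter types and parameters
    are standard Elasticsearch settings.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "shingle_3": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": max_shingle_size,
                        "output_unigrams": True,  # unigrams plus shingles create the side paths
                    },
                    "limit_60": {"type": "limit", "max_token_count": token_limit},
                },
                "analyzer": {
                    # Analyzer with the token limit (the query fails on max clause count)
                    "shingle_limited": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "shingle_3", "limit_60"],
                    },
                    # Analyzer without the token limit (the query runs out of memory)
                    "shingle_unlimited": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "shingle_3"],
                    },
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_index_settings(), indent=2))
```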

Is this expected?
The token limit filter appears to help work around the issue, but it's hard to set its value properly.
Would it make sense to add a circuit breaker earlier and bail out if the graph tries to generate more streams than a boolean query can accept?
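
To make the circuit breaker idea concrete, here is a hedged Python sketch of a check that stops counting as soon as the number of paths is known to exceed a limit, instead of enumerating every stream. It uses the same simplified shingle model as a stand-in for the real token graph; `graph_exceeds_limit` and `choose_analysis` are hypothetical names, and the limit of 1024 in the usage lines is only an illustrative value for the boolean clause limit.

```python
def graph_exceeds_limit(num_positions: int, max_shingle_size: int, max_paths: int) -> bool:
    """Return True as soon as the path count is known to exceed max_paths.

    Assumption: every shingle size 1..max_shingle_size exists at every position.
    """
    # counts[k] = number of ways to cover the first k positions
    counts = [1]  # one way to cover an empty prefix
    for _ in range(num_positions):
        total = sum(counts[-max_shingle_size:])
        if total > max_paths:
            # Counts never decrease, so the final count exceeds the limit too.
            return True
        counts.append(total)
    return False


def choose_analysis(num_positions: int, max_shingle_size: int, max_paths: int) -> str:
    """Pick 'graph' analysis when it is affordable, else fall back to 'flat'."""
    if graph_exceeds_limit(num_positions, max_shingle_size, max_paths):
        return "flat"
    return "graph"


if __name__ == "__main__":
    print(choose_analysis(10, 3, 1024))
    print(choose_analysis(84, 3, 1024))
```

The point of the design is that the check costs one pass over the positions and a small amount of memory, so it can run before any token stream is materialized.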

Can we bypass the graph analysis in some circumstances? What are the benefits of running a graph analysis if occur is SHOULD and minimum should match is 1?

:Search/Search >bug

Most helpful comment

@nomoa Yes, this is expected in 5.2, since we added support for graph queries in that version.

> Can we bypass the graph analysis in some circumstances, what are the benefits of running a graph analysis if occur is SHOULD and min should match is 1?

The benefit is that you can have multi-term synonyms like "new york" that must match "new" AND "york" even when occur is set to SHOULD. The shingle filter with unigrams is problematic because it creates side paths at every position.
But I agree that we should detect this situation earlier and fall back to the normal query analysis when the graph has too many side paths. I'll work on a solution. Thanks for reporting this.
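
To illustrate the multi-term synonym point, here is a small Python sketch comparing a flat query, where every term is an independent SHOULD clause, with a graph-style query, where each path through the token graph is a conjunction and the paths are alternatives. It is a toy set-membership model that ignores positions, phrases, and scoring. The "ny" synonym and the function names are assumptions made up for the example.

```python
def matches_flat(doc_terms: set, query_terms: list) -> bool:
    """Flat analysis: every term is optional, at least one must match."""
    return any(term in doc_terms for term in query_terms)


def matches_graph(doc_terms: set, paths: list) -> bool:
    """Graph analysis: a document matches if all terms of at least one path match."""
    return any(all(term in doc_terms for term in path) for path in paths)


# Hypothetical query "ny" with the multi-term synonym "new york".
flat_terms = ["ny", "new", "york"]
graph_paths = [["ny"], ["new", "york"]]

doc_new_car = {"new", "car"}
doc_new_york = {"new", "york", "city"}

if __name__ == "__main__":
    # The flat query matches a document that only contains "new".
    print(matches_flat(doc_new_car, flat_terms), matches_graph(doc_new_car, graph_paths))
    # Both forms match a document that contains "new" and "york".
    print(matches_flat(doc_new_york, flat_terms), matches_graph(doc_new_york, graph_paths))
```

In this model the flat form matches "new car" because "new" alone satisfies a SHOULD clause, while the graph form requires the whole "new york" path.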
