Elasticsearch: Wrap stacked tokens in `match` query in a BlendedTerms query for better scoring

Created on 30 Dec 2014 · 14 Comments · Source: elastic/elasticsearch

Stacked tokens (tokens in the same position) in a match query usually represent alternatives, e.g. query-time synonym expansion or fuzzy terms.

These queries tend to favour the rarer terms, which (especially with fuzzy queries) is likely to be the wrong choice (see #5883 and #3125).
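To illustrate the bias described above, here is a minimal sketch in Python. It assumes an IDF formula in the style of Lucene's classic TF-IDF similarity, and the document frequencies and term names are invented for illustration, not taken from a real index.

```python
import math


def classic_idf(doc_freq: int, doc_count: int) -> float:
    """IDF in the style of Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


# Hypothetical statistics: a common exact term and a rare typo variant
# that a fuzzy expansion would place at the same position.
DOC_COUNT = 1_000_000
doc_freqs = {"want": 50_000, "wanr": 3}

idf = {term: classic_idf(df, DOC_COUNT) for term, df in doc_freqs.items()}

# Sort the alternatives by IDF, highest first.
ranked = sorted(idf, key=idf.get, reverse=True)
print(ranked)
```

With these numbers the rare variant gets the larger IDF, so a document containing only the typo can outscore one containing the exact term.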

From https://github.com/elasticsearch/elasticsearch/pull/8352#issuecomment-61847572

The BlendedTermQuery should be used whenever two query terms are synonyms of each other and should be treated as 'one thing'. It tries to adjust statistics independently of the scoring function (which may have no concept of IDF) to deal with the problem.

But I think for it to work, it would need per-term boost support? Then we need a rewrite method that can build this instead of BooleanQuery, it would look a lot like the boolean one: https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/MultiTermQuery.java#L140

Per-term boost support is required to be able to take the fuzzy edit distance into account.
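A rough sketch of the idea in Python, not the actual BlendedTermQuery implementation: every alternative shares one document frequency (here the maximum across the group, an assumption), and a per-term boost derived from the edit distance is applied on top. The helper names, the boost formula and the statistics are all hypothetical.

```python
import math


def classic_idf(doc_freq: int, doc_count: int) -> float:
    """IDF in the style of Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


def edit_distance_boost(query_term: str, distance: int) -> float:
    """Hypothetical boost: 1.0 for an exact match, lower for each edit."""
    return 1.0 - distance / len(query_term)


def blended_weights(variants: dict, doc_count: int) -> dict:
    """Weight each variant with a shared IDF, scaled by its own boost.

    `variants` maps term -> (doc_freq, boost).
    """
    shared_df = max(df for df, _ in variants.values())
    shared_idf = classic_idf(shared_df, doc_count)
    return {term: boost * shared_idf for term, (_, boost) in variants.items()}


# Invented statistics for the query term "want" and one fuzzy variant.
variants = {
    "want": (50_000, edit_distance_boost("want", 0)),
    "wanr": (3, edit_distance_boost("want", 1)),
}
weights = blended_weights(variants, doc_count=1_000_000)
print(weights)
```

Because the IDF is shared, the only thing separating the variants is the boost, so the exact term now outweighs the typo.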

Labels: :Search/Search · >enhancement · help wanted

clintongormley

All 14 comments

@markharwood please could you take a look at this one

clintongormley on 8 May 2015

Being a fundamental ranking issue, I think this should be addressed upstream in Lucene.
I have a Lucene patch in progress that is adding a new MultiTermQuery rewrite method. It contains the logic to balance factors like IDF when there are many (typically automated) expansions of user criteria. We need to overcome Lucene's natural bias towards ranking the rare variants top in these cases and go for a more balanced use of IDF.

Once we've redefined what core features Lucene offers we can work through the implications for the elasticsearch functionality that wraps it.

markharwood on 8 May 2015

@markharwood +1

clintongormley on 8 May 2015

whythecode on 16 Sep 2015

@clintongormley @markharwood While checking on the status of #10391, I noticed it was merged, so FLT is no longer available in 2.0.0.
The Lucene patch seems to have been accepted. Does that mean this issue is resolved, and can we hope for a ranking on the normal fuzzy query that doesn't rank typos over exact matches?

"Once we've redefined what core features Lucene offers we can work through the implications for the elasticsearch functionality that wraps it."

This sentence in particular suggests there may be more work to be done on the ES side before the ranking is actually "fixed".

jeantil on 6 Nov 2015

(Lucene) fuzzy queries should rank sensibly now in 2.0. However, I'm unclear on the behaviour of the match query with synonyms, so I would not suggest closing this issue without further investigation.

markharwood on 6 Nov 2015

Any updates for this issue?

ptgamr on 15 Aug 2016

+1 Any updates?

agwidarsito on 24 Aug 2016

"Fuzzy" term expansions are fixed at the core Lucene level now [1].
Multi-field expansions are fixed in multi-match's cross_fields query mode.

Synonym expansions are likely harder to fix because they are part of the analysis phase and not the query phase. If you use a synonym analyzer at index time, the IDF will naturally be blended: st.==street==st when it comes to doc frequencies. With query-time synonym expansion, it is impossible for the query parser to know whether the user typed "st OR street" or whether the choice of analyzer injected that on their behalf, so we can't make good decisions.

[1] https://issues.apache.org/jira/browse/LUCENE-329
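The point about index-time synonyms blending doc frequencies can be sketched with a toy inverted index in Python. This is an illustration only: the synonym map, the documents and the whitespace tokenizer are made up, and it does not model a real Elasticsearch analyzer.

```python
from collections import defaultdict

# Hypothetical synonym map: each token expands to its whole synonym group.
SYNONYMS = {"st": {"st", "street"}, "street": {"st", "street"}}


def index_docs(docs, expand):
    """Build term -> set of doc ids, optionally expanding synonyms at index time."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            terms = SYNONYMS.get(token, {token}) if expand else {token}
            for term in terms:
                postings[term].add(doc_id)
    return postings


def doc_freq(postings, term):
    """Number of documents that contain the term."""
    return len(postings.get(term, ()))


docs = ["12 main street", "45 oak street", "7 elm street", "9 pine st"]

plain = index_docs(docs, expand=False)
expanded = index_docs(docs, expand=True)

print(doc_freq(plain, "street"), doc_freq(plain, "st"))
print(doc_freq(expanded, "street"), doc_freq(expanded, "st"))
```

Without expansion the two terms keep separate doc frequencies, so the rarer "st" would get the higher IDF. With index-time expansion both terms end up in the same set of documents and share one doc frequency.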

markharwood on 24 Aug 2016

Stacked terms are now wrapped in a synonyms query

clintongormley on 26 Nov 2016

Is it safe to assume that this is now available in 5.1?

Thanks.

jeantil on 14 Dec 2016

yes

clintongormley on 14 Dec 2016

Awesome, thank you!

jeantil on 14 Dec 2016

❤2

Could this feature be used for EdgeNGrams with a MultiMatch query? (We can't use CrossFields when using NGrams.) We don't want "wanr" to show up before "want" when searching for "want".