Elasticsearch: Disallow the `classic` (TF-IDF) similarity on 6.0 indices.

Created on 16 Feb 2017 · 14 Comments · Source: elastic/elasticsearch

BM25 should generally perform better than TF-IDF. Moreover, Lucene 7.0 (which Elasticsearch will be based on) is removing coords and query normalization, so we should start deprecating the classic similarity.

Labels: :Search/Search, >deprecation, discuss, good first issue

All 14 comments

@jpountz I made this an adoptme as it seems like something we intend to do rather than something that needs further discussion

Hi, I'm new to the project and would like to get started on this issue if no one else is already working on it.

Adding the discuss label to figure out whether we should just disallow it on new indices, or also add it in a plugin for users who might really really need TF-IDF scoring.

@jpountz can we really support it without query coordination? If not, then I'd opt for removing it. (Even if we can I'm leaning towards removing it)

The removal of query coordination might indeed significantly decrease the quality of this similarity. @eskibars Maybe you have some perspective on this? I think you suggested in the last search&aggs meeting that some users might rely specifically on TF-IDF.

> @eskibars Maybe you have some perspective on this? I think you suggested in the last search&aggs meeting that some users might rely specifically on TF-IDF.

I'm certain some people rely on the full old/classic TF-IDF implementation, but in a few different ways that I'll tease apart.

  1. We've seen people in our forums mention that they have regression tests that include ordering. My own history leads me to believe that while these users will be the minority, they are not terribly uncommon, and they often have extensive and long UAT cycles, which is why they built the tests in the first place. For these users, dropping classic similarity is going to mean one of two things: simply avoiding the upgrade entirely and sticking around on 5.x for as long as possible, or upgrading a test environment and then re-engaging their long UAT cycles, potentially failing them and having to go edit queries, etc. as they go through these cycles.
  2. To a much lesser extent, I've anecdotally heard that some users with a fixed catalog of data include the numeric score calculations of some specific queries in their regression tests.
  3. There are people who are convinced TF-IDF is better for their data+queries than BM25. I've heard this a number of times: sometimes the user has done a lot of testing and shown that TF-IDF is better in one way or another for them, and sometimes not. I also believe TF-IDF has more literature around it, is taught in more schools, and is generally a bit easier to understand (the formula is more or less in the name), so there may be some resistance purely due to how well understood the two are relative to one another.

To be clear, I think that the vast majority of users do not fall into any of these categories, and I've seen that BM25 does perform better in the vast majority of UAT (especially in text search) and is built on sound principles such as avoiding keyword stuffing. Given that Lucene is dropping coord, I'm not sure what we can do for users falling into categories 1 and 2 other than brace them for the impact ASAP. For users in the third category, one thing I was thinking about was the possibility of pulling classic similarity out into a plugin that we could bootstrap for the community (and put it in the community's hands to support). No matter what, we also need to accept that for some portion of users, this change will delay their upgrade (to change queries or go through UAT again) or keep them from upgrading entirely (if they decide they "need" to keep the old behavior).

I'm wondering whether the scripted similarity feature that we discussed a couple times could be a good workaround rather than creating a plugin. Reimplementing TF-IDF could actually be a good documentation example of it.

Hello, I fall into this category and I have empirical testing to back it up. Please contact me if you have any questions. For my data and application, TF-IDF in Elasticsearch produces a better MRR (mean reciprocal rank, a common IR score for search engines) than BM25. If TF-IDF is not retained, I will be stuck on version 5.
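
For readers unfamiliar with the metric mentioned above, here is a minimal Python sketch of how MRR is usually computed. The function name and the input layout (one ranked list of document IDs per query, plus a set of relevant IDs per query) are assumptions made for illustration; this is not the commenter's actual evaluation code.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the first relevant hit, taken over all queries.

    rankings: list of ranked doc-id lists, one per query (hypothetical layout)
    relevant: list of sets of relevant doc ids, aligned with `rankings`
    """
    if not rankings:
        return 0.0
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    # Queries with no relevant hit contribute 0 to the sum.
    return total / len(rankings)


if __name__ == "__main__":
    rankings = [["a", "b", "c"], ["x", "y", "z"]]
    relevant = [{"b"}, {"x"}]
    print(mean_reciprocal_rank(rankings, relevant))
```

Running the same query set against an index using `classic` and one using BM25, then comparing the two MRR values, is the kind of comparison described in this comment.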

> There are people who are convinced TF-IDF is better for their data+queries than BM25. I've heard this a number of times: sometimes the user has done a lot of testing and shown that TF-IDF is better in one way or another for them, and sometimes not. I also believe TF-IDF has more literature around it, is taught in more schools, and is generally a bit easier to understand (the formula is more or less in the name), so there may be some resistance purely due to how well understood the two are relative to one another.

We will make this deprecation smoother by adding a scripted similarity: https://github.com/elastic/elasticsearch/pull/25831.
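
As a rough illustration of what such a scripted similarity could look like, the Python sketch below builds index settings containing a `scripted` similarity whose Painless script approximates the per-term weight of Lucene's classic TF-IDF. The similarity name `scripted_tfidf`, the helper function names, and the exact settings layout are assumptions for illustration, modeled on the TF-IDF example in the Elasticsearch similarity documentation. It does not reproduce coord or query normalization.

```python
import json
import math

# Painless source for the scripted similarity. The variables used here
# (doc.freq, doc.length, field.docCount, term.docFreq, query.boost) are
# assumed to be the ones exposed to similarity scripts.
TFIDF_SCRIPT = (
    "double tf = Math.sqrt(doc.freq); "
    "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; "
    "double norm = 1 / Math.sqrt(doc.length); "
    "return query.boost * tf * idf * norm;"
)


def scripted_tfidf_settings(name="scripted_tfidf"):
    """Build a create-index body that registers the scripted similarity (hypothetical helper)."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    name: {"type": "scripted", "script": {"source": TFIDF_SCRIPT}}
                }
            }
        }
    }


def tfidf_term_score(freq, doc_length, doc_count, doc_freq, boost=1.0):
    """Python mirror of the script above, handy for checking expected scores offline."""
    tf = math.sqrt(freq)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    norm = 1.0 / math.sqrt(doc_length)
    return boost * tf * idf * norm


if __name__ == "__main__":
    print(json.dumps(scripted_tfidf_settings(), indent=2))
    print(tfidf_term_score(freq=4, doc_length=4, doc_count=9, doc_freq=9))
```

A field would then reference the similarity by name in its mapping. Because the script only sees per-term statistics, anything computed across clauses of a boolean query is out of its reach.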

@jpountz I'm not entirely sure about the current state of this issue, is the "discuss" label still appropriate? What are the next steps if not?
cc @elastic/es-search-aggs

Sorry to leave a comment on an older issue, but I was wondering how the "scripts" solution would help replace the missing coordination factors? It seems that to reproduce scores aligned with how the previous classic similarity calculated them, the coordination factors would be essential, and as far as I understand, those were calculated in the boolean weights. Thank you.

This feature of Lucene's TF-IDF similarity can't be reimplemented with a script indeed. For the record, note that this isn't part of the official definition of TF-IDF but something that has been added on top in order to work around the fact that the TF weighting would allow a document that contains many occurrences of a single query term to score better than documents that contain all query terms. This is no longer an issue with BM25 (and most other modern similarities) whose TF weighting is saturated.
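
The saturation argument can be illustrated with a toy Python comparison. This is a sketch under simplifying assumptions: IDF and length normalization are fixed at 1 for every term, BM25 uses the textbook form with `k1 = 1.2`, coord is modeled as matched terms divided by query terms, and all function names are hypothetical. It compares a document that repeats one of three query terms 16 times with a document that contains each query term once.

```python
import math


def classic_tf(freq):
    # Classic similarity weights term frequency as sqrt(freq), which is unbounded.
    return math.sqrt(freq)


def bm25_tf(freq, k1=1.2):
    # Textbook BM25 term-frequency component without length normalization.
    # It approaches k1 + 1 as freq grows, so a single term saturates.
    return freq * (k1 + 1) / (freq + k1)


def score(term_freqs, tf_fn, use_coord=False):
    """Sum per-term weights; optionally multiply by a coord-style factor."""
    matched = [f for f in term_freqs if f > 0]
    total = sum(tf_fn(f) for f in matched)
    if use_coord:
        total *= len(matched) / len(term_freqs)  # matched terms / query terms
    return total


stuffed = [16, 0, 0]   # one query term repeated 16 times
balanced = [1, 1, 1]   # every query term present once

if __name__ == "__main__":
    for label, fn, coord in [
        ("classic, no coord", classic_tf, False),
        ("classic, with coord", classic_tf, True),
        ("bm25", bm25_tf, False),
    ]:
        print(label, score(stuffed, fn, coord), score(balanced, fn, coord))
```

In this toy setup the stuffed document wins under the square-root weighting alone, loses once the coord factor is applied, and also loses under BM25 without any coord, because a single saturated term cannot exceed `k1 + 1`.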

Thanks a lot @jpountz for the reply. I understand that coords were a workaround for the fact that TF-IDF isn't great at favoring documents that match all terms over a document that only matches a single term multiple times. I also understand that BM25 has better TF saturation, which naturally helps with those scenarios. Unfortunately, I have the challenging goal of customizing the classic similarity algorithm in ES 7 (Lucene 8) such that the resulting scores are the same as in ES 5 (Lucene 6, when coords were still a thing). I was originally hoping that coords could somehow be re-implemented externally (or through the scripting features), but the concept of coords is so entangled with the various Lucene scoring classes (boolean scorers, weights, parsers) that it's proven to be much more challenging than I originally hoped.

Replicating version 5 scoring in recent versions is not possible.
