Elasticsearch version: 5.1.2
Plugins installed: [] (no plugins installed)
JVM version: OpenJDK 64-Bit Server VM/1.8.0_111/25.111-b14
OS version: macOS Sierra 10.12.2 (16C67)
Description of the problem including expected versus actual behavior:
Despite #9103, match/multi_match ranks misspellings over exact matches under some fairly specific conditions.
Steps to reproduce:
```
DELETE /t

PUT /t
{
  "settings": {
    "number_of_shards": 1
  }
}

# You must have at least one non-matching document to trigger the problem.
# The more non-matching documents in the index, the worse the score difference.
POST /t/t
{"foo": "foo bar"}

POST /t/t
{"foo": "aab abb"}

POST /t/t
{"foo": "aab abb"}

POST /t/t
{"foo": "aae abb"}

GET /_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "aae abb",
            "fields": ["foo"],
            "fuzziness": "AUTO"
          }
        }
      ]
    }
  }
}
```
I expect:
```
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.94192845,
    "hits": [
      {
        "_index": "t",
        "_type": "t",
        "_id": "AVnLTHvstUtjT09CXNTY",
        "_score": 0.94192845,
        "_source": {
          "foo": "aae abb"
        }
      },
      {
        "_index": "t",
        "_type": "t",
        "_id": "AVnLTHsZtUtjT09CXNTW",
        "_score": 0.73462508,
        "_source": {
          "foo": "aab abb"
        }
      },
      {
        "_index": "t",
        "_type": "t",
        "_id": "AVnLTHuGtUtjT09CXNTX",
        "_score": 0.73462508,
        "_source": {
          "foo": "aab abb"
        }
      }
    ]
  }
}
```
I get:
```
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.9479706,
    "hits": [
      {
        "_index": "t",
        "_type": "t",
        "_id": "AVnLTHsZtUtjT09CXNTW",
        "_score": 0.9479706,
        "_source": {
          "foo": "aab abb"
        }
      },
      {
        "_index": "t",
        "_type": "t",
        "_id": "AVnLTHuGtUtjT09CXNTX",
        "_score": 0.9479706,
        "_source": {
          "foo": "aab abb"
        }
      },
      {
        "_index": "t",
        "_type": "t",
        "_id": "AVnLTHvstUtjT09CXNTY",
        "_score": 0.94192845,
        "_source": {
          "foo": "aae abb"
        }
      }
    ]
  }
}
```
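For reference, the per-document score breakdowns discussed next come from running the same query with explain enabled:

```
GET /_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "aae abb",
            "fields": ["foo"],
            "fuzziness": "AUTO"
          }
        }
      ]
    }
  }
}
```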
Looking at the explain output, the norms factor is always the same: 0.89722675 (I'll call it $norms).
For aab abb, the score of 0.9479706 seems to be:

```
(  0.6931472  [aab] * 0.6666666 [fuzzy of aae?] * $norms
 + 0.35667494 [aab] * 0.6666666 [fuzzy of abb?] * $norms
 + 0.35667494 [exact abb] * $norms )
```

whereas for aae abb I get a score of 0.94192845 as:

```
(  0.6931472  [exact aae] * $norms
 + 0.35667494 [exact abb] * $norms )
```
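Working through the arithmetic with those values reproduces all three scores, including the expected 0.73462508 (plain arithmetic on the explain values, nothing else assumed):

```
"aab abb" (actual):   (0.6931472*0.6666666 + 0.35667494*0.6666666 + 0.35667494) * 0.89722675 ≈ 0.9479706
"aae abb" (actual):   (0.6931472 + 0.35667494) * 0.89722675                                  ≈ 0.94192845
"aab abb" (expected): (0.6931472*0.6666666 + 0.35667494) * 0.89722675                        ≈ 0.73462508
```

The only difference between the actual and expected "aab abb" scores is whether the document term aab is counted once or twice.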
Fuzziness is behaving correctly here. The idea is that words will be spelled correctly most of the time, with a few misspellings, so with fuzziness we give a slight edge to terms that appear more frequently, as this is more likely to be the correct spelling.
If your correct spelling is the only occurrence in the document collection, then it is going to rank more poorly.
Let me rephrase so I am sure I understand:
You are saying that the reason abb gets a score for fuzzy aab _and_ exact abb, while aae gets a score only for exact aae and not for fuzzy aab, is to give the edge to the term appearing most frequently, even if it is a fuzzy match?
This effectively overrides an exact match with a fuzzy match, which runs counter to what both #9105 (match/multi_match + fuzzy replaces FLT) and #5883 (don't rank fuzzy above exact match) were saying...
Is there a way to configure this behaviour?
With FLT there was an option to disable TF/IDF for specific use cases. I would like an option to always rank exact matches over fuzzy matches.
My exact use case is searching for cars, and this bug breaks the search for Porsche cars, ranking the 911 type 991 over the 911 type 997 for people who searched for 911 997. We have been stuck on 1.7.x ever since the fuzzy_like_this query was removed, and I was really looking forward to getting back on the upgrade train with #9103 fixed...
The more I think about this, the worse it sounds:
"A correctly spelled but rarer term is less relevant than a misspelled but more frequent term"? That's one hell of a confirmation bias... you would never see rarer results shadowed by a more frequent spelling. (In my specific car example I have over 40 "911 type 991" results before the first correctly spelled one; that would mean page 4 on Google...)
So please consider this a feature request to be able to tweak the relevance model.
@jeantil it doesn't work like that.
So this is working correctly. However, I think there is a different bug: the more fuzzy terms that match, the higher the score.
With explain, the score for the first two matching documents includes this:
```
{
  "value": 0.5333638,
  "description": "sum of:",
  "details": [
    {
      "value": 0.21334553,
      "description": "weight(foo:aab in 1) [PerFieldSimilarity], result of:",
      "details": []
    },
    {
      "value": 0.3200183,
      "description": "weight(foo:abb in 1) [PerFieldSimilarity], result of:",
      "details": []
    }
  ]
}
```
While the last matching doc has just this:
```
{
  "value": 0.3200183,
  "description": "sum of:",
  "details": [
    {
      "value": 0.3200183,
      "description": "weight(foo:abb in 3) [PerFieldSimilarity], result of:",
      "details": []
    }
  ]
}
```
That's just wrong. Instead, we should be taking the max score for each of these fuzzy "synonyms".
@jimczi what do you think?
Oh I'm stupid - abb is matching both terms (aab and abb), that's why it is being summed.
Not a bug. Closing
Right, the query matches the fuzzy terms twice. It's more obvious with the validate API:
```
GET /_validate/query?explain&rewrite
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "aae abb",
            "fields": ["foo"],
            "fuzziness": "AUTO",
            "fuzzy_rewrite": "top_terms_blended_freqs_10"
          }
        }
      ]
    }
  }
}
```

which returns:

```
{
  "valid": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "explanations": [
    {
      "index": "t",
      "valid": true,
      "explanation": "((foo:aab)^0.6666666 foo:aae) ((foo:aab)^0.6666666 foo:abb)"
    }
  ]
}
```
The boost is computed based on the edit distance from the exact string.
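If I read the numbers right (an inference from this output, not something stated in the thread), each fuzzy expansion is boosted by 1 - editDistance / termLength:

```
aab vs aae: edit distance 1, term length 3 → 1 - 1/3 = 0.6666666
aab vs abb: edit distance 1, term length 3 → 1 - 1/3 = 0.6666666
```

which matches the ^0.6666666 boost on foo:aab in the rewritten query above.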
While I understand the current logic, I still argue that it effectively ends up ranking an exact match below a fuzzy match, because it matches the same input token twice against different document tokens, even when one of the document tokens is an exact match for a different input token.
Here abb in the input is an exact match for abb in the document, yet aab is fuzzy-matched against both aae and abb?
@clintongormley Is there any way to tweak the ranking algorithm to avoid scoring an input token against a document token which already has an exact match (or anything which would always result in full exact matches being ranked above a partial exact match plus fuzzy match, really)?
@jimczi the validate result seems very weird, as it doesn't account for the exact match abb<->abb. Does this stem from using the specific rewrite top_terms_blended_freqs_10?
@jeantil the validate query uses top_terms_blended_freqs_10, which is the default rewrite method (except that we keep the top 50 by default). If you want exact matches to always rank first, you can add the same match query without the fuzziness, with the operator set to AND, and boost that query with a ridiculously large number (see the sketch below). Though your example is really an edge case: if your language produces only three-letter words over a two-letter alphabet, I understand that you may have some problems with fuzziness, but that should not be a problem with real words. You can also set prefix_length > 0 in order to skip small words in the fuzzy scoring.
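A minimal sketch of that workaround against the toy index from the reproduction above; the boost value of 100 is an arbitrary "ridiculous number", not a recommendation:

```
GET /t/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "aae abb",
            "fields": ["foo"],
            "fuzziness": "AUTO"
          }
        },
        {
          "multi_match": {
            "query": "aae abb",
            "fields": ["foo"],
            "operator": "and",
            "boost": 100
          }
        }
      ]
    }
  }
}
```

Any document containing every query term exactly now dwarfs documents that only match through fuzzy expansions.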
> that should not be a problem with real words
Yet I did provide a very real-world example of why this issue is important to me. We have a search for cars from all manufacturers. You have probably heard of the Porsche 911 in the real world. Except that's a fairly old line of cars, so Porsche has created multiple versions of the 911.
According to our domain experts, people "in the know" talk about and search for these cars using 911 997. Guess what happens? Using ES 5.x, our search returns the 40 different motorisations of the 911 type 991 before the first 911 type 997.
This is not the only brand causing issues. BMW classifies cars as Serie 1, Serie 2, Serie 3, etc., then breaks each series into different variants: (E30) Berline, (E36) Berline, (E46) cabrio, etc.
But that's only in the catalog; on the cars they write 320d, 318i, 316xxx.
And people use those names to refer to their cars and therefore to search for them. I'll let you guess what happened to my test cases when I switched from 1.7.x with the now-retired Fuzzy Like This to 5.x with a fuzzy multi_match. At the same time, in the same index, we have brands with much longer names in which typos are easily found.
I took the time to create an easily reproducible test case, but it was driven by a very real-world issue.
By the way, our current (1.7.x) query is already heavily tweaked:
```
GET vehicle_fr_fr/cartype/_search?search_type=dfs_query_then_fetch
{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {
          "flt": {
            "fields": ["motor_search", "motor_keywords", "cartype_search", "cartype_keywords", "segment_search", "segment_keywords", "maker_search", "maker_keywords"],
            "like_text": "energy saver+",
            "fuzziness": "AUTO",
            "ignore_tf": true,
            "boost": 0.5
          }
        },
        {
          "constant_score": {
            "query": {
              "multi_match": {
                "query": "energy saver+",
                "fields": ["phrase"],
                "type": "phrase"
              }
            },
            "boost": 3.0
          }
        },
        {
          "constant_score": {
            "query": {
              "multi_match": {
                "query": "energy saver+",
                "fields": ["motor_search", "cartype_search", "segment_search", "maker_search"]
              }
            },
            "boost": 3.0
          }
        },
        {
          "constant_score": {
            "query": {
              "multi_match": {
                "query": "energy saver+",
                "fields": ["motor_keywords", "cartype_keywords", "segment_keywords", "maker_keywords"]
              }
            },
            "boost": 3.0
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": ["{"],
    "post_tags": ["}"],
    "require_field_match": true,
    "fields": {
      "phrase": { "type": "fvh" },
      "motor_search": { "type": "fvh" },
      "motor_keywords": { "type": "fvh" },
      "cartype_search": { "type": "fvh" },
      "cartype_keywords": { "type": "fvh" },
      "segment_search": { "type": "fvh" },
      "segment_keywords": { "type": "fvh" },
      "maker_search": { "type": "fvh" },
      "maker_keywords": { "type": "fvh" }
    }
  }
}
```
Sorry, my wording was not clear enough. You should not try to apply fuzziness to a three-letter word. If people search for 911 997 you can maybe make the words optional, but applying fuzziness to this query is problematic for scoring even when the max edit distance is 1.
I can see how "rank exact matches first" could be useful, but it would require a special query built for this purpose. Currently each clause is independent of the others, so exact matches can only be boosted at the word level. Having multiple clauses that match the same fuzzy word is an edge case that you can counterbalance by removing the problematic words, or by making them optional (see the sketch below).
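For instance, a sketch of "make the words optional" with fuzziness dropped entirely; the field name cartype_search comes from the query posted earlier, and minimum_should_match is spelled out only to make the optionality explicit:

```
GET vehicle_fr_fr/cartype/_search
{
  "query": {
    "multi_match": {
      "query": "911 997",
      "fields": ["cartype_search"],
      "operator": "or",
      "minimum_should_match": 1
    }
  }
}
```

A car matching both 911 and 997 exactly still outscores one matching only 911, and no fuzzy expansion can rewrite 997 into 991.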
We can continue the discussion on the forum if you want:
https://discuss.elastic.co/c/elasticsearch
You have some nice cars in your parking lot ;)
Thanks for taking the time. I fail to understand how I can "remove the problematic words", since they convey 100% of the information I am trying to retrieve. I have created a topic on the forum to continue this discussion:
https://discuss.elastic.co/t/fuzzy-query-ranks-misspellings-over-exact-for-repeated-close-tokens/72605
Thanks again for your help.