Elasticsearch: Completion suggester with fuzziness weird scoring

Created on 26 May 2017 · 12 Comments · Source: elastic/elasticsearch

Elasticsearch version:

   "version": {
      "number": "5.1.1",
      "build_hash": "5395e21",
      "build_date": "2016-12-06T12:36:15.409Z",
      "build_snapshot": false,
      "lucene_version": "6.3.0"
   }

Plugins installed: []

JVM version (java -version):
java version "1.8.0_112"
Java(TM) SE Runtime Environment (build 1.8.0_112-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.112-b16, mixed mode)

OS version (uname -a if on a Unix-like system):
OS X (Darwin Kernel Version 16.6.0)

Description of the problem including expected versus actual behavior:

Elasticsearch Completion Suggester documentation states:

Suggestions that share the longest prefix to the query prefix will be scored higher.

A quick test shows that searching for headi gives the same score to all three suggestions:

heading - 4
head    - 4
header  - 4

_With fuzziness 2, max score will be 3_

After some further testing, it seems the maximum relevance score can only be prefix length - fuzziness, which is strange.

I would expect heading to have a higher score since it shares 5 letters with the requested prefix.
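
To make the observation concrete, here is a minimal sketch of the score pattern reported above. It is only a model of what the results show, assuming default weights; it is not Lucene's or Elasticsearch's actual scoring code, and the function name `observedFuzzyScore` is made up for illustration.

```javascript
// Model of the scores observed in this report (NOT the real implementation):
// with fuzziness enabled, every suggestion gets prefix length - fuzziness.
function observedFuzzyScore(queryPrefix, fuzziness) {
  return queryPrefix.length - fuzziness;
}

// "headi" has 5 characters, so head, header and heading all tie.
for (const fuzziness of [1, 2]) {
  console.log(`fuzziness ${fuzziness}:`, observedFuzzyScore("headi", fuzziness));
}
```

Note that the suggestion text does not appear in the function at all, which is exactly why the three suggestions cannot be told apart by score.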

Steps to reproduce:

Mapping

    DELETE eg
    PUT eg
    {
       "mappings": {
          "eg": {
             "properties": {
                "complete": {
                   "type": "completion"
                }
             }
          }
       },
       "settings": {
          "index": {
             "number_of_shards": "1"
          }
       }
    }

Data

    PUT eg/eg/1
    { "complete": ["head"] }
    PUT eg/eg/2
    { "complete": ["heading"] }
    PUT eg/eg/3
    { "complete": ["header"] }

Query

    POST eg/_search
    {
        "suggest": {
            "autocomplete": {
                "prefix": "headi",
                "completion": {
                    "field": "complete",
                    "fuzzy": {
                        "fuzziness": 1
                    }
                }
            }
        }
    }

Results

    {
    ...
    "autocomplete": [
       {
          "text": "headi",
          "offset": 0,
          "length": 5,
          "options": [
             {
                "text": "head",
                "_score": 4,
                ...
             },
             {
                "text": "header",
                "_score": 4,
                ...
             },
             {
                "text": "heading",
                "_score": 4,
                ...
             }
          ]
       }
    ]
    }

:Search/Suggesters >bug

gabriel-letarte

👍18

Most helpful comment

I "fixed" this problem partially with a query and partially in code (JS). To continue the "Bonn" example:

First, I use two suggest query elements, each with a max size of 5 items. One element does a fuzzy search, the other a non-fuzzy search. When I get the result (max 10 items), I merge them in JS, placing the exact results before the fuzzy results and filtering out duplicates.

{
    'suggest': {
        'autocomplete_fuzzy': {
            'prefix': 'Bonn',
            'completion': {
                'field': 'suggest',
                'fuzzy': { 'fuzziness': 2 },
                'size': 5
            }
        },
        'autocomplete': {
            'prefix': 'Bonn',
            'completion': {
                'field': 'suggest',
                'size': 5
            }
        },
    },
}

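// Collect exact (non-fuzzy) matches first so they rank ahead of fuzzy ones.
const ids = [];
const suggestions = [];

results.suggest.autocomplete[0].options.forEach(function (suggestion) {
  ids.push(suggestion._id);
  suggestions.push(suggestion);
});

// Append fuzzy matches, skipping documents already returned as exact matches.
results.suggest.autocomplete_fuzzy[0].options.forEach(function (suggestion) {
  if (ids.indexOf(suggestion._id) === -1) {
    suggestions.push(suggestion);
  }
});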

I hope it helps.

sanderlissenburg on 23 Apr 2019

👍5

All 12 comments

Reproduced as well.

I think the expected behavior should be that, all else being equal, a result item's score is the length of its longest "exact" matching prefix, regardless of the fuzziness setting, whenever that value is higher than prefix length - fuzziness.

One side effect of this is that if I query for promo and my index holds two documents, prom and promo, the latter is scored higher and comes back as the first result (which is more intuitive).

One can then control tie-breaking, or otherwise interact with this logic, using the index-time weight as usual.
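
The proposal above could be sketched roughly as follows. This is only an illustration of the requested rule, not something the suggester does today; the helper names `sharedPrefixLength` and `proposedScore` are hypothetical, and weights are ignored for simplicity.

```javascript
// Length of the longest common leading substring of two strings.
function sharedPrefixLength(a, b) {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) {
    i++;
  }
  return i;
}

// Proposed rule (hypothetical): use the exact shared prefix length
// whenever it beats the current "prefix length - fuzziness" value.
function proposedScore(queryPrefix, suggestion, fuzziness) {
  const fuzzyFloor = queryPrefix.length - fuzziness;
  return Math.max(sharedPrefixLength(queryPrefix, suggestion), fuzzyFloor);
}

console.log(proposedScore("promo", "prom", 1));  // exact prefix 4, floor 4
console.log(proposedScore("promo", "promo", 1)); // exact prefix 5 wins
```

Under this rule, promo would outrank prom for the query promo, and heading would outrank head and header for the query headi.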

adamhadani on 23 Jan 2018

cc @elastic/es-search-aggs

jimczi on 19 Mar 2018

Currently having the exact same issue. Any word on this?

PepijnDeWachter on 10 Sep 2018

Can reproduce using 6.2.3:

    "version": {
        "number": "6.2.3",
        "build_hash": "c59ff00",
        "build_date": "2018-03-13T10:06:29.741383Z",
        "build_snapshot": false,
        "lucene_version": "7.2.1",
        "minimum_wire_compatibility_version": "5.6.0",
        "minimum_index_compatibility_version": "5.0.0"
    },

Hope this gets fixed soon, as it's becoming a big headache for our stakeholders and users.

cleentfaar on 30 Oct 2018

Same here. For instance, when I search for Bonn, Bohlen is returned before Bonn.

Query

{
    "suggest": {
        "name-fuzzy-suggest": {
            "prefix": "Bonn",
            "completion": {
                "field": "suggest",
                "fuzzy": {"fuzziness": 2}
            }
        }
    }
}

Result

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": 0,
        "hits": []
    },
    "suggest": {
        "name-fuzzy-suggest": [
            {
                "text": "Bonn",
                "offset": 0,
                "length": 4,
                "options": [
                    {
                        "text": "Bohlen",
                        "_index": "test_it_1540912066",
                        "_type": "test_it",
                        "_id": "e53963c2-d904-5486-b4c0-9cd7c6faf497",
                        "_score": 2,
                        "_source": {
                            "name": "Böhlen",
                            "suggest": {
                                "input": [
                                    "Böhlen",
                                    "Bohlen"
                                ]
                            }
                        }
                    },
                    {
                        "text": "Boizenburg",
                        "_index": "test_it_1540912066",
                        "_type": "test_it",
                        "_id": "ab36d626-be7d-5ee1-a36c-7417ed5a6896",
                        "_score": 2,
                        "_source": {
                            "name": "Boizenburg",
                            "suggest": {
                                "input": [
                                    "Boizenburg",
                                    "Boizenburg"
                                ]
                            }
                        }
                    },
                    {
                        "text": "Bonn",
                        "_index": "test_it_1540912066",
                        "_type": "test_it",
                        "_id": "8fe4dbd2-0806-5bda-8602-19ad4757f11a",
                        "_score": 2,
                        "_source": {
                            "name": "Bonn",
                            "suggest": {
                                "input": [
                                    "Bonn",
                                    "Bonn"
                                ]
                            }
                        }
                    },
                    {
                        "text": "Bonnigheim",
                        "_index": "test_it_1540912066",
                        "_type": "test_it",
                        "_id": "4902127c-5422-5c2b-a596-97408fadbae5",
                        "_score": 2,
                        "_source": {
                            "name": "Bönnigheim",
                            "suggest": {
                                "input": [
                                    "Bönnigheim",
                                    "Bonnigheim"
                                ]
                            }
                        }
                    }
                ]
            }
        ]
    }
}

sanderlissenburg on 30 Oct 2018

I'm using 6.5.3 and still have the issue. It doesn't look good. Is there any workaround?

narmadham on 8 Feb 2019

Same with 6.6. This issue can be traced back to 2014 (https://github.com/elastic/elasticsearch/issues/7060).

jalberto on 28 Feb 2019

👍1

Same with 5.6.
It seems to be a Lucene issue. Does anybody know if this is already patched, or is there maybe a workaround?
Thank you.

crisk on 23 Apr 2019

For prefix queries that involve fuzziness, the completion suggester finds all minimal prefix paths that intersect with the suggestions and computes a boost per path equal to the length of the shared prefix. For instance, with a fuzziness of 1 the prefix headi finds head as the minimal prefix that matches heading, head and header; with a fuzziness of 2 it finds hea. Since suggestions are always visited in order of their weight, we cannot compute the boost from the final input; it is always computed from the minimal prefix. I agree that this can be misleading, but the limitation is needed to ensure that queries always return the best weights while remaining efficient.
One possible workaround is to run multiple suggestion queries, one per fuzziness value, and to rerank the results client side. This will be more efficient than trying to assign a specific boost per output, so I am closing this issue.
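
A rough sketch of that workaround, in case it helps: one named suggester per fuzziness value in a single request, then a client-side merge that prefers the lowest fuzziness at which each document matched. The suggester names (`autocomplete_f0`, `autocomplete_f1`, ...), the helper names and the `matchedFuzziness` property are assumptions made up for this example; the `prefix`, `completion`, `fuzzy`, `field` and `size` keys are the ones used in the queries earlier in this thread.

```javascript
// Build one request body with a suggester per fuzziness level.
// Level 0 is a plain (non-fuzzy) completion query.
function buildSuggestBody(prefix, field, fuzzinessLevels, size) {
  const suggest = {};
  for (const level of fuzzinessLevels) {
    const completion = { field, size };
    if (level > 0) {
      completion.fuzzy = { fuzziness: level };
    }
    suggest[`autocomplete_f${level}`] = { prefix, completion };
  }
  return { suggest };
}

// Merge the `suggest` part of the response: options found at a lower
// fuzziness come first, and each document (_id) is kept only once.
function rerankByFuzziness(suggestResponse, fuzzinessLevels) {
  const seen = new Set();
  const merged = [];
  const ascending = [...fuzzinessLevels].sort((a, b) => a - b);
  for (const level of ascending) {
    const entries = suggestResponse[`autocomplete_f${level}`] || [];
    const options = entries.length > 0 ? entries[0].options : [];
    for (const option of options) {
      if (!seen.has(option._id)) {
        seen.add(option._id);
        merged.push({ ...option, matchedFuzziness: level });
      }
    }
  }
  return merged;
}
```

Usage would be something like sending `buildSuggestBody("Bonn", "suggest", [0, 1, 2], 5)` to `_search` and passing `response.suggest` to `rerankByFuzziness`. Within one fuzziness level the order returned by Elasticsearch (by weight) is left untouched.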

jimczi on 23 Apr 2019

👍1

I've worked around this by using the weight attribute on the suggest
field: every time a user selects a value from the completion suggester
and submits the form in which the suggester was used, I increment the
weight attribute.

If you're already using the weight attribute for something else this
might not work in your case, but in my case that was no issue.

Using this workaround means I don't need to combine multiple responses
and only do one query.

The biggest downside of this method is that the problem isn't solved
while you don't have any 'usage' data. Depending on how many values you
have in the suggester, and whether the data has an expected popularity
curve, you might not find a benefit in this workaround.
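
For illustration only, the weight bump could look something like the sketch below. It assumes the completion field is called `suggest` and is stored in object form with `input` and `weight` (as in the Bonn example earlier in the thread), and that a missing weight counts as 1. The function name `bumpSuggestWeight` is hypothetical, and how the updated document gets written back to the index is left to the caller.

```javascript
// Return a copy of the document source with the completion weight
// increased by one. The input object is not modified.
// Assumption: a missing or non-integer weight is treated as 1.
function bumpSuggestWeight(source, field = "suggest") {
  const current = source[field];
  const weight = Number.isInteger(current.weight) ? current.weight : 1;
  return { ...source, [field]: { ...current, weight: weight + 1 } };
}

const selected = { name: "Bonn", suggest: { input: ["Bonn"], weight: 3 } };
console.log(bumpSuggestWeight(selected).suggest.weight); // one more than before
```

Since suggestions are visited in order of weight, values that users pick often drift to the top of both fuzzy and non-fuzzy results over time.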

PepijnDeWachter on 23 Apr 2019

👍1

Quick lodash solution for obtaining unique suggestions when running multiple suggestion queries:

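const _ = require("lodash");

const { autocomplete, autocomplete_fuzzy } = responseObj.data.suggest;
// Exact matches go first. Drop _score, which can differ between the two
// suggesters for the same hit, so that duplicates compare as equal.
const mergedSuggestions = _.concat(
  autocomplete[0].options,
  autocomplete_fuzzy[0].options
).map(suggestion => {
  _.unset(suggestion, "_score");
  return suggestion;
});
const uniqueSuggestions = _.uniqWith(mergedSuggestions, _.isEqual);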