Elasticsearch: using size restriction on suggest returns unexpected number of documents

Created on 7 Feb 2017 · 13 Comments · Source: elastic/elasticsearch

Elasticsearch version:
5.1.1, 5.1.2, 5.2

Plugins installed: [None]

JVM version: 1.8.0_77-b03

OS version: Red Hat Enterprise 7.3 & Ubuntu 16.04

Description of the problem including expected versus actual behavior:
We have a set of documents added to a completion index,
e.g. {"fingerprint":"all dam star","suggestions":[{"input":"all star damen","weight":1800}]}
At least 10 documents match a suggest request (http://host:9200/test/_suggest),
e.g. {"suggest": {"prefix": "al", "completion": {"size": 10,"field": "suggestions"}}}
but not all 10 documents are returned.

Steps to reproduce:

  1. Unzip suggest_bug.zip to a local folder.
  2. Create an index with the mappings:
    curl -XPUT -s -o /dev/null http://localhost:9200/test/ --data-binary @mapping_beauty.json
  3. Add the test data:
    curl -XPUT -s -o /dev/null http://localhost:9200/_bulk --data-binary @test.json.kaputt
  4. Make the suggest call:
    curl -XPOST -s http://localhost:9200/test/_suggest --data-binary '{"suggest": {"prefix": "all","completion": {"size": 10,"field": "suggestions"}}}'

This returns 9 suggestions instead of 10.

(There is a suggest_test.sh script to automate this. Just set a link with ln -sf test.json test.json.kaputt and call the script.)

We added another file with test data. In principle it's the same data, but with a different weight for one row. If you use this data instead, the call returns 10 suggestions. (ln -sf test.json test.json.heil && sh suggest_test.sh)

Cheers Josh


All 13 comments

@tunichgud it is much easier if you provide a simple recreation, reduced to the minimum example that demonstrates the problem, instead of making me try to understand your test code first.

Here's a recreation:

PUT christians_test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "keywordLowercase": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "completion": {
      "properties": {
        "suggestions": {
          "type": "completion",
          "analyzer": "keywordLowercase",
          "preserve_separators": true,
          "preserve_position_increments": true,
          "max_input_length": 50
        }
      }
    }
  }
}


POST _bulk
{"index":{"_index":"christians_test","_type":"completion","_id":1}}
{"suggestions":[{"input":"all star damen","weight":1800}]}
{"index":{"_index":"christians_test","_type":"completion","_id":6}}
{"suggestions":[{"input":"all-in-one pc","weight":155500},{"input":"pc all in one","weight":155501},{"input":"all in one pc","weight":155502}]}


POST /christians_test/_search
{
  "_source": false,
  "suggest": {
    "foo": {
      "prefix": "al",
      "completion": {
        "size": 2,
        "field": "suggestions"
      }
    }
  }
}

The request above returns only the second document (ID 6). It appears that the size parameter counts the number of matched inputs rather than the number of matched documents.
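
The effect can be imitated outside Elasticsearch. The following Python sketch is a toy model, not Elasticsearch code: it assumes the suggester takes the `size` highest-weighted matching inputs and only afterwards groups them by document ID. The helper name `suggest_grouped` is made up for illustration, and plain lowercase prefix matching stands in for the `keywordLowercase` analyzer.

```python
# Toy model of the behaviour described above; not Elasticsearch code.
# Each entry is (doc_id, input, weight), mirroring the bulk request.
ENTRIES = [
    (1, "all star damen", 1800),
    (6, "all-in-one pc", 155500),
    (6, "pc all in one", 155501),
    (6, "all in one pc", 155502),
]


def suggest_grouped(entries, prefix, size):
    """Take the `size` best-weighted matching inputs, then group by doc ID."""
    # Assumption: simple lowercase prefix match instead of the real analyzer.
    matches = [e for e in entries if e[1].lower().startswith(prefix.lower())]
    top = sorted(matches, key=lambda e: e[2], reverse=True)[:size]
    grouped = {}
    for doc_id, text, _weight in top:
        grouped.setdefault(doc_id, []).append(text)
    return grouped


result = suggest_grouped(ENTRIES, "al", size=2)
print(result)  # {6: ['all in one pc', 'all-in-one pc']}
```

Both of the top two inputs belong to document 6, so a single document comes back even though two documents match the prefix.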

Hi @clintongormley,
good to hear you were able to reproduce the problem. We had a rough time finding it and reproducing it with 10 documents, so sorry for not being able to provide a simpler reproduction.

By the way:
Did you really execute a query against "_search"? We were only able to reproduce the problem using "_suggest".

Cheers

Did you really execute a query against "_search"? We were only able to reproduce the problem using "_suggest".

The _suggest API has been deprecated in favour of using _search
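
For readers moving the original request over, here is a minimal Python sketch of how the body changes shape. It assumes the body from the report, where "suggest" is simply the name given to the suggestion; the variable names are illustrative only, and `"_source": False` is borrowed from the recreation above.

```python
import json

# Body as sent to the deprecated _suggest endpoint in the report;
# here "suggest" is merely the name chosen for the suggestion.
suggest_body = {
    "suggest": {
        "prefix": "all",
        "completion": {"size": 10, "field": "suggestions"},
    }
}

# For _search, the named suggestions move under a top-level "suggest" key.
search_body = {"_source": False, "suggest": suggest_body}

print(json.dumps(search_body, indent=2))
```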

This is caused by the grouping on docIDs inside CompletionSuggester-->TopDocumentsCollector. The NRTSuggester gives back 10 suggestions; however, 2 of these 10 belong to the same document and are therefore grouped together. That's why the output has 9 suggestions instead of 10.

I'm curious why we require grouping based on docIDs, as I see it creating a lot of problems.

I'm only interested in getting one suggestion per document, and N documents. I don't mind how we get there.

I'm only interested in getting one suggestion per document, and N documents.

This is my expectation too.

I feel that one suggestion per document will be a problem. Shouldn't N be the number of suggestions rather than the number of documents? Consider the example below:

"Doc1"
{
  "suggestions": [
    {
      "input": "Hotel-1"
    },
    {
      "input": "Hotel-2"
    }
  ]
}

"Doc2"
{
  "suggestions": [
    {
      "input": "Hotel-3"
    },
    {
      "input": "Hotel-4"
    }
  ]
} 

In the example above, a suggest query with "prefix: Hotel-" should return all four suggestions, i.e. Hotel-1, Hotel-2, Hotel-3 and Hotel-4. But if we return one suggestion per document, the output might be Hotel-1 and Hotel-3, discarding Hotel-2 and Hotel-4, which would be wrong as far as suggestions are concerned.

"Doc1"
{
  "suggestions": [
    {
      "input": "Hotel-1",
      "weight": 10
    },
    {
      "input": "Hotel-2",
      "weight": 5
    }
  ]
}

"Doc2"
{
  "suggestions": [
    {
      "input": "Hotel-3",
      "weight": 6
    },
    {
      "input": "Hotel-4",
      "weight": 19
    }
  ]
} 

The weight is used both locally and globally:
globally to define the sort order of the suggestions, and locally to pick the winning input to display per document.

In my example this would lead to the results "Hotel-4" and "Hotel-1".

Otherwise we would not need to group those inputs per document in the first place.
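
A small Python sketch of this proposal, using the weighted Hotel example. It is an illustration of the idea only, not how Elasticsearch implements anything; the function name `best_per_document` is hypothetical.

```python
# Sketch of the "one suggestion per document" proposal; not Elasticsearch code.
DOCS = {
    "Doc1": [("Hotel-1", 10), ("Hotel-2", 5)],
    "Doc2": [("Hotel-3", 6), ("Hotel-4", 19)],
}


def best_per_document(docs, prefix, size):
    """Pick the heaviest matching input per document, then rank documents."""
    winners = []
    for doc_id, inputs in docs.items():
        matching = [(text, w) for text, w in inputs if text.startswith(prefix)]
        if matching:
            # Local use of the weight: choose the winner inside one document.
            text, weight = max(matching, key=lambda item: item[1])
            winners.append((doc_id, text, weight))
    # Global use of the weight: order the per-document winners.
    winners.sort(key=lambda item: item[2], reverse=True)
    return winners[:size]


ranked = best_per_document(DOCS, "Hotel-", size=2)
print(ranked)  # [('Doc2', 'Hotel-4', 19), ('Doc1', 'Hotel-1', 10)]
```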

Multiple suggestions per document are used to indicate variations of the same entity (e.g. The Beatles and Beatles), not distinct entities like hotel-1 and hotel-2.

@clintongormley got it.

cc @elastic/es-search-aggs

The completion suggester is document-based, but it doesn't deduplicate suggestions that come from the same document. As a result, the size option refers to a number of suggestions and not a number of documents. In practice most documents have a single variation, so this is not an issue, but if you define multiple suggestions for the same document that share some prefixes then you might run into this problem. We discussed this internally and decided to add a heuristic that deduplicates suggestions when we count the number of results. We cannot know in advance the number of duplicates in the index, so we would need to visit more paths (suggestions) in order to limit the number of duplicates. The number of extra paths that we can visit could be limited by a new option (shard_size, for instance) in order to keep the search fast.
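
One way to picture such a heuristic is the Python sketch below. It is a guess at the general shape, not the implementation that was eventually written: `suggest_dedup` is a hypothetical name, `shard_size` follows the option name floated in the comment, and sorting all matches up front is a simplification of walking the suggester's paths lazily.

```python
# Toy model of deduplicating by document with a bounded number of
# visited paths; not Elasticsearch code.
ENTRIES = [
    (1, "all star damen", 1800),
    (6, "all-in-one pc", 155500),
    (6, "pc all in one", 155501),
    (6, "all in one pc", 155502),
]


def suggest_dedup(entries, prefix, size, shard_size):
    """Return up to `size` documents, visiting at most `shard_size` paths."""
    matches = [e for e in entries if e[1].lower().startswith(prefix.lower())]
    ranked = sorted(matches, key=lambda e: e[2], reverse=True)
    seen = {}
    for doc_id, text, weight in ranked[:shard_size]:
        # Keep only the best-weighted input for each document.
        if doc_id not in seen:
            seen[doc_id] = (text, weight)
        if len(seen) == size:
            break
    return seen


print(suggest_dedup(ENTRIES, "al", size=2, shard_size=2))
# {6: ('all in one pc', 155502)}
print(suggest_dedup(ENTRIES, "al", size=2, shard_size=3))
# {6: ('all in one pc', 155502), 1: ('all star damen', 1800)}
```

With `shard_size` equal to `size` the duplicate still crowds out document 1; allowing one extra path is enough here to fill both slots, which is the trade-off between completeness and speed that the comment describes.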
