Elasticsearch: Remove the term suggester

Created on 6 Nov 2016 · 9 Comments · Source: elastic/elasticsearch

The term suggester is much less useful than the phrase suggester as it just considers each term independently, while the phrase suggester looks at co-occurring terms.

I see people using the term suggester, but I wonder if this is just because the phrase suggester configuration looks more intimidating. Perhaps we should improve the phrase suggester and remove the term suggester.

Thoughts?

:Search/Suggesters >deprecation Search help wanted

clintongormley

Most helpful comment

> it is about making good suggestions by taking the association between words into account.

Sometimes that behavior is not desired, hence the use of the term suggester.

> Without a shingled field, the phrase suggester falls back to behaving like the term suggester.

I think the issue is the information returned and the response format in which it is delivered. I'll do my best to explain with the following example:

PUT test/doc/1
{
  "food": "apple apricot banana bread beer carrot candy"
}

GET test/_suggest
{
  "term_suggest": {
    "text": "carot bananna",
    "term": {
      "field": "food"
    }
  },
  "phrase_suggest": {
    "text": "carot bananna",
    "phrase": {
      "field": "food"
    }
  }
}

{
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "phrase_suggest": [
    {
      "text": "carot bananna",
      "offset": 0,
      "length": 13,
      "options": [
        {
          "text": "carot banana",
          "score": 0.123355635
        },
        {
          "text": "carrot bananna",
          "score": 0.12118797
        }
      ]
    }
  ],
  "term_suggest": [
    {
      "text": "carot",
      "offset": 0,
      "length": 5,
      "options": [
        {
          "text": "carrot",
          "score": 0.8,
          "freq": 1
        }
      ]
    },
    {
      "text": "bananna",
      "offset": 6,
      "length": 7,
      "options": [
        {
          "text": "banana",
          "score": 0.8333333,
          "freq": 1
        }
      ]
    }
  ]
}

In the above example both terms are misspelled and there is no desire to have the terms be related (for example, a dev implementing Google's "did you mean?" style behavior). The big differences for me, as someone implementing suggestions in an app, are as follows:

- The phrase suggester does not return the individual term frequencies.
- The phrase suggester does not call out each term separately with its own offset/length information, which makes it difficult, when rendering results back to the user in the app, to attach links, hover-overs, and other information to the term that was suggested. Highlighting could be turned on in the phrase suggester, but then that has to be parsed and modified, which is not as easy as having the length/offsets supplied.
- The suggested text from the phrase suggester gives me two options, but neither is desired, as each one still contains a misspelled term.

Unless I'm misunderstanding the extent to which the phrase suggester would be modified (I assumed it would be the behavior, but not the response format), it is difficult for me to see how a modified phrase suggester would be able to solve the same problems as the term suggester. Again, sorry if I'm missing some big item here.
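
To make the offset/length point concrete, here is a minimal Python sketch of how a client could use the per-term entries from the term suggester response above to rebuild a corrected string. The function name `apply_term_suggestions` is hypothetical, and the sketch assumes each entry's options are already ordered best-first, as they appear in the response shown.

```python
from typing import Dict, List


def apply_term_suggestions(text: str, entries: List[Dict]) -> str:
    """Replace each flagged token with its first suggestion, using offset/length."""
    corrected = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entry in sorted(entries, key=lambda e: e["offset"], reverse=True):
        if not entry["options"]:
            continue  # no suggestion for this token: keep it unchanged
        start = entry["offset"]
        end = start + entry["length"]
        # Assumption: options are ordered best-first, as in the response above.
        best = entry["options"][0]["text"]
        corrected = corrected[:start] + best + corrected[end:]
    return corrected


# The "term_suggest" entries copied from the response above.
term_suggest = [
    {"text": "carot", "offset": 0, "length": 5,
     "options": [{"text": "carrot", "score": 0.8, "freq": 1}]},
    {"text": "bananna", "offset": 6, "length": 7,
     "options": [{"text": "banana", "score": 0.8333333, "freq": 1}]},
]

corrected = apply_term_suggestions("carot bananna", term_suggest)
print(corrected)
```

The same offsets could just as well drive per-term links or hover-overs instead of a plain replacement, which is the rendering use case described in the list above.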

djschny on 18 Nov 2016

👍3

All 9 comments

> Perhaps we should improve the phrase suggester and remove the term suggester.

At this point the phrase suggester effectively degrades into the term suggester if you don't set up the appropriate mappings. We could look for ways to make that degradation perform as well as the phrase suggester.

In 5.0 I improved the docs for the phrase suggester a bunch so we have an example of the mapping which should help.
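
For readers unfamiliar with the mapping being referred to, the following Python sketch builds the kind of index body that adds a shingled sub-field. It is an illustration under assumptions, not the example from the docs: the names `shingle_filter`, `shingle_analyzer`, and the `shingles` sub-field are hypothetical, the `food` field and `doc` type are borrowed from the example elsewhere in this thread, and the typed mapping follows the 5.x-era layout.

```python
import json

# Hypothetical names: "shingle_filter", "shingle_analyzer", and the "shingles" sub-field.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "shingle_filter": {
                    "type": "shingle",        # built-in shingle token filter
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "shingle_filter"],
                }
            },
        }
    },
    "mappings": {
        "doc": {  # typed mapping, as used in the 5.x era
            "properties": {
                "food": {
                    "type": "text",
                    "fields": {
                        # Sub-field the phrase suggester could point at: "food.shingles"
                        "shingles": {"type": "text", "analyzer": "shingle_analyzer"}
                    },
                }
            }
        }
    },
}

# Print the JSON that would be sent as the index creation body.
print(json.dumps(index_body, indent=2))
```

With a mapping along these lines, the phrase suggester's `field` would be set to the shingled sub-field rather than the plain one.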

nik9000 on 7 Nov 2016

I did see cases where the term suggester was being used just to see a part of all terms in a field. I don't know how else to do that without relying on aggregations.

martijnvg on 7 Nov 2016

The term suggester solves a different problem than the phrase suggester, for example when somebody wants to implement "did you mean" behavior and word position does not matter. So keeping it is important IMO.

Sorry if I'm missing it but I don't understand what the problem is or what is to be gained by removing the term suggester?

djschny on 12 Nov 2016

👍2

@djschny it is not about word position, it is about making good suggestions by taking the association between words into account. The term suggester just can't do that. Without a shingled field, the phrase suggester falls back to behaving like the term suggester.

> Sorry if I'm missing it but I don't understand what the problem is or what is to be gained by removing the term suggester?

It's extra code (with open bugs) that can be removed. I'd rather focus on making the phrase suggester better than fixing a redundant feature.

clintongormley on 18 Nov 2016

Discussed in FixItFriday: let's deprecate the term suggester in 5.x for removal in 6.0, and work on improving the API of the phrase suggester.

clintongormley on 18 Nov 2016

> Sometimes that behavior is not desired, hence the use of the term suggester.

When would this behaviour not be desired?

> In the above example both terms are misspelled and there is no desire to have the terms be related (for example a dev implementing Google's "did you mean?" style behavior).

This is exactly when you want the behaviour of the phrase suggester, not the term suggester. The phrase suggester returns meaningful suggestions.

> the suggested text from the phrase gives me two options, but neither option is desired as each one has a bad suggested term

That's because these suggestions only work with statistically significant amounts of data, not just toy examples. Also, if you use shingles (combined with real world amounts of data) you get much better suggestions. Btw, if you set max_errors to 2 (defaults to 1) then the first suggestion is the correctly spelled carrot banana. This is what I mean about improving the phrase suggester.
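
As a sketch of the `max_errors` change mentioned above, the Python snippet below builds the same phrase suggestion body used earlier in the thread with `max_errors` raised to 2. The helper name `phrase_suggest_body` is hypothetical; the body layout mirrors the `_suggest` request shown in this thread, and the result claimed above is not reproduced here.

```python
import json


def phrase_suggest_body(text: str, field: str, max_errors: float = 1) -> dict:
    """Build a phrase suggestion body in the shape used in this thread."""
    return {
        "phrase_suggest": {
            "text": text,
            "phrase": {
                "field": field,
                # Allow up to this many terms to be treated as misspellings.
                "max_errors": max_errors,
            },
        }
    }


body = phrase_suggest_body("carot bananna", "food", max_errors=2)
print(json.dumps(body, indent=2))
```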

clintongormley on 19 Nov 2016

Based on experiments I ran, the term suggester is more relevant than the phrase suggester for one-term searches. I made a mistake by typing r instead of t, so I asked both suggesters to show what they got. As you can see, the term suggester nailed it, while the phrase suggester is miles away from anything useful. I also tried falling back from the shingled field to a simple field for the phrase suggester, but that did not help at all. I've run multiple single-term searches such as kia and tesla, and the term suggester again correctly reported that the term is good, while the phrase suggester returned odd results like ia instead of kia (even though I have min_word_length = 3) and tells instead of tesla. I understand that with a huge dataset this might not be a problem, but we work with what we have: approx. 8MM documents.

{
  "phrase":[
    {
      "text":"oarh",
      "offset":0,
      "length":4,
      "options":[
        {
          "text":"each",
          "highlighted":"<em>each</em>",
          "score":0.012882852,
          "collate_match":true
        },
        {
          "text":"sarah",
          "highlighted":"<em>sarah</em>",
          "score":0.010489283,
          "collate_match":true
        },
        {
          "text":"oprah",
          "highlighted":"<em>oprah</em>",
          "score":0.009785955,
          "collate_match":true
        },
        ...
      ]
    }
  ],
  "term":[
    {
      "text":"oarh",
      "offset":0,
      "length":4,
      "options":[
        {
          "text":"oath",
          "score":0.75,
          "freq":1811
        },
        {
          "text":"oanh",
          "score":0.75,
          "freq":30
        },
        ...
      ]
    }
  ]
}
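
In the term suggester output above, the two visible options share the same score and differ only in `freq`. The Python sketch below makes that ordering explicit on the client side by ranking on score first and document frequency second. The helper name `best_term_option` is hypothetical, and only the two options visible in the response are used; the elided ones are left out.

```python
from typing import Dict, List


def best_term_option(options: List[Dict]) -> Dict:
    """Pick the option with the highest score, breaking ties by document frequency."""
    return max(options, key=lambda o: (o["score"], o["freq"]))


# The two visible options for "oarh" from the term suggester response above.
options = [
    {"text": "oath", "score": 0.75, "freq": 1811},
    {"text": "oanh", "score": 0.75, "freq": 30},
]

best = best_term_option(options)
print(best["text"])
```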