Without looking into the internals of stemmer_override, I assumed it works similarly to the synonym token filter (and translates the given mapping rules into a SynonymMap in the same way), which turns out not to be the case:
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": [
            "reading => read",
            "swimming, swims => swim"
          ]
        },
        "stems": {
          "type": "stemmer_override",
          "rules": [
            "reading => read",
            "swimming, swims => swim"
          ]
        }
      }
    }
  }
}
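For context, a contraction rule such as "swimming, swims => swim" is registered by the synonym filter as (roughly) one SynonymMap entry per LHS token. A minimal sketch with Lucene's SynonymMap.Builder (simplified; not the actual Elasticsearch code):

import java.io.IOException;

import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

class SynonymMapSketch {
    // Roughly what "swimming, swims => swim" becomes: one entry per
    // comma-separated token on the LHS, all mapping to the RHS.
    static SynonymMap build() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup entries
        CharsRef stem = new CharsRef("swim");
        builder.add(new CharsRef("swimming"), stem, false); // false: replace, don't keep original
        builder.add(new CharsRef("swims"), stem, false);
        return builder.build();
    }
}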
Simple rules with a single token on the LHS work the same (both synonyms and stems output read for reading), but rules with multiple tokens on the LHS (also known as "contraction rules") do not:
GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard",
  "filter": ["synonyms"]
}
output:
{
  "tokens": [
    {
      "token": "swim",
      "start_offset": 0,
      "end_offset": 8,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}
GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard",
  "filter": ["stems"]
}
output:
{
  "tokens": [
    {
      "token": "swimming",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
There's of course a simple workaround for my use case (expanding each contraction rule into a sequence of single-token mapping rules), but the user experience is bad IMO.
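For reference, the expanded form of the stems filter above would be:

"stems": {
  "type": "stemmer_override",
  "rules": [
    "reading => read",
    "swimming => swim",
    "swims => swim"
  ]
}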
Although the documentation nowhere claims that "contraction rules" are supported by the stemmer override token filter, I find this behavior confusing. I would prefer a verbose error at filter registration over a "silent failure" at analysis time. But to be honest, I think that ideally stemmer_override should support contraction rules the same way the synonym token filter does.
Pinging @elastic/es-search (:Search/Analysis)
If you're OK with adding this feature (support for rules with multiple tokens on the LHS in the stemmer override token filter, so that they work like contraction rules in the synonym token filter), then I can prepare a PR for it.
This looks like a low-risk and low-effort change to me, although the issue is probably also low priority 🤷‍♂️
@telendt Thank you for submitting this issue.
What's happening is that the LHS of the rule is saved as is. So if you skip the tokenization, your rule gets applied:
GET test/_analyze
{
  "text": "swimming, swims",
  "tokenizer": "keyword",
  "filter": ["stems"]
}
returns:
{
  "tokens": [
    {
      "token": "swim",
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    }
  ]
}
It would indeed be nice for the stemmer_override filter to support contraction rules, but I think this should be done on the Lucene side. Elasticsearch passes rules to the underlying Lucene filters, and it is up to the Lucene filters to process these rules. So I would suggest submitting an issue in the Lucene Jira.
I will be closing this issue on the Elasticsearch side.
@mayya-sharipova:
What's happening is that the LHS of the rule is saved as is [...]
Yes, but only because you chose to do so.
AFAIK Solr's StemmerOverrideFilterFactory accepts a tab-separated dictionary file. There's no confusion there, as the format is different from the synonyms mapping format (comma-separated, with =>). You chose a similar format (with =>), hence the confusion.
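For reference, a Solr-style stemdict.txt looks something like this (tab-separated, one mapping per line; content illustrative):

reading	read
swimming	swim
swims	swim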
Is it common to stem tokens containing commas? If not, I don't see why:
swimming, swims => swim
could not result in:
builder.add("swimming", "swim");
builder.add("swims", "swim");
Yes, but only because you chose to do so.
Agreed, the parsing is done in the factory that is defined in Elasticsearch, so the decision is ours. We don't need to change anything in Lucene. I don't have a strong opinion on which way to go, but we shouldn't accept bad rules silently: we should either validate that the left side is a single term or accept a list of terms. Either way, the current situation is confusing, so I am reopening the issue.
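For illustration, the strict alternative would amount to a check like this at registration time (hypothetical sketch):

// Hypothetical strict variant: fail fast at filter registration instead of
// silently storing a multi-token LHS verbatim.
static void validateSingleTermLhs(String rule, String lhs) {
    if (lhs.trim().split("[,\\s]+").length > 1) {
        throw new IllegalArgumentException(
            "Invalid rule [" + rule + "]: the left side must be a single term");
    }
}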
Thanks @jimczi for weighing in.
@telendt ok, great! Your patch to elasticsearch is welcome.
Cool. It might take me some time to set up my environment and get familiar with the code, but I will try to provide a PR in the next few days.