Elasticsearch: predicate_token_filter: token.getPosition() returns the wrong value

Created on 27 Sep 2019 · 4 Comments · Source: elastic/elasticsearch

Elasticsearch version :
7.3.2 dockerized
image: docker.elastic.co/elasticsearch/elasticsearch:7.3.2

Plugins installed: none

JVM version (java -version):

openjdk version "12.0.2" 2019-07-16
OpenJDK Runtime Environment (build 12.0.2+10)
OpenJDK 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):

Linux 7f94601adc38 4.20.7-042007-generic #201902061234 SMP Wed Feb 6 17:36:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
I'm using a predicate_token_filter to keep only the first X tokens of a stream. For this I use the following filter configuration:

{ "myPredicatefilter": { "type": "predicate_token_filter", "script": { "source": "token.getPosition() <= 1" } } }

But every time I use this analyzer, the token positions seem to keep increasing, and after a few calls the filter no longer produces any tokens.

(A video attached to the original issue shows the problem.)

Steps to reproduce:

Complete index settings :

PUT issue-predicate-token-filter
{
  "settings": {
    "analysis": {
      "filter": {
        "myPredicatefilter": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.getPosition() <= 1"
          }
        }
      },
      "analyzer": {
        "myPredicateAnalyzer": {
          "filter": [
            "myPredicatefilter"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

analyze request :

POST issue-predicate-token-filter/_analyze
{
  "analyzer": "myPredicateAnalyzer",
  "text": "pain grill茅"
}

first result :

{
  "tokens" : [
    {
      "token" : "pain",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "grill茅",
      "start_offset" : 5,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    }
  ]
}


second result and all calls afterward :

{
  "tokens" : [ ]
}
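
The two calls above can be scripted so that the change between the first and later responses is easy to observe. This is only a sketch: it assumes an unsecured development node reachable at `http://localhost:9200` (the address is an assumption, not part of the report) and that the index from the settings above already exists. The helper names are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local dev node without auth
INDEX = "issue-predicate-token-filter"


def build_analyze_request(text):
    """Build (but do not send) the _analyze request from the report."""
    body = {"analyzer": "myPredicateAnalyzer", "text": text}
    return urllib.request.Request(
        url=f"{BASE_URL}/{INDEX}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_tokens(text):
    """Send the request and return only the token strings of the response."""
    with urllib.request.urlopen(build_analyze_request(text)) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]


def main(calls=3):
    """Repeat the same request and print what each call returns."""
    for call in range(1, calls + 1):
        print(call, analyze_tokens("pain grillé"))
```

Calling `main()` against an affected 7.3.2 node should print both tokens for the first call and an empty list afterwards, as described in this report.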

The analyzer works again for one call after a _close / _open of the index. Also, if I use explain: true in the analyze request, the analyzer works without any problem.

POST issue-predicate-token-filter/_analyze
{
  "analyzer": "myPredicateAnalyzer",
  "text": "pain grill茅",
  "explain": true
}

You can see the weird behavior by adding a Debug.explain to the filter script:

PUT issue-predicate-token-filter-with-debug
{
  "settings": {
    "analysis": {
      "filter": {
        "myPredicatefilter": {
          "type": "predicate_token_filter",
          "script": {
            "source": "Debug.explain(token.getPosition())"
          }
        }
      },
      "analyzer": {
        "myPredicateAnalyzer": {
          "filter": [
            "myPredicatefilter"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

POST issue-predicate-token-filter-with-debug/_analyze
{
  "analyzer": "myPredicateAnalyzer",
  "text": "pain grill茅",
  "explain": false
}

You will see the token.getPosition() value increasing after each call.
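
The symptom is consistent with a position counter that lives across calls. The following is a simplified Python model of that idea, not the Elasticsearch implementation: `ScriptToken` and `PredicateFilter` are hypothetical names, and the assumption that the counter starts at -1 and advances by one per token is mine.

```python
class ScriptToken:
    """Hypothetical stand-in for the `token` object the script sees."""

    def __init__(self):
        self.position = -1  # assumption: the first token then lands on 0

    def advance(self, position_increment):
        self.position += position_increment


class PredicateFilter:
    """Keeps tokens accepted by the predicate; never resets its token."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.token = ScriptToken()  # created once, reused by every call

    def analyze(self, text):
        kept = []
        for term in text.split():  # whitespace tokenizer
            self.token.advance(1)
            if self.predicate(self.token):
                kept.append((term, self.token.position))
        return kept


shared = PredicateFilter(lambda token: token.position <= 1)
first = shared.analyze("pain grillé")   # script sees positions 0 and 1
second = shared.analyze("pain grillé")  # script sees positions 2 and 3
print(first)
print(second)
```

In this model the second call returns nothing because the counter continues from where the first call stopped, which matches the empty result reported above.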

:Search/Analysis >bug

All 4 comments

Pinging @elastic/es-search

Looks like a reset is missing in ScriptFilteringTokenFilter. @romseygeek can you take a look?

Yes, we need to reset the Token state when the filter is reset - I'll open a PR.
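
As a rough illustration of what resetting the token state means, here is a small Python model in which the filter clears its token before each new stream. It is a sketch under assumptions, not the actual patch: the class names, the `reset` hooks and the starting value of -1 are all hypothetical.

```python
class ScriptToken:
    """Hypothetical stand-in for the `token` object the script sees."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.position = -1  # assumption: the first token then lands on 0

    def advance(self, position_increment):
        self.position += position_increment


class ResettingPredicateFilter:
    """Same filtering logic, but token state is cleared for each stream."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.token = ScriptToken()

    def reset(self):
        self.token.reset()  # the step the model of the bug lacks

    def analyze(self, text):
        self.reset()  # called once before consuming a new stream
        kept = []
        for term in text.split():  # whitespace tokenizer
            self.token.advance(1)
            if self.predicate(self.token):
                kept.append(term)
        return kept


fixed = ResettingPredicateFilter(lambda token: token.position <= 1)
results = [fixed.analyze("pain grillé") for _ in range(3)]
print(results)
```

With the reset in place, every call in this model starts counting from the same value, so repeated requests give the same tokens.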

Thanks for this fast resolution!
