Elasticsearch: Korean tokenizer (Nori) doesn't split digits and letters

Created on 5 Sep 2019 · 7 comments · Source: elastic/elasticsearch

Describe the feature

Elasticsearch version (bin/elasticsearch --version):

6.7.2

Plugins installed:

  • nori

JVM version (java -version):

jvm 1.8

OS version (uname -a if on a Unix-like system):

ubuntu 16.04

Description of the problem including expected versus actual behavior:

I wanted to analyze 44사이즈비키니 (44-size bikini), so I executed the following requests.

Steps to reproduce

  1. I checked that the nouns 사이즈 (size) and 비키니 (bikini) are tagged as NNP and NNG.
  2. I then composed them into '44사이즈비키니' and sent it for analysis with the nori plugin.

Elastic Settings And Analysis Result

Index settings

{
  "articles-alpha" : {
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "articles-alpha",
        "creation_date" : "1567669131498",
        "analysis" : {
          "analyzer" : {
            "korean" : {
              "filter" : [
                "lowercase",
              ],
              "type" : "custom",
              "tokenizer" : "nori_user_dict_tokenizer"
            }
          },
          "tokenizer" : {
            "nori_user_dict_tokenizer" : {
              "mode" : "mixed",
              "type" : "nori_tokenizer",
              "user_dictionary" : "nori/dict-service-noun"
            }
          }
        },
        "number_of_replicas" : "1"
      }
    }
  }
}

input

GET /articles-alpha/_analyze
{
  "text": "44์‚ฌ์ด์ฆˆ๋น„ํ‚ค๋‹ˆ",
  "analyzer": "korean",
  "explain": true
}

Output

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "nori_user_dict_tokenizer",
      "tokens" : [
        {
          "token" : "44์‚ฌ์ด์ฆˆ๋น„ํ‚ค๋‹ˆ",
          "start_offset" : 0,
          "end_offset" : 8,
          "type" : "word",
          "position" : 0,
          "bytes" : "[34 34 ec 82 ac ec 9d b4 ec a6 88 eb b9 84 ed 82 a4 eb 8b 88]",
          "leftPOS" : "UNKNOWN(Unknown)",
          "morphemes" : null,
          "posType" : "MORPHEME",
          "positionLength" : 1,
          "reading" : null,
          "rightPOS" : "UNKNOWN(Unknown)",
          "termFrequency" : 1
        }
      ]
    }
  }
}

Labels: :Search/Analysis, >bug

All 7 comments

Pinging @elastic/es-search

Thanks for reporting @drake-jin. This is clearly a regression caused by https://issues.apache.org/jira/browse/LUCENE-8548. I opened https://issues.apache.org/jira/browse/LUCENE-8966 to fix this, since digits should not be grouped with other types of characters.
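To illustrate the intended behavior, here is a minimal Python sketch (an illustration, not Lucene's actual implementation) of splitting a token wherever a run of digits meets a run of non-digit characters — the character-class grouping that the fix restores. The further dictionary segmentation of 사이즈비키니 into 사이즈/비키니 is nori's job and is not modeled here.

```python
import re

def split_digit_boundaries(token):
    # Break the token into alternating runs of digits and non-digits,
    # so "44사이즈비키니" yields the digit run "44" and the Hangul run
    # separately; nori would then segment the Hangul run by dictionary.
    return re.findall(r"\d+|\D+", token)

print(split_digit_boundaries("44사이즈비키니"))  # ['44', '사이즈비키니']
print(split_digit_boundaries("foo3"))           # ['foo', '3']
```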

@jimczi Could I ask what version is patched?

Thanks as always.

> Could I ask what version is patched?

The fix will be released in Lucene 8.3, so it should be available in a 7.x version of Elasticsearch. The earliest would be Elasticsearch 7.6, but there is no guarantee here.

It also doesn't split English letters (alphabetic characters) and digits:

PUT /test
{
  "settings" : {
    "number_of_shards" : "5",
    "analysis" : {
      "analyzer" : {
        "korean" : {
          "type" : "custom",
          "tokenizer" : "nori_user_dict_tokenizer"
        }
      },
      "tokenizer" : {
        "nori_user_dict_tokenizer" : {
          "mode" : "mixed",
          "type" : "nori_tokenizer"
        }
      }
    }
  }
}
GET /test/_analyze
{
  "text": ["foo3", "Foo3", "FOO3"],
  "tokenizer": "nori_user_dict_tokenizer"
}


@jimczi

It seems this issue is resolved, so you can close it now.
Good job.

# Tested on ES version 7.7.1
curl -X POST "http://localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "nori",
  "text": "44์‚ฌ์ด์ฆˆ๋น„ํ‚ค๋‹ˆ"
}
'
{
  "tokens" : [
    {
      "token" : "44",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "์‚ฌ์ด์ฆˆ",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "๋น„ํ‚ค๋‹ˆ",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    }
  ]
}

Thanks @drake-jin
