Elasticsearch version (bin/elasticsearch --version): 6.7.2
Plugins installed: []
JVM version (java -version): 1.8
OS version (uname -a if on a Unix-like system): Ubuntu 16.04
Description of the problem including expected versus actual behavior:
I wanted to analyze "44사이즈비키니" ("44 size bikini" :) ), so I executed the script below. My user dictionary contains 사이즈 (size) and 비키니 (bikini), and I verified that these nouns are tagged NNP and NNG.

Index settings:
{
  "articles-alpha" : {
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "articles-alpha",
        "creation_date" : "1567669131498",
        "analysis" : {
          "analyzer" : {
            "korean" : {
              "filter" : [
                "lowercase"
              ],
              "type" : "custom",
              "tokenizer" : "nori_user_dict_tokenizer"
            }
          },
          "tokenizer" : {
            "nori_user_dict_tokenizer" : {
              "mode" : "mixed",
              "type" : "nori_tokenizer",
              "user_dictionary" : "nori/dict-service-noun"
            }
          }
        },
        "number_of_replicas" : "1"
      }
    }
  }
}
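For context, the `user_dictionary` setting points at a plain-text file on the config path, one entry per line, with an optional segmentation after the surface form (the nori user-dictionary format). The entries below are illustrative, not the actual contents of `nori/dict-service-noun`:

```
사이즈
비키니
세종시 세종 시
```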
Input:
GET /articles-alpha/_analyze
{
  "text": "44사이즈비키니",
  "analyzer": "korean",
  "explain": true
}
Output:
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "nori_user_dict_tokenizer",
      "tokens" : [
        {
          "token" : "44사이즈비키니",
          "start_offset" : 0,
          "end_offset" : 8,
          "type" : "word",
          "position" : 0,
          "bytes" : "[34 34 ec 82 ac ec 9d b4 ec a6 88 eb b9 84 ed 82 a4 eb 8b 88]",
          "leftPOS" : "UNKNOWN(Unknown)",
          "morphemes" : null,
          "posType" : "MORPHEME",
          "positionLength" : 1,
          "reading" : null,
          "rightPOS" : "UNKNOWN(Unknown)",
          "termFrequency" : 1
        }
      ]
    }
  }
}
Pinging @elastic/es-search
Thanks for reporting @drake-jin . This is clearly a regression caused by https://issues.apache.org/jira/browse/LUCENE-8548. I opened https://issues.apache.org/jira/browse/LUCENE-8966 to fix this, since digits should not be grouped with other types of characters.
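To illustrate the character-grouping behavior the fix restores, here is a simplified sketch in Python. This is not the actual Lucene implementation (which uses MeCab character definitions, not `unicodedata`); it only shows why "44사이즈비키니" should yield a digit run and a Hangul run, with the Hangul run then segmented by the dictionary:

```python
import unicodedata

def char_class(ch):
    """Coarse character class: digit, Hangul, Latin, or other.

    A simplified stand-in for Lucene's character grouping.
    """
    if ch.isdigit():
        return "digit"
    if "HANGUL" in unicodedata.name(ch, ""):
        return "hangul"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"

def split_runs(text):
    """Split text into maximal runs of a single character class,
    mirroring the post-LUCENE-8966 behavior where digits are no
    longer grouped with adjacent letters."""
    runs = []
    for ch in text:
        cls = char_class(ch)
        if runs and runs[-1][1] == cls:
            runs[-1][0] += ch
        else:
            runs.append([ch, cls])
    return [r[0] for r in runs]

print(split_runs("44사이즈비키니"))  # → ['44', '사이즈비키니']
print(split_runs("foo3"))           # → ['foo', '3']
```

The regression from LUCENE-8548 effectively merged these runs, so the whole string came back as one UNKNOWN token that the user dictionary could never match.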
@jimczi Could I ask in which version this is patched?
Thanks as always.
> Could I ask in which version this is patched?
The fix will be released in Lucene 8.3, so it should be available in a 7.x version of Elasticsearch. The earliest would be Elasticsearch 7.6, but there is no guarantee here.
Also, it doesn't split English letters (alphabetic characters) from digits either:
PUT /test
{
  "settings" : {
    "number_of_shards" : "5",
    "analysis" : {
      "analyzer" : {
        "korean" : {
          "type" : "custom",
          "tokenizer" : "nori_user_dict_tokenizer"
        }
      },
      "tokenizer" : {
        "nori_user_dict_tokenizer" : {
          "mode" : "mixed",
          "type" : "nori_tokenizer"
        }
      }
    }
  }
}
GET /test/_analyze
{
  "text": ["foo3", "Foo3", "FOO3"],
  "tokenizer": "nori_user_dict_tokenizer"
}

@jimczi
It seems this issue is resolved, so you can close it.
Good job.
# Tested on ES version 7.7.1
curl -X POST "http://localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "nori",
  "text": "44사이즈비키니"
}
'
{
  "tokens" : [
    {
      "token" : "44",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "사이즈",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "비키니",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    }
  ]
}
Thanks @drake-jin