Hi
Elasticsearch sort doesn't return the correct order for some Persian characters, such as گ چ پ ژ.
Pinging @elastic/es-search-aggs
@mhmen it would be helpful if you could provide a short, complete reproduction that indexes documents and then highlights the incorrect behavior. Please also let us know which version of Elasticsearch you are using.
I had to install fonts to see the characters from the example. In case anyone else doesn't have the fonts, these are:
| Unicode character | Oct | Dec | Hex | HTML |
|--------------------------|----------|-------|-------|-------|
| arabic letter gaf | 03257 | 1711 | 0x6AF | گ|
| arabic letter tcheh | 03206 | 1670 | 0x686 | چ|
| arabic letter peh | 03176 | 1662 | 0x67E | پ|
| arabic letter jeh | 03230 | 1688 | 0x698 | ژ|
If it helps, a small reproduction looks like the following, in an empty cluster with the analysis-icu plugin:
PUT /i
{
"mappings": {
"_doc": {
"dynamic": "strict",
"properties": {
"f": {
"type": "keyword",
"fields": {
"sort_fa_IR": {
"country": "IR",
"language": "fa",
"type": "icu_collation_keyword",
"index": false
}
}
}
}
}
},
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
}
}
# 200 OK
# {
# "shards_acknowledged": true,
# "acknowledged": true,
# "index": "i"
# }
PUT /i/_doc/_bulk?refresh
{"index":{}}
{"f":"ژ"}
{"index":{}}
{"f":"پ"}
{"index":{}}
{"f":"چ"}
{"index":{}}
{"f":"گ"}
# 200 OK
GET /_search
{
"query": {
"match_all": {}
},
"sort": "f.sort_fa_IR"
}
# 200 OK
# {
# "took": 8,
# "_shards": {
# "skipped": 0,
# "successful": 1,
# "total": 1,
# "failed": 0
# },
# "timed_out": false,
# "hits": {
# "max_score": null,
# "total": 4,
# "hits": [
# {
# "_type": "_doc",
# "_score": null,
# "_source": {
# "f": "پ"
# },
# "_id": "qa8iVmUBDLtXL047cQBS",
# "_index": "i",
# "sort": [
# "A†倀\u0001"
# ]
# },
# {
# "_type": "_doc",
# "_score": null,
# "_source": {
# "f": "چ"
# },
# "_id": "qq8iVmUBDLtXL047cQBS",
# "_index": "i",
# "sort": [
# "ጠA†倀\u0001"
# ]
# },
# {
# "_type": "_doc",
# "_score": null,
# "_source": {
# "f": "ژ"
# },
# "_id": "qK8iVmUBDLtXL047cQBR",
# "_index": "i",
# "sort": [
# "ፀA†倀\u0001"
# ]
# },
# {
# "_type": "_doc",
# "_score": null,
# "_source": {
# "f": "گ"
# },
# "_id": "q68iVmUBDLtXL047cQBS",
# "_index": "i",
# "sort": [
# "፠A†倀\u0001"
# ]
# }
# ]
# }
# }
The resulting order is the same as the order of the Unicode code points, which is presumably incorrect, but it would be very helpful to know what order you expect for these documents.
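For reference, plain code-point ordering of these four letters can be checked outside Elasticsearch with a few lines of Python. This is only a sketch; the names `LETTERS` and `names_by_code_point` are made up for illustration, and the code points come from the table above.

```python
# The four letters from the table above, keyed by a short form of their Unicode name.
LETTERS = {
    "gaf": "\u06AF",    # 1711
    "tcheh": "\u0686",  # 1670
    "peh": "\u067E",    # 1662
    "jeh": "\u0698",    # 1688
}


def names_by_code_point(letters):
    """Return the letter names ordered by the code point of each letter."""
    return [name for name, ch in sorted(letters.items(), key=lambda kv: ord(kv[1]))]


if __name__ == "__main__":
    for name in names_by_code_point(LETTERS):
        print(name, ord(LETTERS[name]))
```

The order this produces (peh, tcheh, jeh, gaf) is the same order in which the four hits come back in the search response above.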
Thank you for your replies.
The resulting sort in the 4th comment is correct.
The problem I mean is that, for example, Unicode character 1662 sits between 1576 and 1578 in Persian alphabetical order, but a sort by Unicode code point puts 1662 after the other two.
The characters' alphabetical order is as below:
1662 is between 1576 and 1578,
1670 is between 1580 and 1581,
1688 is between 1586 and 1587,
1711 is between 1705 and 1604
and 1711 is another form of the Arabic letter kaf (Unicode 1603).
I don't know how it is handled internally, but the .NET OrderBy method sorts the Persian alphabet correctly.
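To make the difference concrete, here is a small Python sketch that compares code-point order with the alphabetical positions listed above. It only covers the twelve letters mentioned in this thread, not the full Persian alphabet, and the names `THREAD_ORDER`, `sort_by_code_point` and `sort_by_thread_order` are hypothetical. A real collator (such as ICU) handles far more than this.

```python
# Alphabetical order as described above, restricted to the letters named in
# this thread (decimal code points). NOT the full Persian alphabet.
THREAD_ORDER = [1576, 1662, 1578, 1580, 1670, 1581,
                1586, 1688, 1587, 1705, 1711, 1604]
RANK = {chr(cp): i for i, cp in enumerate(THREAD_ORDER)}


def sort_by_code_point(words):
    """Python compares strings by code point, so plain sorted() is enough."""
    return sorted(words)


def sort_by_thread_order(words):
    """Sort using the positions in THREAD_ORDER.

    Characters outside the list sort after the known ones, by code point.
    """
    def key(word):
        return [(RANK.get(ch, len(RANK)), ord(ch)) for ch in word]
    return sorted(words, key=key)


if __name__ == "__main__":
    letters = [chr(cp) for cp in (1578, 1662, 1576)]  # te, peh, beh
    print([ord(c) for c in sort_by_code_point(letters)])
    print([ord(c) for c in sort_by_thread_order(letters)])
```

With the three letters 1576, 1662 and 1578, the first function places 1662 last, while the second places it between the other two, which is the order expected above.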
This Wikipedia article gives a good idea of what the Persian alphabet looks like:
https://en.wikipedia.org/wiki/Persian_alphabet
I think the same problem exists in Java:
https://stackoverflow.com/questions/43497368/sorting-arraylist-in-alphabetical-orderpersian
@mhmen It would be very helpful if you followed the new-issue template and gave us steps to reproduce the problem, showing where the bug is and what output you expect, so that we don't have to guess. So far, unfortunately, we have not been able to identify the problem.
> But the problem I say is that for example the Unicode character 1662 in Persian alphabetical sort is located between 1576 and 1578, and this puts the character 1662 after the other two if sort is done by the Unicode code points.
> The characters' alphabetical order is as below:
> 1662 is between 1576 and 1578,
So, I've got three characters:
1662 - پ - pe
1576 - ب - be
1578 - ت - te
As you noted, according to the rules of the Persian alphabet they should be sorted as: 1576 - ب - be, 1662 - پ - pe, 1578 - ت - te. That is exactly the order you get if you use "icu_collation_keyword", as @DaveCTurner suggested. So the solution is simply to use "icu_collation_keyword".
PUT /i
{
"mappings": {
"_doc": {
"dynamic": "strict",
"properties": {
"f": {
"type": "keyword",
"fields": {
"sort_fa_IR": {
"country": "IR",
"language": "fa",
"type": "icu_collation_keyword",
"index": false
}
}
}
}
}
},
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
}
}
PUT /i/_doc/_bulk?refresh
{ "index" : { "_index" : "i", "_type" : "_doc", "_id" : "1" } }
{"f": "ب"}
{ "index" : { "_index" : "i", "_type" : "_doc", "_id" : "2" } }
{"f": "ت"}
{ "index" : { "_index" : "i", "_type" : "_doc", "_id" : "3" } }
{"f": "پ"}
GET i/_search
{
"query": {
"match_all": {}
},
"sort": "f.sort_fa_IR"
}
{
"took": 77,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "mayyatest0",
"_type": "_doc",
"_id": "1",
"_score": null,
"_source": {
"f": "ب"
},
"sort": [
"ጏA†倀"
]
},
{
"_index": "mayyatest0",
"_type": "_doc",
"_id": "3",
"_score": null,
"_source": {
"f": "پ"
},
"sort": [
"A†倀"
]
},
{
"_index": "mayyatest0",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"f": "ت"
},
"sort": [
"ጕA†倀"
]
}
]
}
}
If we are missing something, please let us know with the exact steps to reproduce the problem.
Thank you very much
It seems to be working.
One more question: can I change the mapping this way without reindexing the data that was indexed before?
@mhmen you are welcome.
> can I change the mapping in this way without reindexing the data indexed before?

No, it is not possible. You can add another field as part of a multi-field (as shown in the example), but it will only affect newly inserted documents; for previously inserted documents the new field will be empty.
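If the existing documents do need the new sub-field, one documented option is the Update By Query API, which re-indexes each document in place from its `_source`. That is still a re-index of every document, just without copying to a new index. Below is a minimal Python sketch of the two requests involved. It assumes a 6.x cluster with the `_doc` type as in this thread and a node at `http://localhost:9200`; the helper names are hypothetical and nothing is sent unless `send` is called.

```python
import json
import urllib.request

INDEX = "i"  # index name used in the examples in this thread


def put_mapping_request():
    """Build the request that adds the ICU sort sub-field to the existing field `f`."""
    body = {
        "properties": {
            "f": {
                "type": "keyword",
                "fields": {
                    "sort_fa_IR": {
                        "type": "icu_collation_keyword",
                        "language": "fa",
                        "country": "IR",
                        "index": False,
                    }
                },
            }
        }
    }
    # Typed mapping path, as used by 6.x (assumption: same version as the thread).
    return "PUT", "/%s/_mapping/_doc" % INDEX, json.dumps(body)


def update_by_query_request():
    """Build the request that re-indexes existing documents in place."""
    return "POST", "/%s/_update_by_query?conflicts=proceed" % INDEX, ""


def send(method, path, body, base_url="http://localhost:9200"):
    """Send one request to the cluster (base_url is an assumption)."""
    data = body.encode("utf-8") if body else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `send(*put_mapping_request())` followed by `send(*update_by_query_request())` would apply both steps; on a large index the second call can take a while, so it is worth trying on a test index first.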
Closing this ticket, as it is not an issue!