Elasticsearch version 6.5.1
Plugins installed: []
JVM version JVM 1.8.0_192
OS version Debian 8.11
Description of the problem including expected versus actual behavior:
bucket_sort based on its _count, if doc_count are equals the item that should be in first position is the last one, other items are in the right order.Steps to reproduce:
curl -XPUT 'http://localhost:9200/messages' -H 'Content-Type: application/json' -d '
{
"settings": {
"number_of_shards": 1,
"number_of_replicas" : 0
},
"mappings": {
"user": {
"properties": {
"rank": {
"type": "keyword"
}
}
}
}
}'
user with a different rank each time (from a to d) :user with rank acurl -XPUT 'http://localhost:9200/messages/user/001' -H 'Content-Type: application/json' -d '
{
"@timestamp": "2018-12-06T01:00:00+01:00",
"rank": "a"
}'
user with rank bcurl -XPUT 'http://localhost:9200/messages/user/002' -H 'Content-Type: application/json' -d '
{
"@timestamp": "2018-12-06T01:00:00+01:00",
"rank": "b"
}'
user with rank ccurl -XPUT 'http://localhost:9200/messages/user/003' -H 'Content-Type: application/json' -d '
{
"@timestamp": "2018-12-06T01:00:00+01:00",
"rank": "c"
}'
user with rank dcurl -XPUT 'http://localhost:9200/messages/user/004' -H 'Content-Type: application/json' -d '
{
"@timestamp": "2018-12-06T01:00:00+01:00",
"rank": "d"
}'
rank property and sort by count with a bucket_sort :curl -XGET 'http://localhost:9200/messages/_search?pretty=true' -H 'Content-Type: application/json' -d '
{
"size": 0,
"aggregations": {
"top_by_color": {
"composite": {
"size": 1000,
"sources": [
{
"rank": {
"terms": {
"field": "rank",
"missing_bucket": true,
"order": "asc"
}
}
}
]
},
"aggregations": {
"top_bucket_sort": {
"bucket_sort": {
"sort": [
{
"_count": {
"order": "asc"
}
}
],
"size": 1000
}
}
}
}
}
}'
Aggregation result is :
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"top_by_color": {
"after_key": {
"rank": "d"
},
"buckets": [
{
"key": {
"rank": "b"
},
"doc_count": 1
},
{
"key": {
"rank": "c"
},
"doc_count": 1
},
{
"key": {
"rank": "d"
},
"doc_count": 1
},
{
"key": {
"rank": "a"
},
"doc_count": 1
}
]
}
}
}
We have ranks sorted has : b, c, d, a instead of a, b, c, d.
Pinging @elastic/es-analytics-geo
The sort order is based on count and there is no guarantee that equals element will not be reordered as a result of the sort. We could change to a stable sort or change the documentation to explain how the sort works but why are you using the composite aggregation and the bucket sort instead of a terms aggregation sorted by _count ? In the latter case you could add a tiebreaker to your sort based on the _key to make sure that equals count are sorted based on the value of the term ?
This is a simplified version of my real use case.
I need to aggregate 3 different fields together. I tried with terms but the result of 3 nested aggregations is a pain to parse and the composite aggregation seems to be the best solution for this scenario.
What seems strange to me is that on every attempts I tried, everything is sorted as expected except the first one which is misplaced at the end. It is like if the internal ElasticSearch "loop" that sort every buckets do the right thing except for the first item.
It is like if the internal ElasticSearch "loop" that sort every buckets do the right thing except for the first item.
I didn't look carefully the implementation of the priority queue we use to perform the sort but as I said in my previous comment the output of the sort is correct. Though I wonder why we use a priority queue rather than a list and then Collections#sort which provides stable sort.
@dimitris-athanasiou any reason to prefer a priority queue here ? I understand that it can be faster if the buckets are truncated but that seems an edge case ?
@jimczi Indeed, the reason a priority queue was used was because it seemed suitable for also dealing with trimmed buckets in an efficient way. I vaguely recall discussing this and deciding it was a desirable trade-off against sort stability.
We discussed this offline and we agreed that we should switch to a List and use Collections#sort. This should preserve the order for equal values and simplify the code since we don't really need a priority queue.
I would like to work on it.
I would like to work on this issue if @govi20 is not working on it.
Looks like someone just sent a PR for this (#36748)
If you're still interested in contributing, you can generally just leave a note that says "I'm going to start working on this" then raise a PR when you are ready. No need to get approval first.
Thanks for helping out! We have more "help wanted" and "good first issue" tickets if you want to look around for other bugs that need fixing :)
can we still work on issue or raise PR if there is already a PR raised.
I would avoid working on issues that have an active PR. Sometimes a PR might go dormant (contributor gets busy and abandons the PR, technical issues make it close, etc), in which case we'll close the PR and can be worked on again.
But if the PR is active and being worked on, it's probably best to choose a different issue to work on.
Most helpful comment
We discussed this offline and we agreed that we should switch to a List and use
Collections#sort. This should preserve the order for equal values and simplify the code since we don't really need a priority queue.