Elasticsearch: bucket_sort aggregation misplace the first bucket

Created on 6 Dec 2018 · 11Comments · Source: elastic/elasticsearch

Elasticsearch version 6.5.1

Plugins installed: []

JVM version JVM 1.8.0_192

OS version Debian 8.11

Description of the problem including expected versus actual behavior:

Actual behavior :
When sorting an aggregation with a bucket_sort based on its _count, if doc_count are equals the item that should be in first position is the last one, other items are in the right order.
Expected :
Every items should be in the right order, including the first one.

Steps to reproduce:

Create a basic index :

curl -XPUT 'http://localhost:9200/messages' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas" : 0
  },
  "mappings": {
    "user": {
      "properties": {
        "rank": {
          "type": "keyword"
        }
      }
    }
  }
}'

Insert one user with a different rank each time (from a to d) :

user with rank a

curl -XPUT 'http://localhost:9200/messages/user/001' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "a"
}'

user with rank b

curl -XPUT 'http://localhost:9200/messages/user/002' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "b"
}'

user with rank c

curl -XPUT 'http://localhost:9200/messages/user/003' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "c"
}'

user with rank d

curl -XPUT 'http://localhost:9200/messages/user/004' -H 'Content-Type: application/json' -d '
{
  "@timestamp": "2018-12-06T01:00:00+01:00",
  "rank": "d"
}'

Aggregate on the rank property and sort by count with a bucket_sort :

curl -XGET 'http://localhost:9200/messages/_search?pretty=true' -H 'Content-Type: application/json' -d '
    {
      "size": 0,
      "aggregations": {
        "top_by_color": {
          "composite": {
            "size": 1000,
            "sources": [
              {
                "rank": {
                  "terms": {
                    "field": "rank",
                    "missing_bucket": true,
                    "order": "asc"
                  }
                }
              }
            ]
          },
          "aggregations": {
            "top_bucket_sort": {
              "bucket_sort": {
                "sort": [
                  {
                    "_count": {
                      "order": "asc"
                    }
                  }
                ],
                "size": 1000
              }
            }
          }
        }
      }
    }'

Aggregation result is :

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 4,
        "max_score": 0,
        "hits": [

        ]
      },
      "aggregations": {
        "top_by_color": {
          "after_key": {
            "rank": "d"
          },
          "buckets": [
            {
              "key": {
                "rank": "b"
              },
              "doc_count": 1
            },
            {
              "key": {
                "rank": "c"
              },
              "doc_count": 1
            },
            {
              "key": {
                "rank": "d"
              },
              "doc_count": 1
            },
            {
              "key": {
                "rank": "a"
              },
              "doc_count": 1
            }
          ]
        }
      }
    }

We have ranks sorted has : b, c, d, a instead of a, b, c, d.

:AnalyticAggregations >feature good first issue help wanted

Source

odespesse

Most helpful comment

We discussed this offline and we agreed that we should switch to a List and use Collections#sort. This should preserve the order for equal values and simplify the code since we don't really need a priority queue.

jimczi on 10 Dec 2018

👍2

All 11 comments

Pinging @elastic/es-analytics-geo

elasticmachine on 7 Dec 2018

The sort order is based on count and there is no guarantee that equals element will not be reordered as a result of the sort. We could change to a stable sort or change the documentation to explain how the sort works but why are you using the composite aggregation and the bucket sort instead of a terms aggregation sorted by _count ? In the latter case you could add a tiebreaker to your sort based on the _key to make sure that equals count are sorted based on the value of the term ?

jimczi on 7 Dec 2018

This is a simplified version of my real use case.

I need to aggregate 3 different fields together. I tried with terms but the result of 3 nested aggregations is a pain to parse and the composite aggregation seems to be the best solution for this scenario.

What seems strange to me is that on every attempts I tried, everything is sorted as expected except the first one which is misplaced at the end. It is like if the internal ElasticSearch "loop" that sort every buckets do the right thing except for the first item.

odespesse on 7 Dec 2018

It is like if the internal ElasticSearch "loop" that sort every buckets do the right thing except for the first item.

I didn't look carefully the implementation of the priority queue we use to perform the sort but as I said in my previous comment the output of the sort is correct. Though I wonder why we use a priority queue rather than a list and then Collections#sort which provides stable sort.
@dimitris-athanasiou any reason to prefer a priority queue here ? I understand that it can be faster if the buckets are truncated but that seems an edge case ?

jimczi on 7 Dec 2018

@jimczi Indeed, the reason a priority queue was used was because it seemed suitable for also dealing with trimmed buckets in an efficient way. I vaguely recall discussing this and deciding it was a desirable trade-off against sort stability.

dimitris-athanasiou on 7 Dec 2018

jimczi on 10 Dec 2018

👍2

I would like to work on it.

govi20 on 11 Dec 2018

I would like to work on this issue if @govi20 is not working on it.

SivagurunathanV on 18 Dec 2018

Looks like someone just sent a PR for this (#36748)

If you're still interested in contributing, you can generally just leave a note that says "I'm going to start working on this" then raise a PR when you are ready. No need to get approval first.

Thanks for helping out! We have more "help wanted" and "good first issue" tickets if you want to look around for other bugs that need fixing :)

$polyfractal picture$ polyfractal on 18 Dec 2018

can we still work on issue or raise PR if there is already a PR raised.

Razi007 on 18 Dec 2018

I would avoid working on issues that have an active PR. Sometimes a PR might go dormant (contributor gets busy and abandons the PR, technical issues make it close, etc), in which case we'll close the PR and can be worked on again.

But if the PR is active and being worked on, it's probably best to choose a different issue to work on.

$polyfractal picture$ polyfractal on 18 Dec 2018

👍1

Was this page helpful?

0 / 5 - 0 ratings