Elasticsearch: Sort term aggregation with nested aggregation in order path

Created on 27 Feb 2016  ·  44 Comments  ·  Source: elastic/elasticsearch

Right now the aggregation below is not possible, even though 'nested_agg' returns a single bucket and the nested aggregation is a single-bucket aggregation.

The sample aggregation below fails with an error message saying that 'nested_agg' is not a single-bucket aggregation and cannot be used in the order path.

{
  buckets: {
    terms: {
      field: 'docId',
      order: {'nested_agg>sum_value': 'desc'}
    },
    aggs: {
      nested_agg: {
        nested: {
          path: 'my_nested_object'
        },
        aggs: {
          sum_value: {
            sum: {field: 'my_nested_object.value'}
          }
        }
      }
    }
  }
}
:AnalyticAggregations >bug high hanging fruit

All 44 comments

@colings86 any thoughts on this?

I managed to reproduce this on the master branch and now know why it is happening, but I don't have a solution for how to fix it, short of documenting that you can't order by an aggregation within a nested aggregation.

The issue is that the NestedAggregatorFactory.createInternal() method calls AggregatorFactory.asMultiBucketAggregator(). This creates a wrapper around the NestedAggregator that will create a separate instance of NestedAggregator for each parent bucket. We do this in the NestedAggregator to ensure the doc ids are delivered in order because with a single instance and multi-valued nested fields we could get documents not in order. Some of the aggregations rely on the fact that documents are collected in order. For example, we could collect (doc1, bucket1), (doc2, bucket1), (doc1, bucket2), (doc2, bucket2) which would be out of order, so by having separate instances we are guaranteeing docId order since each instance will only collect one bucket.

I tried to change the AggregationPath.validate() method to use the underlying aggregator (the first instance of it at least), but then it fails later because we need to retrieve the value from the aggregator and there is no way of getting the value of a particular instance from the wrapper.

I managed to somehow face this issue again. The "path" parameter in moving average can not point to a nested aggregation because nested aggregation is not a single bucket aggregation.

Is there any way to get around this 'issue'?? I'm running into the same issues

@clintongormley @colings86 Any update on this please? We are badly stuck without this...

+1 for this as well. Very big use case scenario for us.

+1 as this is a showstopper for us to upgrade from ES v1 to ES v2

We were bitten by the same thing. FWIW, we worked around it temporarily by ordering the aggregations in the application code after they are returned from ES.
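
A minimal sketch of that client-side workaround, assuming the response shape produced by the sample aggregation at the top of this issue (a terms aggregation named "buckets" with a "nested_agg" sub-aggregation containing "sum_value"). The function name and the sample data are hypothetical. Note that the terms aggregation only returns its top buckets (by document count, by default), so the request's "size" has to be large enough to include every bucket that might rank highly by the nested sum.

```python
def sort_buckets_by_nested_sum(response, descending=True):
    """Sort terms buckets by the value of nested_agg>sum_value on the client."""
    buckets = response["aggregations"]["buckets"]["buckets"]
    return sorted(
        buckets,
        key=lambda bucket: bucket["nested_agg"]["sum_value"]["value"],
        reverse=descending,
    )


# Hypothetical response, trimmed to the fields the sort needs.
sample_response = {
    "aggregations": {
        "buckets": {
            "buckets": [
                {"key": "doc-a", "doc_count": 5,
                 "nested_agg": {"doc_count": 9, "sum_value": {"value": 12.0}}},
                {"key": "doc-b", "doc_count": 3,
                 "nested_agg": {"doc_count": 4, "sum_value": {"value": 40.0}}},
                {"key": "doc-c", "doc_count": 1,
                 "nested_agg": {"doc_count": 2, "sum_value": {"value": 25.5}}},
            ]
        }
    }
}

sorted_keys = [b["key"] for b in sort_buckets_by_nested_sum(sample_response)]
print(sorted_keys)
```

Because every candidate bucket has to be fetched before sorting, this only scales while the number of distinct terms stays manageable.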

This missing feature is preventing us from upgrading to ES 2.X. Are there any plans to support this in the near future?

@clintongormley Nested architecture is important functionality in ES. Most companies build at least some functionality on nested mappings. This bug renders ES useless. Any updates?

+1 sorting after the fact in our application isn't a viable option due to number of results.

I have made a fix for sorting with nested aggregations in the order path. You can also have multi-value buckets in the path (just specify the bucket key in the path, like "colors.red>stats.variance").
I might create a pull request, or just give a link to the commit in my fork of ES 5.1.2, if anyone is interested.

That would be great, or link in your fork?

I might create a pull request

:+1:

As I'm not a contributor, I will just share my commit to ES 5.1 branch here. Please let me know if you have any questions.

https://github.com/elastic/elasticsearch/commit/8f601a3c241cb652a889870d93fd32b3d226ef41

@idozorenko feel free to submit a PR so that we can review the code - thanks

Does this problem also occur in reverse_nested aggs? (not direct nested)

yes

+1 for this issue. We have the exact same problem, and the same query runs fine on AWS ES 1.5.

Can anyone tell us the status of this bug fix? This is a very critical feature for us and we cannot move forward without it. Would anyone suggest using ES 1.5 instead of the 5.X version? (Personally I do not think we should do this.)

We didn't want to wait for a fix or to upgrade so we ended up restructuring
our data to be a parent/child relationship vs nested. So far so good.

@brettahale, please note that "parent-child relations can make queries hundreds of times slower" according to the ES documentation: https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-search-speed.html.

We ran a POC and confirmed this: with 5-6 billion documents, the parent/child query takes 6-7 seconds to return results, whereas the nested-document query returns results in 400 ms.

Agreed, wasn't ideal but we were able to deliver our feature. I'd like to
see a fix here as well but after a chat with ES support, it sounded like a
foundational change that wasn't likely going to get fixed anytime soon.

Hello team, any update on this?

Our workaround was to copy some fields from the parent docs into their nested docs, so we can still run a terms aggregation on the nested docs but sort by fields "found" in the parent docs. This allowed us to omit the reverse_nested aggregation between the bucket and the sub-bucket. Not ideal, but it works in our cases. Of course, a fix would be much appreciated.

Edit:
Apparently, there was no need for a workaround in my case.
See the comment below.

@colings86: Actually, I did manage to use a Terms aggregation with a sub Reverse_Nested aggregation and a sub Cardinality aggregation for ordering. Something like this:

{
    "aggs" : {
        "AllMovieNames" : {
            "terms" : { "field" : "MovieName" },
            "order": {
            "backToActors>distinctActorsCount":"desc"
            },
           "size": 10
        },
        "aggs":
        {
            "backToActors":{        
                "reverse_nested":{},
                "aggs":{
                    "distinctActorsCount":{
                        "cardinality":{
                            "field":"ActorName"
                        }
                    }
                }           
            }   
        }           
    }   
}

No exception message was thrown, and the order was just as expected.

Are you sure the problem occurs in both Nested and Reverse_Nested sub-aggregations?
I'm using ElasticSearch 5.3.2.

@IdanWo sorry, actually you are right: this problem doesn't occur with the reverse_nested aggregation. The only single-bucket aggregation it should affect is the nested aggregation, because that's the only single-bucket aggregation that uses AggregatorFactory.asMultiBucketAggregator()

I'm another victim of this insidious bug:

Situation:

  • Document ROOT with two nested documents NESTED1 and NESTED2.
  • Term aggregation over a field in ROOT.NESTED1 (nested aggregation -to NESTED1- then term aggregation)
  • Sum aggregation over a field in ROOT.NESTED2 (inside the previous term aggregation, reverse nested aggregation -back to ROOT-, nested aggregation -to NESTED2- then sum aggregation)

I cannot use the sum aggregation to sort the term aggregation, because an error is thrown saying that the nested aggregation -to NESTED2- is not a single-bucket aggregation.

Can someone update us with the status of this bug?

(I'm using ElasticSearch 5.4)

Hello Team, any update on this Bug fix? It is not working only in 5.x version. Can you please provide ETA for this bug to be fixed? This is very important feature for term aggregation and blocking many clients.

Hi All, I was able to get around this by doing something similar to the following. For this, there was only one nested value that will match the interval condition - but you could get creative :)

Mapping:

    "trendingpopularityjson": {
      "type": "nested",
      "include_in_parent": true,
      "properties": {
        "interval": {
          "type": "integer"
        },
        "trendingpopularity": {
          "type": "integer"
        }
      }
    }

Aggregation to sum inside. This would avoid the nested aggregation - making it easy :

                    "Trend": {
                      "sum": {
                        "script": {
                          "inline": "def d = doc['trendingpopularityjson.interval']; for (int i = 0; i < d.length; ++i) { if (d[i] == params.interval) { return doc['trendingpopularityjson.trendingpopularity'][i] } }",
                          "params": {
                            "interval": 2
                          },
                          "lang": "painless"
                        }
                      }
                    }

@IdanWo could you post your full query? The syntax in your message doesn't seem correct. Thanks.
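
One plausible reading of @IdanWo's query, written as a Python dict so the structure can be checked and serialized. This is a guess at the intent, not the query that was actually run: it assumes "order" and "size" belong inside "terms", and that the sub-aggregations belong under "AllMovieNames". Since reverse_nested is only valid inside a nested aggregation, the whole block presumably sat inside an enclosing nested aggregation that was left out of the comment.

```python
import json

# Assumed intended structure; field and aggregation names are taken from the comment.
movie_aggs = {
    "aggs": {
        "AllMovieNames": {
            "terms": {
                "field": "MovieName",
                # Order buckets by the cardinality computed under reverse_nested.
                "order": {"backToActors>distinctActorsCount": "desc"},
                "size": 10,
            },
            "aggs": {
                "backToActors": {
                    "reverse_nested": {},
                    "aggs": {
                        "distinctActorsCount": {
                            "cardinality": {"field": "ActorName"}
                        }
                    },
                }
            },
        }
    }
}

print(json.dumps(movie_aggs, indent=2))
```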

@colings86 Is reverse_nested buggy because it doesn't use AggregatorFactory.asMultiBucketAggregator() thus allowing collecting docs out of order? What aggregations rely on documents being collected in order?

@mattweber sorry it's taken a while to reply to this. @martijnvg and I spoke about whether reverse_nested would suffer from the same out-of-order problem, and we noted that the reason the nested aggregator has this issue is that it uses an iterator rather than a bit-set for the child filter in NestedAggregator. ReverseNestedAggregator uses a bit-set for its parent filter, so it can cope with random access (i.e. doc ids out of order) and doesn't need to use AggregatorFactory.asMultiBucketAggregator().

Thank you @colings86!

Would it be possible for NestedAggregator to use a bit-set so we can remove this limitation? If no, can you give an example aggregation that relies on doc order?

For my use case, I need to calculate a metric against a nested field and then sort a terms aggregation by it. It does not rely on a specific order. I have simply forked the nested aggregation into a plugin and removed the asMultiBucketAggregator so it acts like reverse_nested. No issues with my specific use-case that I have found so far.

@mattweber Any nested aggregation which is a sub-aggregation to a terms aggregation which is working on a multi-value field could exhibit this out of order problem because we would process all the nested docs for the first terms before we then reprocess the nested documents for the second term on the document with multiple values for this field.

As an example, imagine an index with 1 root doc which has 2 nested docs, both of which contain the terms foo and bar for fieldA. Because we store nested docs before the root doc in the index, the root document would have lucene doc id 2 and the child docs would have lucene doc ids 0 and 1.

Now if we perform a terms aggregation on fieldA with a sub-aggregation of type nested, we would do the following:

  1. The terms aggregation would collect doc 2.
  2. The terms aggregation would get the first value foo and create a bucket for it with bucket ordinal 0.
  3. The nested sub-aggregation would be called for doc id 2 and bucket ordinal 0.
  4. The nested aggregation would collect document 0 for bucket ordinal 0.
  5. The nested aggregation would collect document 1 for bucket ordinal 0.
  6. The nested aggregation would return and the terms aggregation would move to the next value.
  7. The terms aggregation would get the second value bar and create a bucket for it with bucket ordinal 1.
  8. The nested sub-aggregation would be called for doc id 2 and bucket ordinal 1.
  9. The nested aggregation would collect document 0 for bucket ordinal 1.
  10. The nested aggregation would collect document 1 for bucket ordinal 1.

This is a problem because in the nested aggregator we have collected document 0 then document 1 and then returned to document 0 and document 1 again (sorry if this was already clear from my comment before but I thought this was a better way of explaining it).
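
The walk-through can be replayed with a toy model in Python. This is only an illustration of the collection order, not Elasticsearch code; the function names are made up. It records the (child doc id, bucket ordinal) pairs a single shared nested aggregator would see, then checks whether the doc ids arrive in order, both overall and per bucket ordinal (which is what one aggregator instance per bucket would see).

```python
def collect_with_shared_aggregator(child_doc_ids, terms):
    """Pairs (child_doc_id, bucket_ordinal) in the order one shared instance collects them."""
    collected = []
    for bucket_ordinal, _term in enumerate(terms):
        # The nested aggregator is invoked once per bucket for the same root doc,
        # so it walks the same child docs again for every term.
        for child_doc_id in child_doc_ids:
            collected.append((child_doc_id, bucket_ordinal))
    return collected


def doc_ids_in_order(collected):
    """True if doc ids never go backwards."""
    doc_ids = [doc_id for doc_id, _ in collected]
    return all(a <= b for a, b in zip(doc_ids, doc_ids[1:]))


# Root doc 2 has child docs 0 and 1; fieldA holds the terms foo and bar.
shared = collect_with_shared_aggregator(child_doc_ids=[0, 1], terms=["foo", "bar"])
print(shared)                    # the single-instance collection order
print(doc_ids_in_order(shared))  # doc ids go 0, 1, 0, 1

# One instance per bucket: each instance only sees its own bucket's pairs.
per_bucket = {
    ordinal: [pair for pair in shared if pair[1] == ordinal] for ordinal in (0, 1)
}
print(all(doc_ids_in_order(pairs) for pairs in per_bucket.values()))
```

In this model the shared instance sees doc ids go backwards, while each per-bucket instance sees them strictly in order, which is the guarantee the wrapper provides.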

We actually spoke today about a possible solution that would let us stop using AggregatorFactory.asMultiBucketAggregator() in the nested aggregation (and so solve this issue) by temporarily caching the bucket ordinals needed for each root doc as they are received by the nested aggregator. We can do this because we know that the nested aggregator will be called for all relevant bucket ordinals for a particular root document before moving on to the next root document. If we did this, we would still be able to keep the iterator for the child filter while making sure that we always use it in a streaming manner rather than a random-access manner. This should hopefully have a minimal effect on performance. We still need to dig into this idea and see whether it has the potential we hope, but with luck it will let us solve this issue soon.
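
A rough Python sketch of the buffering idea described here, under simplifying assumptions: the class and method names are hypothetical, child docs are looked up from a plain dict, and the real change may differ in every detail. The collector remembers the bucket ordinals requested for the current root doc and only walks that doc's children once, when the next root doc arrives (or at the end).

```python
class BufferingNestedCollector:
    """Toy model: cache bucket ordinals per root doc, then stream its children once."""

    def __init__(self, children_of):
        self.children_of = children_of  # root doc id -> child doc ids in index order
        self.current_root = None
        self.pending_ordinals = []
        self.collected = []  # (child_doc_id, bucket_ordinal) in collection order

    def collect(self, root_doc_id, bucket_ordinal):
        if root_doc_id != self.current_root:
            # A new root doc means every ordinal for the previous one has been seen.
            self.flush()
            self.current_root = root_doc_id
        self.pending_ordinals.append(bucket_ordinal)

    def flush(self):
        if self.current_root is None:
            return
        # Visit each child exactly once and hand it to every buffered bucket.
        for child_doc_id in self.children_of[self.current_root]:
            for bucket_ordinal in self.pending_ordinals:
                self.collected.append((child_doc_id, bucket_ordinal))
        self.pending_ordinals = []


collector = BufferingNestedCollector(children_of={2: [0, 1]})
collector.collect(root_doc_id=2, bucket_ordinal=0)  # term "foo"
collector.collect(root_doc_id=2, bucket_ordinal=1)  # term "bar"
collector.flush()  # end of segment: process the last root doc
print(collector.collected)
```

With the example index from the walk-through, the children are now visited as 0, 0, 1, 1, so the child iterator only ever moves forward.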

Just for clarity, as it might have gotten lost in the wall of text above: the problem with out-of-order docs occurs if you nest a nested aggregation under a multi-bucket aggregation that is working on a multi-value field (a field where at least one doc has multiple values). Note that if you have removed asMultiBucketAggregator, this could cause problems in any multi-bucket aggregation where the above condition is met, not just the terms aggregation, and not just when you order by the nested aggregation. If your use case only performs nested aggregations on single-value fields it should work, though obviously in ES itself we need to be more generic.

Yes it does @colings86, thank you for the explanation. I see #26683, thank you and @martijnvg for moving forward on this!

Not sure about the branching model here, would this merge be in v5.6.2?

@martijnvg Thanks for moving ahead with a fix. For people who are hitting this issue now, could you clarify what merging to master means? Will this be backported to 5.x or 6.x? Just having a quick look at the ES branches, and it looks like master, 5.x and 6.x are all diverged. Not sure if I can just run a build from master (I'm assuming 7.x is not fully baked?), if I can try to merge this commit back to a 5.x branch, or if I'm just SOL for now and need to come up with an alternative mapping solution. :)

@pithyless The change to the nested aggregation has been backported to the 6.x branch and will be available in ES 6.1. As this change is not a simple fix, it will not be backported to the 6.0 branch (due to feature freeze) or the 5.6 branch (bug fixes only).

@martijnvg Thanks for clarifying this

All,
We found an alternative solution for this issue. We copy the field on which we need to apply the term aggregation into the nested document, and then apply the term aggregation on the nested document. This gives us the right result and everything works fine. Please find a sample document mapping and query below.

Using the bucket_filter we can also achieve pagination and filtering with a term aggregation query, which is another limitation of ElasticSearch.

Document mapping


{
  "properties": {
    "aggregatedData": {
      "type": "nested",
      "properties": {
        "termAggreagtionWillBeAppliedOnThisField": {
          "type": "double"
        },
        "currencyUnit": {
          "type": "keyword"
        }
      }
    },
    "thisIsFieldInOuterDocumentOnWhichWeNeedToApplyTermAggregation": {
      "type": "keyword"
    },
    "id": {
      "type": "keyword"
    },
    "customerId": {
      "type": "keyword"
    }
  }
}

Sample query

{
  "query" : {
    "constant_score" : {
      "filter" : {
        "bool" : {
          "must" : [
            {
              "term" : {
                "customerId" : {
                  "value" : "1",
                  "boost" : 1.0
                }
              }
            }
          ]
        }
      },
      "boost" : 1.0
    }
  },
  "_source" : false,
  "aggregations" : {
    "nestedAggregation" : {
      "nested" : {
        "path" : "aggregatedData"
      },
      "aggregations" : {
        "nestedFilter" : {
          "filter" : {
            "bool" : {
              "must" : [
                {
                  "term" : {
                    "currencyCode" : {
                    "value" : "XYZ"
                    }
                  }
                }
              ],
              "disable_coord" : false,
              "adjust_pure_negative" : true,
              "boost" : 1.0
            }
          },
          "aggregations" : {
            "termAggreagtionWillBeAppliedOnThisField" : {
              "terms" : {
                "field" : "aggregatedData.termAggreagtionWillBeAppliedOnThisField",
                "size" : 2,
                "order" : [
                  {
                    "_term" : "desc"
                  }
                ]
              },
              "aggregations" : {
                "customerCount" : {
                  "sum" : {
                    "field" : "aggregatedData.customerNumber"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "ext" : { }
}

Is there any good news?

@akashmpatel91 You and your solution are lifesavers. I had been looking for a solution for the last few months, and we had decided to stay on ES 2.x because of this issue. Now we can easily move to ES 5.x by keeping a document model similar to yours.

Bunch of thanks to you from our entire team.
