Elasticsearch version (bin/elasticsearch --version): 6.0.0-rc1
JVM version (java -version):
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux elasticsearch-data-hot-003 4.11.0-1013-azure #13-Ubuntu SMP Mon Oct 2 17:59:06 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
When running a bucket_script aggregation that depends on a cumulative_sum aggregation over another sum aggregation, if the sum aggregation has no value for a bucket (because there are no docs in that time-interval bucket), the bucket_script aggregation also returns no value for that bucket, instead of using the cumulative_sum value gathered so far.
Steps to reproduce:
{
  "@timestamp": "",
  "bytes": 100
}
{
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1m",
        "min_doc_count": 0,
        "time_zone": "UTC"
      },
      "aggs": {
        "sum_bytes": {
          "sum": {
            "field": "bytes"
          }
        },
        "cumulative_bytes": {
          "cumulative_sum": {
            "buckets_path": "sum_bytes"
          }
        },
        "bucket": {
          "bucket_script": {
            "buckets_path": {
              "bytes": "cumulative_bytes"
            },
            "script": {
              "source": "params.bytes",
              "lang": "painless"
            }
          }
        }
      }
    }
  }
}
The bucket aggregation does not show the value that it's supposed to (the cumulative_bytes value).

@shaharmor Did you try setting a missing value on the date_histogram aggregation? GitHub issues are meant for bugs and feature requests, and this sounds more like a question that should be asked on the forum first.
I'm not sure how a missing value on the date_histogram would help, as there are no docs to add the missing field to...
Anyway, I created a forum thread as well: https://discuss.elastic.co/t/bucket-script-fails-when-some-docs-are-missing/107592/1
This is actually a bug.
It seems that the bucket_script aggregation is not executed on buckets that have a doc count of zero. The following recreation script highlights this:
PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "bytes": {
          "type": "long"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "@timestamp": "2017-01-01T00:00:00",
  "bytes": 100
}

PUT test/doc/2
{
  "@timestamp": "2017-01-01T00:05:00",
  "bytes": 100
}
GET test/_search
{
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1m",
        "min_doc_count": 0,
        "time_zone": "UTC"
      },
      "aggs": {
        "sum_bytes": {
          "sum": {
            "field": "bytes"
          }
        },
        "cumulative_bytes": {
          "cumulative_sum": {
            "buckets_path": "sum_bytes"
          }
        },
        "bucket": {
          "bucket_script": {
            "buckets_path": {
              "bytes": "cumulative_bytes"
            },
            "script": {
              "source": "params.bytes",
              "lang": "painless"
            }
          }
        }
      }
    }
  }
}
The response from the search request is:
{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "timeseries": {
      "buckets": [
        {
          "key_as_string": "2017-01-01T00:00:00.000Z",
          "key": 1483228800000,
          "doc_count": 1,
          "sum_bytes": { "value": 100 },
          "cumulative_bytes": { "value": 100 },
          "bucket": { "value": 100 }
        },
        {
          "key_as_string": "2017-01-01T00:01:00.000Z",
          "key": 1483228860000,
          "doc_count": 0,
          "sum_bytes": { "value": 0 },
          "cumulative_bytes": { "value": 100 }
        },
        {
          "key_as_string": "2017-01-01T00:02:00.000Z",
          "key": 1483228920000,
          "doc_count": 0,
          "sum_bytes": { "value": 0 },
          "cumulative_bytes": { "value": 100 }
        },
        {
          "key_as_string": "2017-01-01T00:03:00.000Z",
          "key": 1483228980000,
          "doc_count": 0,
          "sum_bytes": { "value": 0 },
          "cumulative_bytes": { "value": 100 }
        },
        {
          "key_as_string": "2017-01-01T00:04:00.000Z",
          "key": 1483229040000,
          "doc_count": 0,
          "sum_bytes": { "value": 0 },
          "cumulative_bytes": { "value": 100 }
        },
        {
          "key_as_string": "2017-01-01T00:05:00.000Z",
          "key": 1483229100000,
          "doc_count": 1,
          "sum_bytes": { "value": 100 },
          "cumulative_bytes": { "value": 200 },
          "bucket": { "value": 200 }
        }
      ]
    }
  }
}
In the response above, the buckets with keys 2017-01-01T00:01:00.000Z through 2017-01-01T00:04:00.000Z should each have a bucket sub-aggregation with a value of 100 (the same as the cumulative_sum aggregation, since the script just outputs that value).
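For comparison, here is a small Python sketch (a toy model, not Elasticsearch code) of what the pipeline should produce: a running total over the per-bucket sums, echoed by a script equivalent to "params.bytes", yields a value for every bucket, empty or not.

```python
def cumulative_sum(bucket_sums):
    """Running total over the per-bucket sum_bytes values."""
    total = 0.0
    out = []
    for s in bucket_sums:
        total += s
        out.append(total)
    return out

# sum_bytes per minute bucket, taken from the recreation above
sum_bytes = [100, 0, 0, 0, 0, 100]
cumulative = cumulative_sum(sum_bytes)

# The script "params.bytes" just echoes the cumulative value, so the
# expected "bucket" value exists for every bucket, including empty ones.
expected_bucket = list(cumulative)
# cumulative is [100.0, 100.0, 100.0, 100.0, 100.0, 200.0]; the bug is
# that the real bucket_script output is missing for the four empty buckets.
```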
Hi @colings86 Could you please provide some instructions to fix this? Thanks in advance!
I just took a look at the code around this, and I think it's going to need some thought about how we can fix this bug without adversely affecting other pipeline aggregations. The problem arises when we resolve the value of the buckets_path for an empty bucket in BucketHelpers.resolveBucketValue here (https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/search/aggregations/pipeline/BucketHelpers.java#L174):
if (Double.isInfinite(value) || Double.isNaN(value) || (bucket.getDocCount() == 0 && !isDocCountProperty)) {
    switch (gapPolicy) {
        case INSERT_ZEROS:
            return 0.0;
        case SKIP:
        default:
            return Double.NaN;
    }
} else {
    return value;
}
That if statement returns Double.NaN (because the gap policy is SKIP) for any empty bucket, where an empty bucket is defined as one with a doc count of zero.
In BucketScriptPipelineAggregator.reduce() we don't execute on buckets which resolve to Double.NaN, so the bucket is skipped.
BucketHelpers.resolveBucketValue() is used by all of the pipeline aggregations so we need to be careful that any changes do not adversely affect the other pipeline aggregations.
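To make the skip behaviour concrete, here is a rough Python rendering of that logic (names simplified; this is an illustration of the branch above, not the actual Elasticsearch code):

```python
import math

def resolve_bucket_value(value, doc_count, gap_policy="skip",
                         is_doc_count_property=False):
    """Mirror of the branch in BucketHelpers.resolveBucketValue:
    an empty bucket (doc_count == 0) is treated as a gap, even if the
    pipeline metric being resolved (e.g. cumulative_sum) has a value."""
    if math.isinf(value) or math.isnan(value) or (
            doc_count == 0 and not is_doc_count_property):
        if gap_policy == "insert_zeros":
            return 0.0
        return float("nan")  # "skip" and the default case
    return value

# A non-empty bucket passes its value through unchanged...
print(resolve_bucket_value(100.0, doc_count=1))   # 100.0
# ...but an empty bucket yields NaN even though cumulative_sum was 100,
# which is why bucket_script never runs for it.
print(resolve_bucket_value(100.0, doc_count=0))   # nan
```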
We could potentially change the bucket_script aggregation so it executes on all buckets regardless of the retrieved value, and also pass the doc count of the current bucket to the script so it can determine what value to output, but I'm worried that this might make the aggregation too unwieldy for the user.
I'll mark this issue as discuss for now as I think we need to decide on an approach before it is implemented.
Discussed in FixItFriday, and we decided that this requires more in-depth discussion within the Search and Aggregations team.
Discussed with the search and aggs team, and we decided that we should pass the actual value to the script (instead of converting it to NaN or 0.0 using the gap policy) and also pass the doc count of the bucket to the script, so the user writing the script can know whether the bucket is empty and decide how to interpret the value.
We should ensure that the solution to this issue also fixes the issue in https://github.com/elastic/elasticsearch/issues/27544
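A sketch of the agreed direction (my reading of the discussion above, not an actual patch): execute the script on every bucket, handing it the raw resolved value plus the bucket's doc count, and let the script decide how to treat empty buckets.

```python
def run_bucket_script(buckets, script):
    """Execute the script on every bucket, empty or not, passing both
    the raw resolved value and the bucket's doc count."""
    return [script(value=b["value"], doc_count=b["doc_count"])
            for b in buckets]

# Cumulative values from the recreation, including an empty middle bucket.
buckets = [
    {"value": 100.0, "doc_count": 1},
    {"value": 100.0, "doc_count": 0},  # empty bucket keeps its cumulative value
    {"value": 200.0, "doc_count": 1},
]

# A script equivalent to "params.bytes", now free to inspect doc_count
# if it wants to treat empty buckets differently.
results = run_bucket_script(buckets, lambda value, doc_count: value)
# results: [100.0, 100.0, 200.0] - no bucket is silently skipped.
```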
@colings86 Is this something that is being actively worked on? Any ETA?
@shaharmor It's not being actively worked on right now, but since we now have a way forward it is available to be picked up and worked on. There is no ETA for this currently.
@elastic/es-search-aggs
For anyone stumbling upon the same issue with bucket_selector aggregations, here is my workaround:
I will use the example from the docs:
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "total_sales": {
          "sum": {
            "field": "price"
          }
        },
        "sales_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalSales": "total_sales"
            },
            "script": "params.totalSales < 200"
          }
        }
      }
    }
  }
}
The issue: the script will select only non-empty buckets. If you try something like params.totalSales == null || params.totalSales < 200, it still won't select the empty ones.
Workaround is to make use of the special _count path:
"sales_bucket_filter": {
  "bucket_selector": {
    "buckets_path": {
      "totalSales": "total_sales",
      "bucketCount": "_count"
    },
    "script": "params.bucketCount == 0 || params.totalSales < 200"
  }
}
At least in 6.3, this works for bucket_selector, but unfortunately not for bucket_script. I can't say for other pipeline aggs.
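A toy Python model (not Elasticsearch code) of why the _count workaround helps: because doc count is always available to the script, the doc-count check short-circuits before the gap-policy-affected metric value matters, so empty buckets can still be selected.

```python
def select_buckets(buckets, script):
    """Keep only the buckets for which the selector script returns True."""
    return [b for b in buckets if script(b)]

# Monthly buckets from the docs example; the middle month is empty.
buckets = [
    {"doc_count": 3, "total_sales": 150.0},
    {"doc_count": 0, "total_sales": 0.0},   # empty month
    {"doc_count": 5, "total_sales": 550.0},
]

# Equivalent of "params.bucketCount == 0 || params.totalSales < 200":
kept = select_buckets(
    buckets,
    lambda b: b["doc_count"] == 0 or b["total_sales"] < 200,
)
# kept contains the 150.0 month and the empty month; the 550.0 month
# is filtered out.
```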
Is this being fixed in 7?
Hi @shaharmor, I'm afraid there is still no progress on this issue. It's still on our radar, but we don't currently have anyone working on it at this time. We'll update this ticket when there's movement.