Elasticsearch version: 7.4.2
Plugins installed: [repository-s3, discovery-ec2]
JVM version: Bundled version
OS version: 4.14.138-114.102.amzn2.x86_64 (Amazon Linux 2)
Description of the problem including expected versus actual behavior:
Running a date_histogram aggregation on a date_range field where documents have a null "lte" appears to cause the aggregation to create an effectively unbounded number of buckets, and I get an out-of-memory error. The aggregation works fine for fully defined date ranges (the first document in my example), but I have a large number of documents with an undefined end point ("lte" is null, as in the second document). I have been looking for a way to limit the buckets created by the query, but without luck. I've also tried a date_range aggregation on the field to limit the potential buckets, but that caused cast exceptions.
Basically, I'm attempting to find all of the months up to now that these documents cover or touch. I'd like to see buckets from 2017-10-01 to 2019-12-01 (as of the time of writing).
Steps to reproduce:
PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "active_range": {
        "type": "date_range",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

POST test/_doc/
{
  "active_range": {
    "gte": "2017-10-10",
    "lte": "2018-10-10"
  }
}

POST test/_doc/
{
  "active_range": {
    "gte": "2017-10-10",
    "lte": null
  }
}

GET test/_search
{
  "size": 0,
  "aggs": {
    "active": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "month"
      }
    }
  }
}
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
/cc @not-napoleon mind taking a look at this when you get a chance?
Yeah, I definitely see why that would happen. It's a little unclear to me what the correct behavior would be though.
In this case, it seems like we want to default the end date to now(), but I'm not sure that's a good general behavior. What happens if the start point is null: do we default that to now() as well? What happens if the default generates an invalid range: do we just skip that document? And what about open-ended numeric ranges, or IPs? (Although, if there's a sensible solution for dates that doesn't generalize to other range types, I have no problem just fixing this for DateHistogram.)
missing doesn't currently let you specify a range, but this seems almost like missing behavior. Just thinking out loud: if we let users specify a range value for missing, and then partially applied it when the start or end point was null (and used the whole range when the value was actually missing), that might give users some ability to control this; something like the sketch below. We'd still need to deal with invalid ranges in the aggregator, though. Unfortunately, getting missing to support complex types would be a pretty major change. It's something I'd like to do, but I don't have a plan for it right now.
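As a purely hypothetical illustration (none of this syntax exists today; missing only accepts scalar values), that might look something like:

"date_histogram": {
  "field": "active_range",
  "calendar_interval": "month",
  "missing": {
    "gte": "2017-10-10",   // hypothetical: a range-valued "missing",
    "lte": "2019-12-01"    // applied per endpoint when one side is null
  }
}

A document with a null "lte" would then get only the "lte" default applied, while a document missing the field entirely would use the whole range.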
I'm hesitant to add an explicit option (two options really, for start and end) for this case. That creates a situation where we have options on the aggregation that only apply sometimes, or worse we end up needing a new aggregation entirely. One of our goals with adding range support was to minimize API footprint as much as possible, so in that spirit, I'd like to keep adding a new option as a last resort.
TL;DR - if there's a sensible default value, this is a relatively straightforward fix, but I'm not convinced there is one. At an absolute minimum, we should skip these documents instead of trying to create infinite buckets.
@jcrapuchettes I ran your test case locally, and it hit a CircuitBreakerException, not an actual OutOfMemoryError - can you confirm this is what you're seeing? To be clear, the circuit breaker should kick in before it actually runs out of memory, and is a well-handled error case.
@not-napoleon I do get a CircuitBreakerException.
Having a default value of now() for missing would be great for my case, but I can understand the difficulty of dealing with all cases. Is there any way to use a script or something else to mimic the default value in current versions? Our queries are really big and ugly since we can't use the date_histogram.
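For illustration, emulating even a couple of the histogram's buckets by hand looks something like this (a sketch of the kind of query meant above, using a filters aggregation with one range filter per month and the range query's relation parameter):

GET test/_search
{
  "size": 0,
  "aggs": {
    "active_months": {
      "filters": {
        "filters": {
          "2019-10": { "range": { "active_range": { "gte": "2019-10-01", "lte": "2019-10-31", "relation": "intersects" } } },
          "2019-11": { "range": { "active_range": { "gte": "2019-11-01", "lte": "2019-11-30", "relation": "intersects" } } }
        }
      }
    }
  }
}

Now imagine that repeated for every month since 2017.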
Scripting would be a good solution, but we don't currently have support for getting range values from scripts in aggregations. I've opened an issue for this (#50190) if you want to follow the discussion there. Unfortunately, I don't think there's a scripting-based workaround in 7.4.
I've been experiencing a very similar issue.
I have ~180M documents spanning roughly the past 4 years. Each document has a date_range field that can span anywhere from a few days to a few years. When I run a date_histogram on the range field, I always get buckets for _every_ date covered by the matching documents, and there is no way to restrict the output to a subset of those buckets.
When bucketing by month it isn't so bad; I just get more data back than I need (all months between the minimum and maximum months in the matching documents). A real problem comes up, however, when I aggregate by day. A query where I'm trying to see data for, say, 30 days will actually produce buckets for _all_ days in the matching documents' ranges, which can be upward of 1500 buckets, one for each day in a document's range (not my query range). Most of the time this results in a timeout in Elasticsearch and I never actually get data back.
This issue is preventing me from adopting the date_histogram aggregation for date_range fields. If there were something like extended_bounds that could stop the aggregation from creating buckets that don't fall in the bounded range, that would solve both my problem and the OP's, I believe.
An example query I would be running (using the same field name as the OP) would be something like this:
{
  "track_total_hits": true,
  "query": {
    "range": {
      "active_range": {
        "gte": "2016-11-01",
        "lte": "2016-11-30"
      }
    }
  },
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "day",
        "min_doc_count": 1,
        "extended_bounds": {
          "min": "2016-11-01",
          "max": "2016-11-30"
        }
      }
    }
  }
}
Since documents from back in 2016 could still be "active" today, the ES query generates all buckets from the first day in the documents' ranges all the way to today.
It _feels_ like extended_bounds could tell Elasticsearch not to aggregate buckets outside the specified range, but the documentation explicitly says it does not do that. So either extended_bounds could start doing that (a breaking change), or we could introduce another field, bounds or something like it, as a hard enforcement at aggregation time.
A script might be a good solution, but a field similar to extended_bounds feels like a better fit for this type of aggregation. Alternatively, there could be a way to tell bucket aggregations to ignore certain buckets at aggregation time (like a bucket_selector aggregation, only operating before the aggregation instead of after), which could be useful beyond just these range histogram aggregations.
I discussed this with @polyfractal today, and we're not opposed to adding a flag to make the extended_bounds values hard limits on what buckets are computed. Something like:
"extended_bounds": {
"min": "2016-11-01",
"max": "2016-11-30",
"hard_limit": true
}
(Exact option name still TBD.) This would return exactly the buckets between the given min and max: it would keep the current behavior of adding empty buckets to fill out the range, but would also clip results that fall outside of it. This is not useless for non-range cases either, since it's cleaner syntax than the current recommendation of adding a filter query. Would something along those lines solve your difficulty?
@not-napoleon that solution would fit my use case perfectly. Thanks for taking the time to discuss this, it'll simplify what I'm currently doing significantly.
@not-napoleon would it still count values with an open end toward every bucket valid for the given start date, or filter them out?
@krlm: @not-napoleon can correct me if I get this wrong, but I believe any unbounded interval on the "edge" will still be included in the final bucket on that side. It will just stop the aggregation from continuing to "fill out" the histogram with buckets that match the unbounded, infinite interval.
Yeah, that's more or less what I was thinking. Currently it just adds buckets until it adds the bucket that contains the end point. If hard-limit extended bounds were set, it would stop at that limit instead. Basically, you'd set the hard limit and get exactly those buckets, even if your ranges fell outside of them. You'd still have the existing extended_bounds behavior where we add empty buckets as needed to fill out that range, too.
@polyfractal @not-napoleon, thanks for the info. I forgot that, by default, there are no upper or lower boundaries for the date histogram aggregation (I've defined them in the filter part of my query). This should do the job then.
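Putting the pieces together, the combined query under the proposal would presumably look something like this (hard_limit is still the hypothetical option name from the discussion above): the query clause picks the matching documents, and the hard-limited bounds clip the buckets to exactly the requested window, with empty buckets filled in as needed.

{
  "size": 0,
  "query": {
    "range": {
      "active_range": {
        "gte": "2016-11-01",
        "lte": "2016-11-30"
      }
    }
  },
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "day",
        "extended_bounds": {
          "min": "2016-11-01",
          "max": "2016-11-30",
          "hard_limit": true
        }
      }
    }
  }
}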
I've had the same issue and would love to see support for date histograms on open-ended date range fields.
The way I'm dealing with this right now is by using scripting (and #50190 will definitely help here, too). It's clunky and not flexible, but it works for what I need to do.
In my documents, I have a date_range field (say, period) but also two separate date fields for the start and end of the range, and sometimes the latter can be null. So in my date_histogram aggregation I can deal with the open-ended situation by defaulting to now if the end date is missing, i.e. if the period is open-ended toward the future (it would work the same toward the past).
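For reference, a mapping along those lines might look like this (a sketch: the start_date and end_date field names come from the script below, the rest is assumed):

PUT my-index
{
  "mappings": {
    "properties": {
      "period": { "type": "date_range" },
      "start_date": { "type": "date" },
      "end_date": { "type": "date" }
    }
  }
}

With that in place, the aggregation looks like this: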
POST my-index/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "histo": {
      "date_histogram": {
        "script": """
          def start = doc['start_date'].value;
          def end = null;
          if (doc['end_date'].size() > 0) {
            end = doc['end_date'].value;
          } else {
            // end date is missing: default to now
            end = Instant.ofEpochMilli(new Date().getTime()).atZone(ZoneId.of("UTC"));
          }
          // build the array of month buckets from start to end
          def buckets = [start];
          def months = ChronoUnit.MONTHS.between(start, end) + 1;
          while (months > 0) {
            start = start.plusMonths(1);
            buckets.add(start);
            months--;
          }
          // return the date buckets
          return buckets;
        """,
        "interval": "month"
      }
    }
  }
}
As I said, it's clunky, but it does return all the date buckets I'm interested in (either from start to end, or from start to now if end is null). Of course, depending on the interval type, the code needs to be adapted.
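For example, a daily variant (hypothetical, adapted from the loop above) would swap the month arithmetic for days and use "interval": "day":

// hypothetical daily variant of the month loop above
def buckets = [start];
def days = ChronoUnit.DAYS.between(start, end) + 1;
while (days > 0) {
  start = start.plusDays(1);
  buckets.add(start);
  days--;
}
return buckets;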
Just thought I'd share in case it helps anyone until a real fix is available.
Here is an interesting question about hard bounds. With extended bounds, the max bound is inclusive: if we have an hourly histogram and the max is 10:00:00, the 10:00:00-10:59:59 bucket will be included, which makes sense since we are extending the bounds. Hard bounds, on the other hand, are limiting: if the max hard bound is 10:00:00, the 10:00:00-10:59:59 bucket falls mostly outside the bounds.