https://www.elastic.co/guide/en/elasticsearch/reference/6.1/common-options.html#time-units
Time Units
Whenever durations need to be specified, e.g. for a timeout parameter, the duration must specify the unit, like 2d for 2 days. The supported units are:
Key | Represents
-- | --
d | days
h | hours
m | minutes
s | seconds
ms | milliseconds
micros | microseconds
nanos | nanoseconds
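For illustration only, here is a minimal Java sketch of what the table above amounts to (not Elasticsearch's actual parsing code): every supported unit is a fixed multiplier onto milliseconds, so 2d is just 2 × 86400000 ms.

```java
import java.util.Map;

// Minimal sketch (not Elasticsearch's parser): each supported unit is a
// fixed number of milliseconds, so "2d" is simply 2 * 86_400_000 ms.
public class DurationSketch {
    private static final Map<String, Long> UNIT_MILLIS = Map.of(
            "d", 86_400_000L,
            "h", 3_600_000L,
            "m", 60_000L,
            "s", 1_000L,
            "ms", 1L);

    static long toMillis(String value) {
        // split the numeric prefix from the unit suffix, e.g. "2d" -> 2 and "d"
        int i = 0;
        while (i < value.length() && Character.isDigit(value.charAt(i))) {
            i++;
        }
        long amount = Long.parseLong(value.substring(0, i));
        Long unit = UNIT_MILLIS.get(value.substring(i));
        if (unit == null) {
            throw new IllegalArgumentException("unsupported unit in " + value);
        }
        return amount * unit;
    }

    public static void main(String[] args) {
        System.out.println(toMillis("2d"));  // 172800000
        System.out.println(toMillis("30s")); // 30000
    }
}
```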
The issue is that the table above is incomplete and inaccurate. Units below ms make no sense because we do not represent dates at a resolution finer than milliseconds. Both 1micros and 1nanos (thankfully) result in "Zero or negative time interval not supported".
However, there are other units that sometimes work, but for some reason not at other times. For example:
Key | Represents
-- | --
y | years
year | years
M | months
month | months
w | weeks
week | weeks
quarter | 3 months
1y works, but 2y does not. 1M works, but 2M does not. If I recall correctly, the reasoning is that the length of these units changes depending on where you are in the calendar.
In some ways, I can appreciate that 2y and 2M are not consistent intervals (because of leap years and inconsistent month lengths). But weeks are consistent, and quarters are only fixed if you define Q1 as January - March (etc.), which I feel 3M could do too.
Regardless, if we are going to block higher number ranges, then we should look at blocking all number ranges for a given time unit _and_ we need to drop units that simply do not work.
/cc @colings86 I remember talking to you about this a few years ago.
This is tricky and there are a couple of things that I should explain here and we should explain (or explain better) in the docs.
Tl;dr: intervals are hard because time zones are hard, and we have made some trade-offs which we should probably clarify better in the docs (though it's hard to do so without writing a book on just this subject). For the details of why, keep reading the wall of text.
Firstly, there is a fundamental difference between what the interval 1X (where X is one of the keys above) does compared to the interval nX (where n is an integer greater than 1). The 1X variant is a calendar interval, meaning that it takes into account the length of that unit at that exact point in the calendar[1]. For example, if the interval is 1d, the timezone is Europe/London and the bucket represents 2018-03-25T00:00:00, it will take into account the clocks moving forward for Daylight Savings Time: the bucket will represent 23 hours instead of 24 hours, so the next bucket still begins at 2018-03-26T00:00:00. The nX variant (where n > 1) is a fixed-duration interval, meaning that the length of the interval is independent of the calendar[1], so the length of the interval is fixed. Using the same example as before, if we had a fixed-duration interval of 1 day[2], all buckets would represent a duration of 24 * 60 * 60 * 1000 = 86400000 milliseconds, so the 2018-03-25T00:00:00 bucket would be "24 hours" long and the next bucket would actually begin at 2018-03-26T01:00:00 local time, since that is 24 hours of elapsed time after the start of the previous bucket.
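The difference is easy to reproduce with java.time (a quick illustrative sketch, not Elasticsearch code): adding one calendar day keeps the local wall-clock time across the DST change, while adding a fixed 24 hours lands an hour later.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstDemo {
    public static void main(String[] args) {
        // Europe/London springs forward on 2018-03-25, so that local day is only 23 hours long
        ZonedDateTime start = ZonedDateTime.of(
                LocalDate.of(2018, 3, 25), LocalTime.MIDNIGHT, ZoneId.of("Europe/London"));

        // calendar interval: one calendar day later, still local midnight
        System.out.println(start.plusDays(1));               // 2018-03-26T00:00+01:00[Europe/London]

        // fixed-duration interval: exactly 24 hours of elapsed time
        System.out.println(start.plus(Duration.ofHours(24))); // 2018-03-26T01:00+01:00[Europe/London]
    }
}
```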
It should also be noted that calendar intervals can differ from fixed-duration intervals at pretty much every level for some arbitrary date and time zone; the only levels at which I have not come across a difference are milliseconds and seconds (a second is always 1000 milliseconds in every calendar[1] system I have seen, but that may not be universally true). Let me show some examples at different levels. Note that differences between calendar and fixed-duration intervals at one particular level bleed into all the levels above it.
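For a flavour of the kind of divergence meant, a couple of quick java.time checks (illustrative only): a calendar month or year is not a fixed number of days, so it cannot be a fixed number of milliseconds either.

```java
import java.time.Year;
import java.time.YearMonth;

public class LevelDemo {
    public static void main(String[] args) {
        // a calendar month is 28, 29, 30 or 31 days, not a fixed duration
        System.out.println(YearMonth.of(2018, 2).lengthOfMonth()); // 28
        System.out.println(YearMonth.of(2018, 1).lengthOfMonth()); // 31

        // a calendar year is 365 or 366 days
        System.out.println(Year.of(2016).length()); // 366
        System.out.println(Year.of(2018).length()); // 365
    }
}
```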
There are few to no rules about what a particular country (or part of a country, etc.) is allowed to do when it comes to changing its time zone offset. Countries do change time zone offsets, both as one-off cases and as regular occurrences, and these shifts can be, and have been, by awkward amounts, like shifting the offset by 45 minutes. There are also no hard and fast rules about when a time zone offset shift might occur. Most shifts are performed at a time far enough into the day that the offset will not cause it to jump back into the previous day or forward into the next day, but there are of course exceptions, like America/St_Johns. This lack of rules means that you really cannot make any irrefutable assumptions about the duration of any interval.
The outcome of all of the above is that we had to decide not to support fixed-duration intervals of the form nX in units larger than days, and instead to only allow the 1X form, since that uses calendar intervals. Honestly, even the decision to allow the unit days was hard, because it causes divergence even in common timezones (it's an hour out for half of the year), but it's also one of the most common units for wanting to express an nX interval. It should still be noted that you might see divergence from the calendar intervals you expect in smaller units too, but these cases are much rarer and the divergence tends to accumulate more slowly.
**So why don't we just support calendar intervals of the form nX?**

This is a good question, and one I have asked multiple times before. I also decided about 1.5 years ago that it was completely possible and set about implementing it. The problem is that the way the Joda-Time date library (and probably java.time too, though I haven't looked yet) works messes things up when n is an awkward value.
The way I went about it was the same as how we implement calendar quarters. For quarters, we round the date to the nearest month and then divide that value by 3 to get the quarter for the date. This works perfectly for quarters, but only because there are always exactly 4 quarters in every year, a quarter is always exactly 3 months long, and a year is always exactly 12 months long.
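A rough sketch of that quarter trick (hypothetical Java, not the actual implementation): reduce the date to whole months since the epoch, then floor-divide by 3.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class QuarterSketch {
    // months elapsed since 1970-01, then floor-divide by 3 to get a quarter index
    static long quarterIndex(LocalDate date) {
        long monthsSinceEpoch =
                ChronoUnit.MONTHS.between(LocalDate.of(1970, 1, 1), date.withDayOfMonth(1));
        return Math.floorDiv(monthsSinceEpoch, 3);
    }

    public static void main(String[] args) {
        System.out.println(quarterIndex(LocalDate.of(2018, 3, 31))); // 192 -> Q1 2018
        System.out.println(quarterIndex(LocalDate.of(2018, 4, 1)));  // 193 -> Q2 2018
    }
}
```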
So really this works because the number of nX intervals in the next interval up is a whole number and doesn't change over time (i.e. it is independent of where you are in the calendar[1]). The problem is that this assumption does not hold for arbitrary n and X, so you end up with most of the buckets being, say, 3 days long but then a 1-day bucket at the end of the month or the year to make up the remainder, and that's no good because the user asked for 3-day buckets, not "mostly 3-day buckets with the odd bucket that is 1 or 2 days". What you need to know in this case is the number of calendar[1] days[4] since 1970-01-01T00:00:00.000 and then work out the buckets from there, but this is 1) not quick to calculate and 2) would need some extra fiddling to ensure that the date histogram started at the beginning of a 3-day bucket and not halfway through (since most users would want the start of their 3-day buckets to align with the beginning of their data rather than with some arbitrary Thursday in 1970).
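As a hedged sketch of what counting calendar[1] days since the epoch might look like (illustrative only; alignment and performance concerns are ignored): resolve the timestamp to a local date in the target time zone first, then count days and divide.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

public class ThreeDayBucketSketch {
    // Which 3-day bucket (counted in calendar days from 1970-01-01 in the given zone)
    // does this instant fall into? Illustrative only; offset handling omitted.
    static long threeDayBucket(Instant instant, ZoneId zone) {
        LocalDate localDate = instant.atZone(zone).toLocalDate();
        long calendarDaysSinceEpoch = localDate.toEpochDay(); // calendar days, not millis / 86400000
        return Math.floorDiv(calendarDaysSinceEpoch, 3);
    }

    public static void main(String[] args) {
        System.out.println(
                threeDayBucket(Instant.parse("2018-03-26T00:30:00Z"), ZoneId.of("Europe/London")));
    }
}
```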
**TimeUnit is not just for date histograms**

The time units page is for all places in Elasticsearch where you can specify a time period. Outside of the date histogram this is basically just in settings such as the refresh interval. In these places we only care about it being a fixed-duration interval. We don't allow interval keys larger than days here because of the problems listed above, so keeping the intervals at days or lower allows the user to predict the interval length better. If we allowed months or years here, the interval would diverge from what a user might expect by too much. Again, the days unit is a hard one here because it can and does diverge from the calendar[1] interval by an hour for 6 months of every year in some common time zones.
**So is all lost with solving this?**

No. I think there are some things we can do:

- Supporting nX calendar[1] intervals: this is related to https://github.com/elastic/elasticsearch/pull/26659 so I want to get that change in first. Essentially the idea is that a user specifies that they want nX calendar[1] interval buckets from the date histogram. We collect 1X interval buckets and, after we have performed the reduce, we merge each n consecutive buckets into a single bucket (a rough sketch of this merge step follows the footnotes below). The cost of the aggregation in terms of memory and time would be equivalent to asking for 1X buckets, but the user would be returned the nX buckets they wanted.

If you got here then I am amazed, thanks for reading. Hopefully the above makes some sense.
[1] Here I am considering time zone to be a property of the calendar.
[2] You can't actually do this in the date histogram as the n in nX must be greater than 1, but I am breaking this rule here so the examples can be easily compared.
[3] Please note that UTC is different to GMT (this is confused in a lot of places)
[4] Note this is calendar[1] days, so just dividing the current time in millis by 86400000 won't cut it, for all the previously mentioned reasons about a day not always being 86400000 milliseconds long.
[5] It may be theoretically possible, though I'm not sure
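As mentioned above, a rough sketch of the bucket-merging idea (hypothetical types, nothing from the actual aggregation framework): collect 1X buckets, then after the reduce combine every n consecutive buckets into one.

```java
import java.util.ArrayList;
import java.util.List;

public class BucketMergeSketch {
    // Hypothetical bucket: just a key (the start of the bucket) and a document count.
    record Bucket(String key, long docCount) {}

    // Merge every n consecutive 1X buckets into a single nX bucket.
    static List<Bucket> mergeEvery(List<Bucket> oneXBuckets, int n) {
        List<Bucket> merged = new ArrayList<>();
        for (int i = 0; i < oneXBuckets.size(); i += n) {
            long count = 0;
            for (int j = i; j < Math.min(i + n, oneXBuckets.size()); j++) {
                count += oneXBuckets.get(j).docCount();
            }
            // the merged bucket takes the key of the first bucket in the group
            merged.add(new Bucket(oneXBuckets.get(i).key(), count));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Bucket> monthly = List.of(
                new Bucket("2018-01", 10), new Bucket("2018-02", 7), new Bucket("2018-03", 12),
                new Bucket("2018-04", 5), new Bucket("2018-05", 9), new Bucket("2018-06", 4));
        System.out.println(mergeEvery(monthly, 3)); // two quarterly buckets: 29 and 18 docs
    }
}
```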
What @colings86 said. Nothing longer than a second has a constant length in seconds, so you can't just map things to an equivalent number of seconds (milliseconds, whatever) without losing information. ISO8601 has a nice model for durations (except quarters) that's way too complicated for time units for timeouts etc. but might be a useful guide for more human-facing things.
I wholeheartedly believe that the correct model for things that are supposed to be a whole number of days should be based on counting days, as an integer, and not by mucking around with timestamps that may or may not represent midnight in some timezone or other. I don't understand why things like Joda make this so hard by trying to combine the notion of a day (discrete values) and a time (essentially continuous). They're different things.
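For what it's worth, java.time already draws this line along ISO 8601's model (a quick illustration, not a proposal for how Elasticsearch should implement it): Period is date-based and counts days/months/years as discrete values, while Duration is an exact number of seconds, and the two deliberately don't convert into one another.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.Period;

public class PeriodVsDuration {
    public static void main(String[] args) {
        Period threeDays = Period.parse("P3D");              // a date-based amount: "3 calendar days"
        Duration seventyTwoHours = Duration.parse("PT72H");  // an exact amount of time

        // Adding a Period to a date counts days as discrete values...
        System.out.println(LocalDate.of(2018, 3, 24).plus(threeDays)); // 2018-03-27

        // ...while a Duration is always exactly 72 * 3600 seconds, DST or not.
        System.out.println(seventyTwoHours.getSeconds()); // 259200
    }
}
```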
> What you need to know in this case is the number of calendar[1] days[4] since 1970-01-01T00:00:00.000 and then work out the buckets from there, but this is 1) not quick to calculate
I'm not sure about this. Calculating the local timezone offset involves a little bit of a search but we have to do this anyway; calculating the number of days since 1970-01-01 is an addition and a division, given the offset.
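i.e. something like the following sketch, assuming the zone offset for the instant in question has already been resolved:

```java
public class DayCountSketch {
    // "an addition and a division": given the already-resolved zone offset
    // for the instant (offsetMillis), the local calendar day is just:
    static long calendarDaysSinceEpoch(long epochMillis, long offsetMillis) {
        long localMillis = epochMillis + offsetMillis;   // the addition
        return Math.floorDiv(localMillis, 86_400_000L);  // the division
    }

    public static void main(String[] args) {
        // 2018-03-25T23:30Z with a +01:00 offset is 00:30 local on 2018-03-26
        System.out.println(calendarDaysSinceEpoch(1_522_020_600_000L, 3_600_000L)); // 17616
    }
}
```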
> 2) would need some extra fiddling to ensure that the date histogram started at the beginning of a 3-day bucket
I think the ability to "offset" buckets like this would be very useful. For instance, the UK tax year is a whole year, but offset by -270 days so it always starts on 6 April. I can think of more exotic bucketing strategies that would be useful: for instance, some accounting techniques require years to be a whole number of weeks long, so most "years" are 52 weeks and then every few years there's a 53-week one to catch up. Other accounting techniques require years to be a whole number of 4-week periods long, so most years are 52 weeks long and then occasionally there's a 56-weeker to stop things getting too far out of kilter. The UK railway divides the year into 13 periods, the first of which starts on 1 April, the second starts on the Sunday before the first Thursday in May, and the rest start every 28 days after that. I could go on.
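The fixed-length cases could fold into the same day-counting scheme with a simple shift (a hypothetical sketch; calendar-aware variants like the 52/53-week accounting years would clearly need more than this):

```java
public class OffsetBucketSketch {
    // Hypothetical: fixed-length day buckets whose boundaries are shifted by offsetDays,
    // e.g. 7-day buckets that start on a Sunday instead of the epoch's Thursday.
    static long offsetBucket(long daysSinceEpoch, long bucketLengthDays, long offsetDays) {
        return Math.floorDiv(daysSinceEpoch - offsetDays, bucketLengthDays);
    }

    public static void main(String[] args) {
        // day 0 (1970-01-01, a Thursday) with a 3-day offset falls into bucket -1 rather than 0
        System.out.println(offsetBucket(0, 7, 3)); // -1
    }
}
```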
Since it's nearly renewal time for my pedantry badge this year and I'm a few points short, I'm compelled to add:
> A week is always 7 days
Except the weeks containing 6 or 8 days that happen when a country decides to move itself across the international date line. We don't talk about those, tho.
> A year is always exactly 12 months long
Unless we start to support aggregations based on things like the Hebrew calendar. They have leap _months_.
> February can have 29 days if the year is divisible by 4, unless it's divisible by 100 when it has 28 days, unless it's divisible by 400 when it has 29 days again
... and there was this one time in Sweden in 1712 when it had 30 days.
> The time units page is for all places in Elasticsearch where you can specify a time period. Outside of the date histogram this is basically just in settings such as the refresh interval.
There is also the watcher throttle_period, machine learning has query_delay, latency and bucket_span, and rollover has max_age. That's just a random pick, but quite a few places in the API take a time unit.
@elastic/es-search-aggs
Closing this in favour of https://github.com/elastic/elasticsearch/issues/29410 and https://github.com/elastic/elasticsearch/issues/29411, which have been created based on a FixitFriday discussion of how to progress this issue.