https://www.elastic.co/guide/en/elasticsearch/reference/6.1/common-options.html#time-units
Time Units
Whenever durations need to be specified, e.g. for a timeout parameter, the duration must specify the unit, like 2d for 2 days. The supported units are:
Key | Represents
-- | --
d | days
h | hours
m | minutes
s | seconds
ms | milliseconds
micros | microseconds
nanos | nanoseconds
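For illustration only, here is a minimal Java sketch of what the table above amounts to (not Elasticsearch's actual parsing code): every supported unit is a fixed multiplier onto milliseconds, so 2d is just 2 × 86400000 ms.

```java
import java.util.Map;

// Minimal sketch (not Elasticsearch's parser): each supported unit is a
// fixed number of milliseconds, so "2d" is simply 2 * 86_400_000 ms.
public class DurationSketch {
    private static final Map<String, Long> UNIT_MILLIS = Map.of(
            "d", 86_400_000L,
            "h", 3_600_000L,
            "m", 60_000L,
            "s", 1_000L,
            "ms", 1L);

    static long toMillis(String value) {
        // split the numeric prefix from the unit suffix, e.g. "2d" -> 2 and "d"
        int i = 0;
        while (i < value.length() && Character.isDigit(value.charAt(i))) {
            i++;
        }
        long amount = Long.parseLong(value.substring(0, i));
        Long unit = UNIT_MILLIS.get(value.substring(i));
        if (unit == null) {
            throw new IllegalArgumentException("unsupported unit in " + value);
        }
        return amount * unit;
    }

    public static void main(String[] args) {
        System.out.println(toMillis("2d"));  // 172800000
        System.out.println(toMillis("30s")); // 30000
    }
}
```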
The issue is that the table above is incomplete and inaccurate. Units below ms make no sense because we do not represent dates at a resolution finer than milliseconds. Both 1micros and 1nanos (thankfully) result in "Zero or negative time interval not supported".
However, there are other units that sometimes work, but for some reason not at other times. For example:
Key | Represents
-- | --
y | years
year | years
M | months
month | months
w | weeks
week | weeks
quarter | 3 months
1y works, but 2y does not. 1M works, but 2M does not. If I recall correctly, the reasoning is that the length of these units changes depending on where you are in the calendar.
In some ways, I can appreciate that 2y and 2M are not consistent intervals (because of leap years and inconsistent month lengths). But weeks are consistent, and quarters are only fixed if you define Q1 as January - March (etc.), which I feel 3M could do too.
Regardless, if we are going to block higher number ranges, then we should look at blocking all number ranges for a given time unit _and_ we need to drop units that simply do not work.
/cc @colings86 I remember talking to you about this a few years ago.
This is tricky and there are a couple of things that I should explain here and we should explain (or explain better) in the docs.
Tl;dr: intervals are hard because time zones are hard, and we have made some trade-offs which we should probably clarify better in the docs (though it's hard to do so without writing a book on just this subject). For the details of why, keep reading the wall of text.
Firstly, there is a fundamental difference between what the interval 1X (where X is one of the keys above) does compared to the interval nX (where n is an integer greater than 1). The 1X variant is a calendar interval, meaning that it takes into account the length of that unit at that exact point in the calendar[1]. For example, if the interval is 1d, the timezone is Europe/London and the bucket represents 2018-03-25T00:00:00, it will take into account the clocks moving forward for Daylight Savings Time: the bucket will represent 23 hours instead of 24 hours, so the next bucket still begins at 2018-03-26T00:00:00. The nX variant (where n > 1) is a fixed-duration interval, meaning that the length of the interval is independent of the calendar[1], so the length of the interval is fixed. Using the same example as before, if we had a fixed-duration interval of 1 day[2], all buckets would represent a duration of 24 * 60 * 60 * 1000 = 86400000 milliseconds, so the 2018-03-25T00:00:00 bucket would be "24 hours" long and the next bucket would actually begin at 2018-03-26T01:00:00 local time, since that is 24 hours of elapsed time after the start of the previous bucket.
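The difference is easy to reproduce with java.time (a quick illustrative sketch, not Elasticsearch code): adding one calendar day keeps the local wall-clock time across the DST change, while adding a fixed 24 hours lands an hour later.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstDemo {
    public static void main(String[] args) {
        // Europe/London springs forward on 2018-03-25, so that local day is only 23 hours long
        ZonedDateTime start = ZonedDateTime.of(
                LocalDate.of(2018, 3, 25), LocalTime.MIDNIGHT, ZoneId.of("Europe/London"));

        // calendar interval: one calendar day later, still local midnight
        System.out.println(start.plusDays(1));               // 2018-03-26T00:00+01:00[Europe/London]

        // fixed-duration interval: exactly 24 hours of elapsed time
        System.out.println(start.plus(Duration.ofHours(24))); // 2018-03-26T01:00+01:00[Europe/London]
    }
}
```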
It should also be noted that calendar intervals can differ from fixed-duration intervals at pretty much every level for some arbitrary date and time zone; the only levels at which I have not come across a difference are milliseconds and seconds (a second is always 1000 milliseconds in every calendar[1] system I have seen, but that may not be universally true). Let me show some examples at different levels. Note that differences between calendar and fixed-duration intervals at one particular level bleed into all the levels above it.
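For a flavour of the kind of divergence meant, a couple of quick java.time checks (illustrative only): a calendar month or year is not a fixed number of days, so it cannot be a fixed number of milliseconds either.

```java
import java.time.Year;
import java.time.YearMonth;

public class LevelDemo {
    public static void main(String[] args) {
        // a calendar month is 28, 29, 30 or 31 days, not a fixed duration
        System.out.println(YearMonth.of(2018, 2).lengthOfMonth()); // 28
        System.out.println(YearMonth.of(2018, 1).lengthOfMonth()); // 31

        // a calendar year is 365 or 366 days
        System.out.println(Year.of(2016).length()); // 366
        System.out.println(Year.of(2018).length()); // 365
    }
}
```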
There are few to no rules about what a particular country (or part of a country, etc.) is allowed to do when it comes to changing its time zone offset. Countries do change time zone offsets, both as one-off cases and as regular occurrences, and these shifts can be, and have been, by awkward amounts, like shifting the offset by 45 minutes. There are also no hard and fast rules about when a time zone offset shift might occur. Most shifts are performed at a time far enough into the day that the offset will not cause it to jump back into the previous day or forward into the next day, but there are of course exceptions, like America/St_Johns. This lack of rules means that you really cannot make any irrefutable assumptions about the duration of any interval.
The outcome of all of the above is that we had to decide not to support fixed-duration intervals of the form nX in units larger than days, and instead to only allow the 1X form, since that uses calendar intervals. Honestly, even the decision to allow the unit days was hard, because it causes divergence even in common timezones (it's an hour out for half of the year), but it's also one of the most common units for wanting to express an nX interval. It should still be noted that you might see divergence from the calendar intervals you expect in smaller units too, but these cases are much rarer and the divergence tends to accumulate more slowly.
**So why don't we just support calendar intervals of the form nX?**

This is a good question, and one I have asked multiple times before. I also decided about 1.5 years ago that it was completely possible and set about implementing it. The problem is that the way the Joda-Time date library (and probably java.time too, though I haven't looked yet) works messes things up when n is an awkward value.
The way I went about it was the same as how we implement calendar quarters. For quarters, we round the date to the nearest month and then divide that value by 3 to get the quarter for the date. This works perfectly for quarters, but only because there are always exactly 4 quarters in every year, a quarter is always exactly 3 months long, and a year is always exactly 12 months long.
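A rough sketch of that quarter trick (hypothetical Java, not the actual implementation): reduce the date to whole months since the epoch, then floor-divide by 3.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class QuarterSketch {
    // months elapsed since 1970-01, then floor-divide by 3 to get a quarter index
    static long quarterIndex(LocalDate date) {
        long monthsSinceEpoch =
                ChronoUnit.MONTHS.between(LocalDate.of(1970, 1, 1), date.withDayOfMonth(1));
        return Math.floorDiv(monthsSinceEpoch, 3);
    }

    public static void main(String[] args) {
        System.out.println(quarterIndex(LocalDate.of(2018, 3, 31))); // 192 -> Q1 2018
        System.out.println(quarterIndex(LocalDate.of(2018, 4, 1)));  // 193 -> Q2 2018
    }
}
```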
So really this works because the number of nX intervals in the next interval up is a whole number and doesn't change over time (i.e. it is independent of where you are in the calendar[1]). The problem is that this assumption does not hold for arbitrary n and X, so you end up with most of the buckets being, say, 3 days long but then a 1-day bucket at the end of the month or the year to make up the remainder, and that's no good because the user asked for 3-day buckets, not "mostly 3-day buckets with the odd bucket that is 1 or 2 days". What you need to know in this case is the number of calendar[1] days[4] since 1970-01-01T00:00:00.000 and then work out the buckets from there, but this is 1) not quick to calculate and 2) would need some extra fiddling to ensure that the date histogram started at the beginning of a 3-day bucket and not halfway through (since most users would want the start of their 3-day buckets to align with the beginning of their data rather than with some arbitrary Thursday in 1970).
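As a hedged sketch of what counting calendar[1] days since the epoch might look like (illustrative only; alignment and performance concerns are ignored): resolve the timestamp to a local date in the target time zone first, then count days and divide.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

public class ThreeDayBucketSketch {
    // Which 3-day bucket (counted in calendar days from 1970-01-01 in the given zone)
    // does this instant fall into? Illustrative only; offset handling omitted.
    static long threeDayBucket(Instant instant, ZoneId zone) {
        LocalDate localDate = instant.atZone(zone).toLocalDate();
        long calendarDaysSinceEpoch = localDate.toEpochDay(); // calendar days, not millis / 86400000
        return Math.floorDiv(calendarDaysSinceEpoch, 3);
    }

    public static void main(String[] args) {
        System.out.println(
                threeDayBucket(Instant.parse("2018-03-26T00:30:00Z"), ZoneId.of("Europe/London")));
    }
}
```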
**TimeUnit is not just for date histograms**

The time units page is for all places in Elasticsearch where you can specify a time period. Outside of the date histogram this is basically just in settings such as the refresh interval. In these places we only care about it being a fixed-duration interval. We don't allow interval keys larger than days here because of the problems listed above, so keeping the intervals at days or lower allows the user to predict the interval length better. If we allowed months or years here, the interval would diverge from what a user might expect by too much. Again, the days unit is a hard one here because it can and does diverge from the calendar[1] interval by an hour for 6 months of every year in some common time zones.
**So is all lost with solving this?**

No. I think there are some things we can do:

- Supporting nX calendar[1] intervals: this is related to https://github.com/elastic/elasticsearch/pull/26659 so I want to get that change in first. Essentially the idea is that a user specifies that they want nX calendar[1] interval buckets from the date histogram. We collect 1X interval buckets and, after we have performed the reduce, we merge each n consecutive buckets into a single bucket (a rough sketch of this merge step follows the footnotes below). The cost of the aggregation in terms of memory and time would be equivalent to asking for 1X buckets, but the user would be returned the nX buckets they wanted.

If you got here then I am amazed, thanks for reading. Hopefully the above makes some sense.
[1] Here I am considering time zone to be a property of the calendar.
[2] You can't actually do this in the date histogram as the n in nX must be greater than 1, but I am breaking this rule here so the examples can be easily compared.
[3] Please note that UTC is different to GMT (this is confused in a lot of places)
[4] Note this is calendar[1] days, so just dividing the current time in millis by 86400000 won't cut it, for all the previously mentioned reasons about a day not always being 86400000 milliseconds long.
[5] It may be theoretically possible, though I'm not sure
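As mentioned above, a rough sketch of the bucket-merging idea (hypothetical types, nothing from the actual aggregation framework): collect 1X buckets, then after the reduce combine every n consecutive buckets into one.

```java
import java.util.ArrayList;
import java.util.List;

public class BucketMergeSketch {
    // Hypothetical bucket: just a key (the start of the bucket) and a document count.
    record Bucket(String key, long docCount) {}

    // Merge every n consecutive 1X buckets into a single nX bucket.
    static List<Bucket> mergeEvery(List<Bucket> oneXBuckets, int n) {
        List<Bucket> merged = new ArrayList<>();
        for (int i = 0; i < oneXBuckets.size(); i += n) {
            long count = 0;
            for (int j = i; j < Math.min(i + n, oneXBuckets.size()); j++) {
                count += oneXBuckets.get(j).docCount();
            }
            // the merged bucket takes the key of the first bucket in the group
            merged.add(new Bucket(oneXBuckets.get(i).key(), count));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Bucket> monthly = List.of(
                new Bucket("2018-01", 10), new Bucket("2018-02", 7), new Bucket("2018-03", 12),
                new Bucket("2018-04", 5), new Bucket("2018-05", 9), new Bucket("2018-06", 4));
        System.out.println(mergeEvery(monthly, 3)); // two quarterly buckets: 29 and 18 docs
    }
}
```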
What @colings86 said. Nothing longer than a second has a constant length in seconds, so you can't just map things to an equivalent number of seconds (milliseconds, whatever) without losing information. ISO8601 has a nice model for durations (except quarters) that's way too complicated for time units for timeouts etc. but might be a useful guide for more human-facing things.
I wholeheartedly believe that the correct model for things that are supposed to be a whole number of days should be based on counting days, as an integer, and not by mucking around with timestamps that may or may not represent midnight in some timezone or other. I don't understand why things like Joda make this so hard by trying to combine the notion of a day (discrete values) and a time (essentially continuous). They're different things.
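For what it's worth, java.time already draws this line along ISO 8601's model (a quick illustration, not a proposal for how Elasticsearch should implement it): Period is date-based and counts days/months/years as discrete values, while Duration is an exact number of seconds, and the two deliberately don't convert into one another.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.Period;

public class PeriodVsDuration {
    public static void main(String[] args) {
        Period threeDays = Period.parse("P3D");              // a date-based amount: "3 calendar days"
        Duration seventyTwoHours = Duration.parse("PT72H");  // an exact amount of time

        // Adding a Period to a date counts days as discrete values...
        System.out.println(LocalDate.of(2018, 3, 24).plus(threeDays)); // 2018-03-27

        // ...while a Duration is always exactly 72 * 3600 seconds, DST or not.
        System.out.println(seventyTwoHours.getSeconds()); // 259200
    }
}
```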
> What you need to know in this case is the number of calendar[1] days[4] since 1970-01-01T00:00:00.000 and then work out the buckets from there, but this is 1) not quick to calculate
I'm not sure about this. Calculating the local timezone offset involves a little bit of a search but we have to do this anyway; calculating the number of days since 1970-01-01 is an addition and a division, given the offset.
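i.e. something like the following sketch, assuming the zone offset for the instant in question has already been resolved:

```java
public class DayCountSketch {
    // "an addition and a division": given the already-resolved zone offset
    // for the instant (offsetMillis), the local calendar day is just:
    static long calendarDaysSinceEpoch(long epochMillis, long offsetMillis) {
        long localMillis = epochMillis + offsetMillis;   // the addition
        return Math.floorDiv(localMillis, 86_400_000L);  // the division
    }

    public static void main(String[] args) {
        // 2018-03-25T23:30Z with a +01:00 offset is 00:30 local on 2018-03-26
        System.out.println(calendarDaysSinceEpoch(1_522_020_600_000L, 3_600_000L)); // 17616
    }
}
```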
> 2) would need some extra fiddling to ensure that the date histogram started at the beginning of a 3-day bucket
I think the ability to "offset" buckets like this would be very useful. For instance, the UK tax year is a whole year, but offset by -270 days so it always starts on 6 April. I can think of more exotic bucketing strategies that would be useful: for instance, some accounting techniques require years to be a whole number of weeks long, so most "years" are 52 weeks and then every few years there's a 53-week one to catch up. Other accounting techniques require years to be a whole number of 4-week periods long, so most years are 52 weeks long and then occasionally there's a 56-weeker to stop things getting too far out of kilter. The UK railway divides the year into 13 periods, the first of which starts on 1 April, the second starts on the Sunday before the first Thursday in May, and the rest start every 28 days after that. I could go on.
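The fixed-length cases could fold into the same day-counting scheme with a simple shift (a hypothetical sketch; calendar-aware variants like the 52/53-week accounting years would clearly need more than this):

```java
public class OffsetBucketSketch {
    // Hypothetical: fixed-length day buckets whose boundaries are shifted by offsetDays,
    // e.g. 7-day buckets that start on a Sunday instead of the epoch's Thursday.
    static long offsetBucket(long daysSinceEpoch, long bucketLengthDays, long offsetDays) {
        return Math.floorDiv(daysSinceEpoch - offsetDays, bucketLengthDays);
    }

    public static void main(String[] args) {
        // day 0 (1970-01-01, a Thursday) with a 3-day offset falls into bucket -1 rather than 0
        System.out.println(offsetBucket(0, 7, 3)); // -1
    }
}
```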
Since it's nearly renewal time for my pedantry badge this year and I'm a few points short, I'm compelled to add:
> A week is always 7 days
Except the weeks containing 6 or 8 days that happen when a country decides to move itself across the international date line. We don't talk about those, tho.
> A year is always exactly 12 months long
Unless we start to support aggregations based on things like the Hebrew calendar. They have leap _months_.
> February can have 29 days if the year is divisible by 4, unless it's divisible by 100 when it has 28 days, unless it's divisible by 400 when it has 29 days again
... and there was this one time in Sweden in 1712 when it had 30 days.
> The time units page is for all places in Elasticsearch where you can specify a time period. Outside of the date histogram this is basically just in settings such as the refresh interval.
There is also the watcher throttle_period, machine learning has query_delay, latency and bucket_span, and rollover has max_age. That's just a random pick, but quite a few places in the API take a time unit.
@elastic/es-search-aggs
Closing this in favour of https://github.com/elastic/elasticsearch/issues/29410 and https://github.com/elastic/elasticsearch/issues/29411, which have been created based on a FixitFriday discussion of how to progress this issue.