Describe the bug
When casting a TIMESTAMP_MICROSECONDS value to TIMESTAMP_SECONDS, I have to add one second to the resulting value if the TIMESTAMP_MICROSECONDS value is negative (i.e., before Jan 1, 1970).
Steps/Code to reproduce bug
I reproduced this using Spark. I don't know of another way to make it happen. If you need me to, I can try to make a C++ unit test for this.
Expected behavior
Spark and cudf produce the same result.
This sounds related to #3664
Nope. I reverted the change made for #3664. That causes casting to days to fail in Spark, but it had no impact on the MICROS to SECONDS issue. The patch is simple enough that I think I did it right.
A repro is still greatly appreciated.
C++ repro
TEST_F(CastTimestamps, PriorTo1970usTos)
{
  using namespace cudf::test;

  // 1965-10-26 14:01:12.762000000 GMT
  auto timestamps_us = cudf::test::fixed_width_column_wrapper<cudf::timestamp_us>({-131968727238000});
  auto timestamps_s  = cudf::experimental::cast(timestamps_us, cudf::data_type{cudf::TIMESTAMP_SECONDS});
  cudf::test::print(timestamps_s->view());
}
expected: 1965-10-26T14:01:12Z
output:   1965-10-27T00:00:00Z
@revans2 please confirm.
@rgsl888prabhu this seems more like a bug in test::print (which relies on cudf::strings::from_timestamps) than in the cast. E.g., adding these two lines seems to make the test pass:
auto expected = fixed_width_column_wrapper<cudf::timestamp_s>({-131968728});
expect_columns_equal(timestamps_s->view(), expected);
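For reference, the full test with those two lines folded in would look roughly like this (just a sketch, reusing the value and helpers from the repro above):

TEST_F(CastTimestamps, PriorTo1970usTos)
{
  using namespace cudf::test;

  // 1965-10-26 14:01:12.762000000 GMT
  auto timestamps_us = fixed_width_column_wrapper<cudf::timestamp_us>({-131968727238000});
  auto timestamps_s  = cudf::experimental::cast(timestamps_us, cudf::data_type{cudf::TIMESTAMP_SECONDS});

  // -131968728 s == 1965-10-26 14:01:12 GMT (floored, i.e. one second earlier than
  // truncation toward zero would give)
  auto expected = fixed_width_column_wrapper<cudf::timestamp_s>({-131968728});
  expect_columns_equal(timestamps_s->view(), expected);
}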
I'll try to get a repro case shortly.
Here is the test case that I came up with.
diff --git a/cpp/tests/unary/unary_ops_test.cu b/cpp/tests/unary/unary_ops_test.cu
index a62f46197..2d1ec83af 100644
--- a/cpp/tests/unary/unary_ops_test.cu
+++ b/cpp/tests/unary/unary_ops_test.cu
@@ -594,6 +594,19 @@ void validate_cast_result(cudf::column_view expected, cudf::column_view actual)
struct CastTimestampsSimple : public cudf::test::BaseFixture {
};
+TEST_F(CastTimestampsSimple, PriorTo1970usTos)
+{
+ using namespace cudf::test;
+
+ // 723-3-21 8:29:37.781000 GMT
+ auto timestamps_us = make_column<cudf::timestamp_us>(std::vector<int64_t>{-39344693986219000});
+ // Spark produces -39344693986 cudf produces -39344693987
+ auto expected_s = make_column<cudf::timestamp_s>(std::vector<int64_t>{-39344693986});
+ auto timestamps_s = cudf::experimental::cast(timestamps_us, cudf::data_type{cudf::TIMESTAMP_SECONDS});
+
+ validate_cast_result<cudf::timestamp_s, cudf::timestamp_s>(expected_s, *timestamps_s);
+}
+
TEST_F(CastTimestampsSimple, IsIdempotent)
{
using namespace cudf::test;
This test does pass if #3670 is reverted.
The logic is: if the value is negative, we still need to floor it (round toward negative infinity) rather than truncate toward zero.
timestamp_s(-1) == 1969/12/31 23:59:59
So the day is still 31 December. It won't be the next day until one more second.
timestamp_us(-1) == 1969/12/31 23:59:59.999999
It won't be 1970/1/1 for one more microsecond. So at seconds resolution, the time is still 1969/12/31 23:59:59 == timestamp_s(-1).
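To make the floor-vs-truncate difference concrete, here is a minimal host-side sketch (plain int64_t math, not the actual cudf kernel) showing why truncating toward zero shifts a negative microsecond timestamp forward by one second:

#include <cstdint>
#include <cstdio>

// Truncation toward zero: what plain C++ integer division does.
int64_t truncate_us_to_s(int64_t us) { return us / 1000000; }

// Floor division: round toward negative infinity instead.
int64_t floor_us_to_s(int64_t us)
{
  int64_t q = us / 1000000;
  return (us < 0 && us % 1000000 != 0) ? q - 1 : q;
}

int main()
{
  int64_t us = -1;  // 1969-12-31 23:59:59.999999 GMT
  std::printf("truncate: %lld\n", (long long)truncate_us_to_s(us));  // 0  -> 1970-01-01 00:00:00
  std::printf("floor:    %lld\n", (long long)floor_us_to_s(us));     // -1 -> 1969-12-31 23:59:59
  return 0;
}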
Normally users want truncating a timestamp to a lower resolution to behave the same regardless of where the timestamp falls relative to some arbitrary epoch that they may not even know about. I verified that the CPU date_trunc behavior in both PostgreSQL and Spark on a timestamp before the epoch matches what was implemented in #3670. So Spark seems to be a bit self-inconsistent here depending on whether date_trunc or unix_timestamp is used.
Maybe I was arguing for the wrong level of compatibility in #3670. SQL wants the current behavior, but apparently C++ and Java truncate toward zero rather than toward negative infinity, even for time values, which produces these weird results where timestamps after the epoch are truncated down but timestamps before the epoch are effectively rounded up (ceiled).
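For what it's worth, standard C++ <chrono> exposes both conventions side by side, which shows the distinction being discussed here: duration_cast truncates toward zero, while std::chrono::floor (C++17) rounds toward negative infinity.

#include <cassert>
#include <chrono>

int main()
{
  using namespace std::chrono;
  microseconds us{-1};  // one microsecond before the epoch
  assert(duration_cast<seconds>(us).count() == 0);            // truncates toward zero (the C++/Java default)
  assert(std::chrono::floor<seconds>(us).count() == -1);      // rounds toward negative infinity (the #3670 behavior)
  return 0;
}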
We just need to pick what cudf is going to do here and the Spark plugin can do some postprocessing if necessary to adjust to that behavior. Adding @jrhemstad and @kkraus14 for their opinions. I don't know the Pandas expected behavior for downcasting the resolution of timestamps before and after the epoch.
Yes, because Spark is self-inconsistent, I am fine with following a single standard and working around it in the places where we need to.
@revans2 so should I close this then?