Describe the bug
Using Diff and comparing Dates for equality in 2 columns sometimes leads to unexpected results.
To Reproduce
Steps to reproduce the behavior:
end_date value is 2900-01-01-T00:00:00Z and column begin_date value is 1923-08-03-T00:00:00Zdiff(cells['end_date'].value, cells['begin_date'].value, "days")Current Results
?
Expected behavior
Negative number is expected
Desktop (please complete the following information):
OpenRefine (please complete the following information):
Additional context
See Mailing List https://groups.google.com/d/msg/openrefine/cwgIgA7-d8k/OJmEAc4nBAAJ
This is possibly being caused by an incorrect comparison.
From java.time.OffsetDateTime notes: "This is a value-based class; use of identity-sensitive operations (including reference equality (==), identity hash code, or synchronization) on instances of OffsetDateTime may have unpredictable results and should be avoided. The equals method should be used for comparisons." - https://docs.oracle.com/javase/8/docs/api/java/time/OffsetDateTime.html
Our code is here:
https://github.com/OpenRefine/OpenRefine/blob/ebaad96dfc33f44b1a0904800cd8bdac7c655220/main/src/com/google/refine/expr/functions/strings/Diff.java#L58
I agree that's an important issue but "critical" is probably a bit over the top, no? :)
Is this definitely a problem on 3.1/3.2? I did some limited testing earlier and seemed ok. Happy to test further but would appreciate any definitive examples.
Note report on mailing list is for LODRefine which is now v outdated & not directly supported by main project
@wetneb We typically used critical if it affects the reliability of data transforms. This is borderline, and in those cases in the past, both Tom and David would mark as critical, so I'm following our precedent that we had. But happy to change that precedent.
One thing that would help is just to isolate the bug as the evaluation result of a single GREL expression, such as diff(toDate(...), toDate(...), ...), and check in which version the behaviour changed (if any).
Checking:
In both cases I get -70378 which is what I'd expect.
Re-reading the thread on the Google group I can see that the LODRefine user also reports that they get a negative number - which is as expected. I've asked @ettorerizza for clarification, but I think maybe we don't actually have an immediate problem here.
Looking at the code, the Date is actually converted to epoch in nanonseconds and do diff calculation there before converting back from nanoseconds into the requested diff units - so we aren't doing any identity-sensitive operations on OffsetDateTime objects
I'll wait for Ettore to clarify, but suspect this can be closed
@ostephens On OR 3.2 with Mac OSX, I got a positive (and thus wrong) result on Olivia's dataset, but a correct answer using directly your formula.


Types and epochs looks correct.

Note: the results in OR 3.1 are exactly the same as in 3.2, but the result in 2.8 is radically different.
OR 2.8

Note 2: Positive or negative, 70378 days correspond to less than two centuries. This is not at all the number of days between 2900 and 1923.
I agree with Thad that this should be considered as a critical issue, since it affects the reliability of OpenRefine.
The problem seems to apply to dates that differ by > 106750 days. Compare:
diff(toDate("1700-01-10"),toDate("1992-04-21"),"days") - > +106751
with
diff(toDate("1700-01-10"),toDate("1992-04-19"),"days") -> -106750
In terms of the difference being too small - once it flips the difference starts to reduce rather than grow:
diff(toDate("1700-01-10"),toDate("2285-01-01"),"years") -> 0
It will then climb back up to 106750 days again before flipping again
What could be the particular meaning of 106750 days (more or less 292 years)? It looks like the problems started from the moment the diffs were calculated in nanoseconds rather than milliseconds. Could it be that it is simply a question of a figure that is too big and therefore unmanageable?
Yeah - it looks like we are hitting the maximum Long in Java. Max long =
9,223,372,036,854,775,807
106752 days in milliseconds
9,223,372,800,000,000,000
Jacky treated the delta as a long and it probably should have been BigInteger https://docs.oracle.com/javase/8/docs/api/java/math/BigInteger.html After it is changed to that, the differences in dates can span as far as you have memory on a 64bit machine :-)
I think this function should actually be rewritten to use the Duration class, which is designed for that:
https://docs.oracle.com/javase/8/docs/api/java/time/Duration.html
I expect this could solve this issue in the same go.
or Duration, yeah, even better, which can store longer than the age of the universe.
A physical duration could be of infinite length. For practicality, the duration is stored with constraints similar to Instant. The duration uses nanosecond resolution with a maximum value of the seconds that can be held in a long. This is greater than the current estimated age of the universe.
Hi, first timer here. Do you mind if I take it?
Are there any files other than main/src/com/google/refine/expr/functions/strings/Diff.java that should be altered?
Hi @impune-pl, thank you very much for offering to work on this! It would be very welcome.
The code changes would be in Diff.java indeed. In addition to that, you could write tests for it (with the examples given above, for instance) in main/tests/server/src/com/google/refine/tests/expr/functions/strings/DiffTests.java. If you are not familiar with such tests I would be happy to help.
Hi @wetneb, I'm happy I can help - I've been using open source software for a long time, so it's time to give something back to the community.
I hope I'll be able to handle testing on my own, but if I encounter any problems, I won't hesitate to ask for your help.
In the meantime I'd like to ask you a question about expected behaviour of this function. I can see it does not take different lengths of months into account, using flat 30 days = 1 month conversion factor. Should I keep it that way, or should I use java.time.Duration for time units shorter than one day, and java.time.Period to get years, months and days?
Also, Duration.toNanos() and Duration.toMillis() throw ArithmeticException in case of overflow. What should I do with it? Personally I think it'd be best to create a new error class or reuse com.google.refine.expr.EvalError to return an informative error message, but I'd like to hear what you think about it.
The 30 days = 1 month factor is clearly bad, using java.time.Duration and java.time.Period will be much better.
Concerning ArithmeticException, throwing an EvalError (or a subclass of it) sounds appropriate indeed.
Most helpful comment
Yeah - it looks like we are hitting the maximum Long in Java. Max long =
9,223,372,036,854,775,807106752 days in milliseconds
9,223,372,800,000,000,000