Openrefine: Excel dates imported as java.util.Date, not OpenRefine date

Created on 15 Dec 2018  路  13Comments  路  Source: OpenRefine/OpenRefine

Describe the bug
Date columns from Excel files are imported as text into Refine.

To Reproduce
Steps to reproduce the behavior:

  1. Create a new project using an Excel file with dates in it, e.g. the one attached below.
  2. The configuration screen for project creation will show the date column in black. Create the project anyway to try to convert it later on.
  3. The date is read correctly from the Excel file (December 11th, 2018), but displayed in black, not green, i.e. not recognised as date.
  4. Apply Common Transforms > To Date to the "Date" column. The date is still stored as text. See screenshot below.

Current Results
What results occured or were shown.

Expected behavior
A clear and concise description of what you expected to happen or to show.

Screenshots
screenshot 2018-12-15 at 01 15 52

Desktop (please complete the following information):

  • OS: macOS Mojave
  • Browser Version: Safari 12.02
  • JRE or JDK Version: 1.8.0_181

OpenRefine (please complete the following information):

  • Version 3.1 [b90e413]

Datasets
Minimal file: Date test.xlsx

Additional context
This same file worked fine in Refine 2.8.

XLS(X) bug good second issue import High

Most helpful comment

Yeah, yet another reason to just import these things as plain strings, I would say.

All 13 comments

I suspect this may not just be an Excel issue but rather an issue with dates starting with a weekday name - I can't get Tue Dec 11 00:00:00 GMT 2018 to parse with toDate() currently (but Dec 11 00:00:00 GMT 2018 works fine)

We need to add a test case for this and then fix the toDate to work

Once this is working we can re-test the Excel import - there may be other issues but I think fixing the date parsing for this form of date is the starting point

OK - I may need to revise my first opinion as @ettorerizza has pointed out in #1287 that he can get dates of this form to parse OK in 3.1

I'll investigate further

Yes, I confirm that Tue Dec 11 00:00:00 GMT 2018 can be parsed with value.toDate("EEE MMM dd h:m:s z yyyy"), at least in OR 3.1.

It looks like this issue is specifically connected with the import from Excel. If I use the supplied xlsx file from above then after import toDate() does not work. If apply a transform to the cell with just value (i.e. should be no change) then apply toDate() this then works.

This suggests there is something about the cell content after the import that is causing a problem

@ostephens what about other importer formats (.xls .ods) and not just .xslx?

@ostephens Looks like the problem with the Excel file comes from the blank spaces between each part of the date-time string, since you can parse it using this Grel formula :

value.replace("\p{Blank}", " ").toDate("EEE MMM dd h:m:s z yyyy")

Edit

This one works too :

value.toString().toDate("EEE MMM dd h:m:s z yyyy")

When directly imported from Excel, the date "string" is not really a string, but a java.util.Date object. The difference is indistinguishable to the naked eye.

screenshot-127 0 0 1-3333-2019 01 08-00-34-21

Still an issue in 3.3

Maybe this issue is pertaining to if dateCompatibility attribute is defined in the original .xlsx ? (not sure what POI is doing with this)
Can you look with Notepad or text editor for that attribute in the .xlsx file?

I found this note in https://interoperability.blob.core.windows.net/files/MS-OI29500/[MS-OI29500].pdf#page=247

The standard states that the dateCompatibility attribute determines whether the date base should be treated as a compatibility date base or should support the full date range of [ISO-8601]. Office ignores the dateCompatibility attribute. If workbook@conformance equals "strict", the 1900 date system is used. Otherwise the 1900 compatibility date system is used. Excel does not support negative serial numbers. This note applies to the following products: Office 2013 Client (Strict), Office 2013 Server (Strict), Office 2013 Client (Transitional), Office 2013 Server (Transitional).

When importing the attached Excel file, OR automatically changes 00:00:00 to Sun Dec 31 00:00:00 CET 1899 and 2020 to 2020.0 (visible only in facets). Retransforming the time to 00:00:00 is not trivial, since the GREL formula value.toString().toDate("EEE MMM dd h:m:s z yyyy").toString("hh:mm:ss") can return 11:00:00 and not 00:00:00 according to the user's locale.

See for example this post on StackOverflow.

IMHO, this issue should be prioritized as Excel is the reference tool for most new users of OR. This kind of hard to fix bug can discourage them from going further.

In the meantime, I suggest advising users faced with the problem to export their Excel file in CSV. It seems that a .ODS export can't be opened in OpenRefine anymore (but this is another bug).

Classeur1.xlsx

Yeah, yet another reason to just import these things as plain strings, I would say.

@wetneb ok, anything will be fine if well documented ;-)

@ettorerizza's https://github.com/OpenRefine/OpenRefine/issues/1908#issuecomment-452117627 has the critical information. The Excel importer is creating java.util.Date objects instead of OpenRefine dates. The two used to be the same, but the migration to the Java 8 API in 19f98b7ea2a012065a9e21999a9f9dc30347daf7 broke this equivalence when the definition of an OpenRefine date changed:

- if (v instanceof Calendar || v instanceof Date) {
+ if (v instanceof OffsetDateTime) {

and the Excel importer wasn't updated.

To fix this:

  • first, create a (failing) test which imports the test file provided above (thank you @dcabo !) and verifies that the cell value is imported as an OpenRefine date
  • then, repair the code to fix the incompatibility and confirm that the test is green

It's unfortunate that this regression has languished since OpenRefine 3.0, but it should be straightforward for someone to fix.

Although I changed this to an OffsetDateTime, which is what OpenRefine currently uses internally, using LocalDateTime might be more appropriate since Excel doesn't have the concept of timezones.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

stellasia picture stellasia  路  4Comments

wetneb picture wetneb  路  3Comments

kushthedude picture kushthedude  路  3Comments

ettorerizza picture ettorerizza  路  4Comments

thadguidry picture thadguidry  路  3Comments