When reading a CSV string, some fields that should be parsed as strings are parsed as dates.
This happens when the string ends with something that resembles a date.
Use the following snippet to reproduce.
var foo = 'foo,bar\nfoo2,bar2\n"Wololo thing - ends Feb. 15, 2017","Wololo thing - ends Feb. 15, 2017 bar"';
var w = XLSX.read(foo, {type: "string", cellDates: true});
console.log(w.Sheets.Sheet1);
Expected result: the third row contains 2 strings:
{t: "s", v: "Wololo thing - ends Feb. 15, 2017"}
{t: "s", v: "Wololo thing - ends Feb. 15, 2017 bar"}
Actual result: the third row contains
{t: "d", v: Wed Feb 15 2017 00:00:00 GMT-0200 (-02), w: "2/15/17"}
{t: "s", v: "Wololo thing - ends Feb. 15, 2017 bar"}
For plaintext formats like CSV and HTML, the current approach for determining the data type is testing whether the string represents a valid Date and whether the month is mentioned. V8 (chrome/nodejs) is extremely lenient in this regard. You can test in the JS console:
> new Date("Wololo thing - ends Feb. 15, 2017")
Wed Feb 15 2017 00:00:00 GMT-0500 (EST)
Realistically, we'll have to pick a small set of explicit formats to test against.
@brollins90 the plaintext (HTML/DSV) date resolution is in the fuzzydate function. At a high level, a string is converted to a Date and the function checks for some known corner cases. It also inspects the string to see if it's likely to be a date.
The guess is trickier because Excel lets you mix text with date formats. For example, the string Today is Jan 25 2018 can be reproduced by setting a cell to 43125 with format "Today is "mmm dd yyyy. Should the string be considered a date or a string?
For example, the string Today is Jan 25 2018 can be reproduced by setting a cell to 43125 with format "Today is "mmm dd yyyy. Should the string be considered a date or a string?
This seems crazy to me! Keep it as a string of course!!!
This behavior should be disabled on request. I'm importing a CSV (tried with xls too) to perform corrections and it keeps on converting values like "VIA MARTUCCI 2" to the date "3/2/01", I tried using raw: true but it simply uses the number "36952". Only with xlsx files it doesn't convert these values.
I am having the same issue here. A column with IDs like "0002-001-1" gets imported as "2/1/01". I also tried using "raw: true" but I get a weird value: "36892.041666666664"
Is there no way around this?
I'm using xlsx.read(foo, { type: 'string', raw:true }); and I'm getting the dates in the expected plain text format
I'm running into this problem when using sheet_to_json and it doesn't appear that I can override to indicate to just bring in the string and not attempt to type it as a n or date. Parses the string fine for xls/xlsx files, but when I do a csv it casts to n and takes Dec 20 => 12/20/2001.
To those who are getting weird number for the date row when using sheet_to_json function,
remember to include the raw: false option in sheet_to_json
const workbook = excel.readFile(file.path);
const sheet_name_list = workbook.SheetNames;
const json = excel.utils.sheet_to_json(workbook.Sheets[sheet_name_list[0]], {
raw: false,
});
raw: false
solved for this
@SheetJSDev there seems to be a lot of issues with dates and the cellDates option. I think the major issue here is that cellDates causes sheetjs to try and parse everything as a date, including guessing for string cells.
I think this behaviour is wrong. If we're reading in files with string dates (especially CSV files) then it's likely that we know best how to parse them, so leave them alone. But if we're reading in an xlsx file then we need sheetjs to convert that stupid number format into a date for us.
We only have two options at the moment: parse nothing or parse everything. But what we want is to parse everything that's clearly marked as a date (in the type information, not by any guessing), and leave everything else alone.
This should be the default behaviour. If we want sheetjs to also try to parse strings into dates we should have to specify an option for that (although I don't see the point of complicating sheetjs with such an option when it's easier to just do this afterwards). If we want to get the raw values for those silly number dates, then we should have to specific an option for that (either raw or a separate option).
There are two key points here:
xlsx.SSF.parse_date_code unless there was something wrong with the type information or we specifically requested raw mode.Edit: Apparently this guessing happens even with cellDates: false. You have to set raw: true to get it to stop guessing, but then you get strings for everything, including numbers (when parsing a CSV file)
Excel does its own date and value parsing from CSV. To confirm this, save the following to a CSV file and open in Excel:
2020-05-07,=YEAR(INDIRECT("A"&ROW()))
="2020-05-07",=TYPE(INDIRECT("A"&ROW()))
"2020-05-07",=TYPE(INDIRECT("A"&ROW()))
2020,=TYPE(INDIRECT("A"&ROW()))
FALSE,=TYPE(INDIRECT("A"&ROW()))
=NA(),=TYPE(INDIRECT("A"&ROW()))
Cells A1 and A3 are interpreted as dates, while cell A2 is kept as text. In fact, you'll notice that the expressions starting with "=" are actually interpreted as formulae rather than text!
raw: true lets you skip the whole value parsing logic, giving you the original text that you are free to interpret however you wish.
The root of the problem in this issue is the flexibility of V8 in date parsing. Consider:
new Date("How many days does it take for Mars to go around the sun? 687")
We can probably agree that string isn't a date, but V8 thinks it is Tuesday March 1 in the year 687
Ok so with raw:true, cellDates: true it seems to work how I expect, with the caveat that I also have to parse numbers myself.
I'm really confused by the folks saying they have to use raw: false - that's surely the opposite of what you want here?
Most helpful comment
I'm using xlsx.read(foo, { type: 'string', raw:true }); and I'm getting the dates in the expected plain text format