Elasticsearch version (bin/elasticsearch --version): 7.6.2
Plugins installed: []
JVM version (java -version): ESS (GCP)
OS version (uname -a if on a Unix-like system): ESS (GCP)
Description of the problem including expected versus actual behavior:
I'm trying to use the Data visualizer to upload a file and I get this error:
File could not be read
[illegal_argument_exception] Could not find a timestamp in the sample provided
Steps to reproduce:
Please try uploading the attached file.
This error is weird because this file does have a timestamp, and also because I can upload other files that have no timestamp at all and they work fine.
Thank you in advance for your answer.
Pinging @elastic/ml-core (:ml)
I have the same issue and I am very interested in a fix!
This error is weird because this file does have a timestamp
Your timestamp format is yyyyMMdd, which is not one of the ones that is detected out-of-the-box. You need to override the timestamp using the timestamp_format=yyyyMMdd URL argument on the find_file_structure endpoint - see https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-find-file-structure.html. Then the structure will be detected.
Run this and it works fine (replace the username, password and ES server hostname as appropriate):
curl -u elastic:password -s -H "Content-Type: application/json" -XPOST "http://localhost:9200/_ml/find_file_structure?pretty&timestamp_format=yyyyMMdd" -T no_timestamp_issue.txt
However, it sounds like you're accessing that endpoint via Kibana, and you cannot do that until elastic/kibana#38868 is implemented.
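As a side note, yyyyMMdd is a Java-style date pattern; the equivalent Python strptime format is %Y%m%d, which you can sanity-check against a sample value (the value 20200417 below is just an illustration, not taken from the attached file):

```python
from datetime import datetime

# The Java date pattern "yyyyMMdd" corresponds to "%Y%m%d" in Python's
# strptime. The sample value is illustrative only.
sample = "20200417"
parsed = datetime.strptime(sample, "%Y%m%d")
print(parsed.date())  # 2020-04-17
```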
also because I can upload other files that have no timestamp at all and they work fine.
These would be NDJSON or delimited files. Currently we require a timestamp in semi-structured text files because the lines are grouped into messages by assuming that the first line of every message has the timestamp.
We have an enhancement request open, #55219, to allow the user to say there is no timestamp, in which case semi-structured text files would be assumed to have one line per message. However, to take advantage of that through the UI would also require that elastic/kibana#38868 was implemented.
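The grouping rule described above can be sketched as follows. This is an illustrative re-implementation in Python, not the actual Elasticsearch code, and the timestamp regex is an assumption for the example: a line that starts with a timestamp begins a new message, and any other line is treated as a continuation of the current message.

```python
import re

# Assumed pattern for "line starts with a timestamp" (illustration only).
TIMESTAMP_AT_START = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")

def group_messages(lines):
    """Group raw log lines into messages: a timestamped line starts a new
    message; continuation lines (e.g. stack traces) join the previous one."""
    messages = []
    for line in lines:
        if TIMESTAMP_AT_START.match(line) or not messages:
            messages.append(line)
        else:
            messages[-1] += "\n" + line
    return messages

log = [
    "2020-04-17 10:00:00 ERROR something failed",
    "  at com.example.Foo.bar(Foo.java:42)",
    "2020-04-17 10:00:01 INFO recovered",
]
print(group_messages(log))  # two messages: the stack-trace line joins the first
```

This is also why a file with no recognizable timestamp fails for semi-structured text: without the timestamp anchor there is no way to tell where one message ends and the next begins.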
Basically you should upvote elastic/kibana#38868.
Thanks for your detailed answer.
I just upvoted the Kibana issue you mentioned.
Thank you @droberts195
However, I'm able to upload other files with no timestamp, and it works just fine; no timestamp is identified.
See this example below.
See this example below.
That file has a delimited format (CSV). Currently we require a timestamp in semi-structured text files because the lines are grouped into messages by assuming that the first line of every message has the timestamp. But we do not require a timestamp in NDJSON or delimited files.
I see...
good catch @droberts195
Sometimes (like in my first example), spaces and semi-colons can be replaced by commas to convert the file to CSV! cc @fbaligand
Waiting for the enhancement ;)
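The conversion suggested above can be done safely with Python's csv module rather than a blind string replace, so that quoted fields containing ";" or "," survive. This is a minimal sketch, not an official tool:

```python
import csv
import io

def semicolons_to_commas(text):
    """Re-delimit ";"-separated text as ","-separated CSV. Using the csv
    module (not str.replace) keeps quoted fields intact and quotes any
    field that itself contains a comma."""
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        writer.writerow(row)
    return out.getvalue()

print(semicolons_to_commas("a;b;c\nd;e,f;g\n"))
# the field "e,f" comes out quoted, so the column count is preserved
```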
OK, now I understand the problem better.
ML Data Visualizer does not recognize a CSV file when the delimiter is ";" (and not ",").
For example, I get the "[illegal_argument_exception] Could not find a timestamp in the sample provided" error when I upload this file:
a;b;c
d;e;f
g;h;i
a;b;c
d;e;f
g;h;i
If I retry with "," delimiter, it works fine:
a,b,c
d,e,f
g,h,i
a,b,c
d,e,f
g,h,i
That is unfortunate because Microsoft Excel uses the ";" delimiter when it generates a CSV export.
So, if I understand correctly, this will be fixed when https://github.com/elastic/kibana/issues/38868 is implemented?
ML Data Visualizer does not recognize a CSV file when the delimiter is ";" (and not ",").
That's not quite true: it can sometimes recognize semi-colon-separated files, but only with at least 4 fields per row: see https://github.com/elastic/elasticsearch/blob/fd554d95e462232ee9799c82ba8faea11ac481a9/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/FileStructureFinderManager.java#L283
So a workaround could be to add an extra column to your file so it has 4 fields per row.
The reason it requires more semi-colons than commas to detect a delimited format is that we found semi-colons appear more often in files that a human would classify as semi-structured text log files than commas do (at least in the sample files we looked at when developing the feature).
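The heuristic described above might be sketched like this. This is an illustrative re-implementation, not the linked Java code; the thresholds for "," (2) and ";" (4) come from this thread, while the tab and pipe entries are assumptions for the example:

```python
import csv
import io

# Minimum fields per row before a candidate delimiter is accepted.
# "," = 2 and ";" = 4 per the discussion above; "\t" and "|" are assumed.
MIN_FIELDS = {",": 2, ";": 4, "\t": 2, "|": 5}

def detect_delimiter(sample):
    """Return the first candidate delimiter for which every sampled row
    reaches the minimum field count, or None to fall back to
    semi-structured text analysis. (The real detector does much more,
    e.g. checking that field counts are consistent across rows.)"""
    for delim, min_fields in MIN_FIELDS.items():
        rows = list(csv.reader(io.StringIO(sample), delimiter=delim))
        if rows and all(len(r) >= min_fields for r in rows):
            return delim
    return None

print(detect_delimiter("a;b;c\nd;e;f\n"))      # None: only 3 ";" fields per row
print(detect_delimiter("a;b;c;1\nd;e;f;1\n"))  # ";" : 4 fields per row
```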
Of course, if you override the delimiter then any number of fields per row is enough: see https://github.com/elastic/elasticsearch/blob/fd554d95e462232ee9799c82ba8faea11ac481a9/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/FileStructureFinderManager.java#L482-L485
And you are correct that in order to specify an exact delimiter in the UI when the initial analysis failed you need elastic/kibana#38868.
/cc @elastic/ml-ui
Thanks for your answer and your tip!
I just tested with 4 columns, and it works fine!
Thank you @droberts195 !
From what I know, the US version of Excel reads and writes CSV files with a "," delimiter, whereas in France it uses ";".
In Excel (in France), "Save As" with the type "CSV UTF-8 (delimited by commas)" actually outputs a CSV delimited by semi-colons.
When reading a CSV, the French version of Excel also expects a ";" delimiter.
So I guess if Elasticsearch wants to be French-friendly, supporting the semi-colon would be great ;)
Yes! Elasticsearch is friendly to French people ;)
supporting the semi-colon would be great
It's a bit misleading to say it's "not supported" today: the semi-colon delimiter is just not auto-detected with fewer than 4 fields.
If you explicitly say the separator is a semi-colon, it works with any number of fields.
curl -u elastic:password -s -H "Content-Type: application/json" -XPOST "localhost:9200/_ml/find_file_structure?pretty&explain&format=delimited&delimiter=%3B" -T- <<EOF
a;b;c
d;e;f
g;h;i
a;b;c
d;e;f
g;h;i
EOF
This file works fine without any hint:
a;b;c;1
d;e;f;1
g;h;i;1
a;b;c;1
d;e;f;1
g;h;i;1
The difference compared to commas is that you need a minimum of 4 fields for auto-detection. For commas it's 2.
I am actually wondering if commas should be 3.
Consider this file:
1,1
2,2
3,3
4,4
5,5
Is it:
1. a comma-delimited file with two numeric columns, or
2. a single column of decimal numbers written with a decimal comma (so 1,1 means 1.1, as in a French locale)?
Correct answer: don't know.
At the moment the file structure finder decides on 1. Maybe it shouldn't make a decision at all. Or maybe it needs an idea of the user's locale, but that's not trivial: we couldn't use the Elasticsearch server's locale, which in Cloud, for example, presumably doesn't vary across regions. It would have to be passed in as a URL argument to find_file_structure.
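The ambiguity above can be made concrete: the same bytes parse cleanly either way.

```python
import csv
import io

# The same bytes are either a two-column comma-delimited file, or (in a
# French locale) one column of decimal numbers using "," as the decimal
# separator.
sample = "1,1\n2,2\n3,3\n"

as_csv = list(csv.reader(io.StringIO(sample)))
as_french_decimals = [float(line.replace(",", ".")) for line in sample.splitlines()]

print(as_csv)              # [['1', '1'], ['2', '2'], ['3', '3']]
print(as_french_decimals)  # [1.1, 2.2, 3.3]
```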
In the short term we are back to the user having to give a hint. All roads lead to elastic/kibana#38868.
Thanks for this detailed answer and especially for the hint to specify explicitly the delimiter.
As you say, all roads lead to the Kibana issue you mentioned.
Hope it will be implemented soon :)