Hello!
I have a CSV file from here.
It's in UTF-16 LE encoding, and this code:
csv = CSV.parse File.read("devices.csv")
fails with this output:
$ crystal src/test.cr
Unexpected byte 0xfe at position 1, malformed UTF-8 (InvalidByteSequenceError)
[4700487] *CallStack::unwind:Array(Pointer(Void)) +87
[4700378] *CallStack#initialize:Array(Pointer(Void)) +10
[4700330] *CallStack::new:CallStack +42
[4620888] *raise<InvalidByteSequenceError>:NoReturn +24
[5125669] ???
[5124664] *Char::Reader#decode_current_char:Char +232
[5124418] *Char::Reader#initialize<String>:Char +34
[5124322] *Char::Reader::new<String>:Char::Reader +98
[5225163] *CSV::Lexer::StringBased#initialize<String, Char, Char>:(Int32 | Nil) +59
[5225073] *CSV::Lexer::StringBased::new<String, Char, Char>:CSV::Lexer::StringBased +161
[5224902] *CSV::Lexer::new<String, Char, Char>:CSV::Lexer::StringBased +6
[5224401] *CSV::Parser#initialize<String, Char, Char>:Int32 +17
[5224361] *CSV::Parser::new<String, Char, Char>:CSV::Parser +105
[5224241] *CSV::parse<String>:Array(Array(String)) +49
[5222152] *Device#read_from_file<Nil>:String +56
[5222083] *Device#read_from_file:String +19
[5222057] *Device#initialize:String +25
[5222007] *Device::new:Device +87
[4650342] ~DEVICE:init +6
[4574105] ???
[4633593] main +41
[139644618220241] __libc_start_main +241
[4570858] _start +42
[0] ???
With the encoding parameter it works:
csv = CSV.parse File.read("devices.csv", encoding: "UTF-16 LE")
But it would be cool if File.read could detect the encoding automatically instead of always defaulting to UTF-8, because I don't know which encoding future versions of this file will use, and I don't want to re-save it in a required (hard-coded) encoding before the application starts.
What do you think about it? Maybe I'm missing something?
Thanks.
Encoding detection is extremely involved, gives ambiguous results in many cases, and can only ever be a heuristic anyway. It always requires reading and scanning through the entire file. It would slow down every file read. There are different approaches to it that work better in different situations.
I think this is completely out of scope for the standard library, let alone doing it by default. You should know the encoding of your files; if you don't, you should prefer to figure it out once. Only if it keeps changing and you can't predict it should you defer to a detection library. The standard library should require you to be explicit and knowing; it should fail loud and clear.
However, what we should do is set the default external encoding from the system settings.
That is, unless the file has a BOM. Maybe, only when opening files, we could try to detect the encoding if there's a BOM. I think several languages do this (like Go and Ruby, though I'm not sure).
@AlexWayfer Does your file have a BOM?
@jhass I agree, but I've found that it works pretty well in text editors, so I just wanted some magic in the basic File.read method of a wonderful language :) If it's hard to implement here, OK, no problem.
Some references: the Atom editor uses a JS port of this library: https://github.com/chardet/chardet
I think it can be helpful.
@asterite Yeah, that file has a BOM.
I think detecting via the BOM is OK, but guessing the way chardet does is not good. It probably needs to read a lot of the file and maybe do some statistics and frequency analysis? I don't know... It's probably good to do this in a text editor, but not generally in every program.
I honestly don't know much about BOM other than what it stands for, so if anyone wants to tackle this, go ahead.
For reference, here is the Ruby code that opens a file; here it tries to detect the BOM, using this function.
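To make the BOM idea concrete, here is a minimal sketch of what an opt-in check could look like in user code today. Everything in it is an assumption for illustration: `BOMS`, `detect_bom` and `read_with_bom` are hypothetical names, not part of Crystal's standard library, and the encoding names are the usual iconv spellings. It only looks at the first few bytes and never guesses from content.

```crystal
# Hypothetical helpers for illustration; none of this is in the standard library.

# Known byte order marks and the iconv-style encoding name each one implies.
# The UTF-32 entries come first because the UTF-32 LE mark begins with the
# same two bytes as the UTF-16 LE mark.
BOMS = [
  {[0x00_u8, 0x00_u8, 0xFE_u8, 0xFF_u8], "UTF-32BE"},
  {[0xFF_u8, 0xFE_u8, 0x00_u8, 0x00_u8], "UTF-32LE"},
  {[0xEF_u8, 0xBB_u8, 0xBF_u8], "UTF-8"},
  {[0xFE_u8, 0xFF_u8], "UTF-16BE"},
  {[0xFF_u8, 0xFE_u8], "UTF-16LE"},
]

# Returns {encoding name, BOM length in bytes}, or nil when no BOM is present.
def detect_bom(bytes : Bytes) : Tuple(String, Int32)?
  # Only the first four bytes can ever be part of a BOM.
  head = bytes[0, Math.min(bytes.size, 4)].to_a
  BOMS.each do |(bom, name)|
    return {name, bom.size} if head.first(bom.size) == bom
  end
  nil
end

# Reads the whole file, strips a BOM if there is one, and decodes accordingly.
# Without a BOM, the bytes are decoded with the fallback encoding.
def read_with_bom(path : String, fallback : String = "UTF-8") : String
  bytes = File.open(path) do |file|
    buffer = Bytes.new(file.size.to_i32)
    file.read_fully(buffer)
    buffer
  end

  if found = detect_bom(bytes)
    name, bom_size = found
    String.new(bytes + bom_size, name) # skip the BOM itself
  else
    String.new(bytes, fallback)
  end
end
```

With something like this, the call from the original report would become `csv = CSV.parse read_with_bom("devices.csv")`. A file without a BOM still needs an explicit encoding, which keeps the "fail loud and clear" behaviour for everything else.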
It's probably good to do this in a text editor, but not generally in every program.
It can be optional, can't it?
File.read("devices.csv", encoding: :auto)
...or something like this.
UTF-8 by default, as now, but as an option it could be cool.
It's not a problem for me without it right now, but it could be for somebody else, who knows.
Actually, it seems Ruby only checks for a BOM if it's instructed to, so I'm not sure we should auto-detect anything...
General (heuristic) encoding detection should be provided by a library. I'm closing this.