Crystal: Auto-detect encoding

Created on 8 Aug 2016 · 7 Comments · Source: crystal-lang/crystal

Hello!
I have a CSV file from here.

It's in UTF-16 LE encoding, and this code

csv = CSV.parse File.read("devices.csv")

failed with output:

$ crystal src/test.cr 
Unexpected byte 0xfe at position 1, malformed UTF-8 (InvalidByteSequenceError)
[4700487] *CallStack::unwind:Array(Pointer(Void)) +87
[4700378] *CallStack#initialize:Array(Pointer(Void)) +10
[4700330] *CallStack::new:CallStack +42
[4620888] *raise<InvalidByteSequenceError>:NoReturn +24
[5125669] ???
[5124664] *Char::Reader#decode_current_char:Char +232
[5124418] *Char::Reader#initialize<String>:Char +34
[5124322] *Char::Reader::new<String>:Char::Reader +98
[5225163] *CSV::Lexer::StringBased#initialize<String, Char, Char>:(Int32 | Nil) +59
[5225073] *CSV::Lexer::StringBased::new<String, Char, Char>:CSV::Lexer::StringBased +161
[5224902] *CSV::Lexer::new<String, Char, Char>:CSV::Lexer::StringBased +6
[5224401] *CSV::Parser#initialize<String, Char, Char>:Int32 +17
[5224361] *CSV::Parser::new<String, Char, Char>:CSV::Parser +105
[5224241] *CSV::parse<String>:Array(Array(String)) +49
[5222152] *Device#read_from_file<Nil>:String +56
[5222083] *Device#read_from_file:String +19
[5222057] *Device#initialize:String +25
[5222007] *Device::new:Device +87
[4650342] ~DEVICE:init +6
[4574105] ???
[4633593] main +41
[139644618220241] __libc_start_main +241
[4570858] _start +42
[0] ???

With the encoding parameter it works:

csv = CSV.parse File.read("devices.csv", encoding: "UTF-16 LE")

But it would be cool if File.read could detect the encoding automatically rather than only defaulting to UTF-8, because I don't know which encoding the next versions of this file will use, and I don't want to re-save the file in a required (hard-coded) encoding before the application starts.

What do you think about it? Maybe there's something I don't know about?

Thanks.

draft stdlib

AlexWayfer

👎3

All 7 comments

Encoding detection is extremely involved, gives ambiguous results in many cases, and can only ever be a heuristic anyway. It always requires reading and scanning through the entire file, which would slow down every file read. There are different approaches to it that work better in different situations.

I think this is completely out of scope for the standard library, especially doing it by default. You should know the encoding of your files; if you don't, you should prefer to figure it out once. Only if it keeps changing and you can't predict it should you defer to a detection library. The standard library should require you to be explicit and knowing; it should fail loud and clear.

However, what we should do is set the default external encoding from the system settings.

jhass on 8 Aug 2016

👍2

That is, except the file has a BOM. Maybe, only when opening files, we could try to detect the encoding if there's a BOM. I think several languages do this (like Go and Ruby, though I'm not sure).

@AlexWayfer Does your file have a BOM?

asterite on 8 Aug 2016

👍2

@jhass I agree, but I've found that it works pretty well in text editors, so I just want some magic in the basic File#read method of a wonderful language :) If it's hard to implement here, OK, no problem.
Some references: Atom Editor uses a JS port of this library: https://github.com/chardet/chardet
I think it can be helpful.

@asterite Yeah, that file has a BOM.

AlexWayfer on 8 Aug 2016

I think detecting with BOM is OK, but doing a guess like what chardet does is not good. It probably needs to read a lot of the file and maybe do some stats and frequency? I don't know... It's probably good to do this in a text editor, but not generally in every program.

I honestly don't know much about BOM other than what it stands for, so if anyone wants to tackle this, go ahead.

For reference, here is the Ruby code that opens a File; here it tries to detect the BOM, using this function.

asterite on 9 Aug 2016

👍1
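
To make the BOM idea from the comment above concrete, here is a minimal Crystal sketch. It is not part of the standard library and not something proposed in this thread: the `BOMS` table and the `detect_bom` helper are hypothetical names, and the encoding names assume the usual iconv spellings. It only compares the first few bytes against the well-known byte order marks, so no statistics or full-file scan are involved.

```crystal
# Hypothetical table of byte order marks, longest patterns first.
# UTF-32 LE must be checked before UTF-16 LE because both start with FF FE.
# (A UTF-16 LE text whose first character is U+0000 would be misread as
# UTF-32 LE; this sketch accepts that ambiguity.)
BOMS = [
  {"UTF-32LE", Bytes[0xFF, 0xFE, 0x00, 0x00]},
  {"UTF-32BE", Bytes[0x00, 0x00, 0xFE, 0xFF]},
  {"UTF-8", Bytes[0xEF, 0xBB, 0xBF]},
  {"UTF-16LE", Bytes[0xFF, 0xFE]},
  {"UTF-16BE", Bytes[0xFE, 0xFF]},
]

# Hypothetical helper: returns the encoding name and the BOM length,
# or nil when the data does not start with a known BOM.
def detect_bom(bytes : Bytes) : Tuple(String, Int32)?
  BOMS.each do |entry|
    name, bom = entry
    if bytes.size >= bom.size && bytes[0, bom.size] == bom
      return {name, bom.size}
    end
  end
  nil
end

# Usage: a UTF-16 LE BOM followed by the letter "a".
sample = Bytes[0xFF, 0xFE, 0x61, 0x00]
if result = detect_bom(sample)
  encoding, bom_size = result
  # Skip the BOM bytes and decode the rest with the detected encoding.
  puts String.new(sample + bom_size, encoding)
end
```

Data without a BOM yields nil, so the caller still has to choose an encoding explicitly in that case.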

"It's probably good to do this in a text editor, but not generally in every program."

It can be optional, can't it?

File.read("devices.csv", encoding: :auto)

...or something like this.

UTF-8 by default, as now, but as an option it would be cool.

It's not a problem for me without it right now, but it could be for somebody, who knows.

AlexWayfer on 9 Aug 2016
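
An opt-in variant like the one suggested above could be approximated in user code today. The sketch below is only an illustration under assumptions: `read_text`, `KNOWN_BOMS`, and the `"auto"` value are hypothetical and do not exist in Crystal. Any explicit encoding goes straight to `File.read`, so the default behaviour stays unchanged; only `"auto"` looks for a BOM.

```crystal
# Hypothetical BOM table; UTF-32 LE comes first because it shares
# its first two bytes (FF FE) with UTF-16 LE.
KNOWN_BOMS = {
  "UTF-32LE" => Bytes[0xFF, 0xFE, 0x00, 0x00],
  "UTF-32BE" => Bytes[0x00, 0x00, 0xFE, 0xFF],
  "UTF-8"    => Bytes[0xEF, 0xBB, 0xBF],
  "UTF-16LE" => Bytes[0xFF, 0xFE],
  "UTF-16BE" => Bytes[0xFE, 0xFF],
}

# Hypothetical wrapper around File.read with an opt-in "auto" mode.
def read_text(path : String, encoding : String = "UTF-8") : String
  # Explicit encodings behave exactly like File.read does today.
  return File.read(path, encoding: encoding) unless encoding == "auto"

  # Opt-in path: load the raw bytes first.
  raw = File.open(path) do |file|
    buffer = IO::Memory.new
    IO.copy(file, buffer)
    buffer.to_slice
  end

  # If the file starts with a known BOM, skip it and decode the rest.
  KNOWN_BOMS.each do |name, bom|
    if raw.size >= bom.size && raw[0, bom.size] == bom
      return String.new(raw + bom.size, name)
    end
  end

  # No BOM found: treat the bytes as UTF-8 (no validation is done here).
  String.new(raw)
end
```

With this, `read_text("devices.csv", "auto")` would handle the BOM-marked file from the original post, while a file without a BOM is still read as UTF-8.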

Actually, it seems Ruby only checks BOM if it's instructed so, so I'm not sure we should auto-detect anything...

asterite on 9 Aug 2016

General (heuristic) encoding detection should be provided by a library. I'm closing this.

asterite on 13 Dec 2016
