Crystal: iconv bug related to decoding UTF-8

Created on 24 Sep 2016 · 14 Comments · Source: crystal-lang/crystal

require "base64"
s = Base64.decode "Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo="
m = MemoryIO.new(s)
m.set_encoding("UTF-8", invalid: :skip)
p m.gets_to_end
Unexpected byte 0xf7 at position 1, malformed UTF-8 (InvalidByteSequenceError)
[4405959] *CallStack::unwind:Array(Pointer(Void)) +87
[4405850] *CallStack#initialize:Array(Pointer(Void)) +10
[4405802] *CallStack::new:CallStack +42
[4379448] *raise<InvalidByteSequenceError>:NoReturn +24
[4485253] ???
[4485078] ???
[4485327] *Char::Reader#next_char:Char +47
[4416475] *String#inspect<IO::FileDescriptor>:IO::FileDescriptor +395
[4391483] *p<String>:String +27
[4356273] ???
[4390025] main +41
[139702314276717] __libc_start_main +237
[4354585] ???

It seems iconv works correctly here, but p does not.

bug stdlib

All 14 comments

This is not valid UTF-8; I don't see what your point is.

Same in Python:

>>> import base64
>>> base64.b64decode('Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo=').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 1: invalid start byte

@BlaXpirit I think you missed invalid: :skip.
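For comparison with the Python traceback above, here is a rough sketch of what a "skip invalid bytes" decode looks like in Python. The helper name decode_skipping_invalid is made up for illustration, and Python's errors="ignore" handler is only an approximate analogue of Crystal's invalid: :skip, not a statement about how Crystal or iconv implement it.

```python
import base64

# Payload from the original report; it is not valid UTF-8.
raw = base64.b64decode("Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo=")


def decode_skipping_invalid(data: bytes) -> str:
    """Hypothetical helper: decode as UTF-8, dropping bytes that do not
    form a valid UTF-8 sequence (Python's errors="ignore" handler)."""
    return data.decode("utf-8", errors="ignore")


# Strict decoding fails, as in the traceback above.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

cleaned = decode_skipping_invalid(raw)

# Whatever survives the skip must itself be valid UTF-8, so it has to
# round-trip through encode/decode without raising.
round_trip = cleaned.encode("utf-8").decode("utf-8")

print(strict_ok)
print(round_trip == cleaned)
```

The point of the sketch is the invariant, not the exact characters kept: after skipping, the result should always be a well-formed string that later operations (inspecting, iterating over chars) can consume without raising. Some byte pairs in this payload may happen to form valid two-byte sequences, so the cleaned string is not necessarily just the two quotes and the newline.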

In Ruby:

require "base64"
Base64.decode64('Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo=').encode("UTF-8", invalid: :replace, undef: :replace)
=> "\"��������������������������������\"\n"

Maybe duplicate of #2159 ?

(I still want to do what I say in my comment, I just want to find the best/correct way to do it)

Shouldn't .chars work on a string converted with invalid: :skip? The result should be valid UTF-8, because all invalid input has been sanitized.

In Ruby it returns:

["\"", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "\"", "\n"]

It seems like an iconv bug to me:

s = Bytes[247, 178, 187, 190]
m = MemoryIO.new(s)
m.set_encoding("UTF-8", invalid: :skip)
s2 = m.gets_to_end
p s2.to_slice # => Slice[247, 178, 187, 190]
p s2

But that sequence is invalid UTF-8, and I don't know why iconv doesn't treat it as invalid... I'll try to find out whether this bug has already been reported, but it's hard to search for.
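One way to see why these four bytes are invalid: 0xF7 has the bit pattern of a 4-byte lead byte, but the code point it would encode lies above U+10FFFF, the upper limit of Unicode. The sketch below does that arithmetic and shows what a strict UTF-8 decoder (Python's, used here only as a reference point) produces for the same bytes. The function name naive_four_byte_code_point is hypothetical.

```python
# The four bytes from the snippet above: 0xF7 0xB2 0xBB 0xBE.
SEQUENCE = bytes([247, 178, 187, 190])

# Highest code point that UTF-8 is allowed to encode.
MAX_CODE_POINT = 0x10FFFF


def naive_four_byte_code_point(data: bytes) -> int:
    """Hypothetical helper: combine the payload bits as if `data` were a
    4-byte UTF-8 sequence, without any validation."""
    lead, c1, c2, c3 = data
    return (
        ((lead & 0x07) << 18)
        | ((c1 & 0x3F) << 12)
        | ((c2 & 0x3F) << 6)
        | (c3 & 0x3F)
    )


code_point = naive_four_byte_code_point(SEQUENCE)
print(hex(code_point))              # value a lenient decoder would compute
print(code_point > MAX_CODE_POINT)  # out of range, so the sequence is invalid

# Reference behaviour of a strict decoder with skip / replace handling.
skipped = SEQUENCE.decode("utf-8", errors="ignore")
replaced = SEQUENCE.decode("utf-8", errors="replace")
print(repr(skipped))
print([ord(ch) for ch in replaced])
```

With skipping, nothing of the input should survive; with replacement, each byte becomes U+FFFD (65533). A decoder that passes the four bytes through unchanged, as in the s2.to_slice output above, is treating an out-of-range sequence as if it were valid.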

Yes, it seems to be an iconv bug. In Ruby:

[247, 178, 187, 190].pack('c*').encode('utf-8', invalid: :replace, undef: :replace).chars.map &:ord
[65533, 65533, 65533, 65533]

Iconv.new("UTF-8//IGNORE", "UTF-8//IGNORE").iconv([247, 178, 187, 190].pack('c*')).chars.map &:ord
ArgumentError: invalid byte sequence in UTF-8
        from (irb):32:in `ord'
        from (irb):32:in `map'
        from (irb):32

I'm reopening because Crystal should work fine here. We are using iconv and it has a bug so we can either report the bug and wait for a fix or use another implementation (icu, or own, etc.)

I think the command should be iconv -f UTF-8//IGNORE -t UTF-8//IGNORE invalid.txt

@kostya Yes, either of the two commands reproduces the problem: with IGNORE the output should be empty; without IGNORE, iconv should give an error.

After the 0.20.4 release, we should use IO::Memory instead of MemoryIO.
So the updated snippet is:

require "base64"
s = Base64.decode "Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo="
m = IO::Memory.new(s)
m.set_encoding("UTF-8", invalid: :skip)
p m.gets_to_end

It returns the following error with 0.27.2:

Syntax error in /home/maroo/open_source/ss.cr:3: unexpected token: Irf

s = Base64.decode "Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo="

So this isn't an issue anymore, right?

A syntax error shouldn't be alright. But there isn't one: https://carc.in/#/r/6pr5

Fixed for me on OS X and Linux.
