require "base64"
s = Base64.decode "Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo="
m = MemoryIO.new(s)
m.set_encoding("UTF-8", invalid: :skip)
p m.gets_to_end
Unexpected byte 0xf7 at position 1, malformed UTF-8 (InvalidByteSequenceError)
[4405959] *CallStack::unwind:Array(Pointer(Void)) +87
[4405850] *CallStack#initialize:Array(Pointer(Void)) +10
[4405802] *CallStack::new:CallStack +42
[4379448] *raise<InvalidByteSequenceError>:NoReturn +24
[4485253] ???
[4485078] ???
[4485327] *Char::Reader#next_char:Char +47
[4416475] *String#inspect<IO::FileDescriptor>:IO::FileDescriptor +395
[4391483] *p<String>:String +27
[4356273] ???
[4390025] main +41
[139702314276717] __libc_start_main +237
[4354585] ???
It seems iconv works correctly here, but p does not.
This is not valid UTF-8; I don't know what your point is.
Same in Python:
>>> base64.b64decode('Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo=').decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 1: invalid start byte
@BlaXpirit I think you missed invalid: :skip.
In Ruby:
Base64.decode64('Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo=').encode("UTF-8", invalid: :replace, undef: :replace)
=> "\"��������������������������������\"\n"
Maybe a duplicate of #2159?
(I still want to do what I said in my comment; I just want to find the best/correct way to do it.)
Shouldn't .chars work on a string converted with invalid: :skip? It should be valid UTF-8, because all invalid input has been sanitized.
In Ruby it returns:
["\"", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "\"", "\n"]
It seems like an iconv bug to me:
s = Bytes[247, 178, 187, 190]
m = MemoryIO.new(s)
m.set_encoding("UTF-8", invalid: :skip)
s2 = m.gets_to_end
p s2.to_slice # => Slice[247, 178, 187, 190]
p s2
But that sequence is invalid UTF-8, and I don't know why iconv doesn't treat it as invalid. I'll try to find out whether this bug has already been reported, but it's hard to search for.
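To illustrate why those four bytes are not valid UTF-8, here is a minimal Ruby sketch that decodes them by hand. The helper name raw_code_point is made up for this example; it performs no validation and only assumes the usual 4-byte layout (one 11110xxx lead byte followed by three 10xxxxxx continuation bytes). It says nothing about why iconv accepts the sequence.

```ruby
# Decode a 4-byte UTF-8-shaped sequence by hand, without any validation,
# to see which code point the bit pattern would stand for.
# raw_code_point is a hypothetical helper, not part of any library.
def raw_code_point(bytes)
  lead, *rest = bytes
  value = lead & 0x07                                 # 11110xxx: keep the low 3 bits
  rest.each { |b| value = (value << 6) | (b & 0x3F) } # 10xxxxxx: keep the low 6 bits
  value
end

bytes = [247, 178, 187, 190]
cp = raw_code_point(bytes)

puts format("U+%X", cp) # prints U+1F2EFE
puts cp > 0x10FFFF      # prints true: beyond the last Unicode code point
puts bytes.pack("C*").force_encoding("UTF-8").valid_encoding? # prints false
```

The bytes have the shape of a 4-byte sequence, but the value they would encode lies above U+10FFFF, so a strict UTF-8 decoder has to reject the lead byte 0xF7.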
Yes, it seems to be an iconv bug. In Ruby:
[247, 178, 187, 190].pack('c*').encode('utf-8', invalid: :replace, undef: :replace).chars.map &:ord
[65533, 65533, 65533, 65533]
Iconv.new("UTF-8//IGNORE", "UTF-8//IGNORE").iconv([247, 178, 187, 190].pack('c*')).chars.map &:ord
ArgumentError: invalid byte sequence in UTF-8
from (irb):32:in `ord'
from (irb):32:in `map'
from (irb):32
I'm reopening because Crystal should work fine here. We are using iconv and it has a bug, so we can either report the bug and wait for a fix, or use another implementation (ICU, our own, etc.).
I reported it, let's hope they reply :-)
http://lists.gnu.org/archive/html/bug-gnu-libiconv/2016-09/msg00003.html
I think the command should be iconv -f UTF-8//IGNORE -t UTF-8//IGNORE invalid.txt
@kostya Yes, either of the two commands reproduces the problem: with IGNORE the output should be empty, and without IGNORE iconv should give an error.
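For comparison, the expected behaviour described above can be sketched in Ruby without going through iconv. The helper names invalid_sample, skip_invalid and strict_convert are invented for this example. The sketch relies on String#scrub for the lenient case and on transcoding through UTF-16LE for the strict case, and it only approximates what a correct iconv would do.

```ruby
# The four bytes discussed in this thread, tagged as UTF-8.
# All three helpers below are hypothetical names used for illustration.
def invalid_sample
  [247, 178, 187, 190].pack("C*").force_encoding("UTF-8")
end

# Lenient conversion: drop every invalid byte, as //IGNORE is meant to do.
def skip_invalid(str)
  str.scrub("")
end

# Strict conversion: transcoding to another encoding and back raises
# Encoding::InvalidByteSequenceError when the input is malformed.
def strict_convert(str)
  str.encode("UTF-16LE").encode("UTF-8")
end

p skip_invalid(invalid_sample) # prints ""

begin
  strict_convert(invalid_sample)
rescue Encoding::InvalidByteSequenceError => e
  puts "strict conversion failed: #{e.class}"
end
```

With the lenient helper nothing is left of the input, and the strict helper raises. Those are the two outcomes the iconv commands are expected to show.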
After the 0.20.4 release, we should use IO::Memory instead of MemoryIO.
So the updated snippet:
require "base64"
s = Base64.decode "Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo="
m = IO::Memory.new(s)
m.set_encoding("UTF-8", invalid: :skip)
p m.gets_to_end
returns the following error with 0.27.2:
Syntax error in /home/maroo/open_source/ss.cr:3: unexpected token: Irf
s = Base64.decode "Irf+zvHG97K7vt+xuNa00NC4w8frx/PL+dDotcS5psTcIgo="
So this isn't an issue anymore, right?
A syntax error shouldn't be alright. But there isn't one: https://carc.in/#/r/6pr5
Fixed for me on OS X and Linux.