I use this code to produce a .jl file that contains an invalid UTF-8 sequence:
ba = UInt8['a', 245, 'b']
s = String(ba)
f = open("foo.jl", "w")
println(f, "println('a')")
println(f, "# Hallo", s)
println(f, "println('c')")
close(f)
When I run this with julia foo.jl, I get this as output:
a
So it seems as if Julia crashes/exits when it encounters the invalid UTF-8.
My sense from a Slack conversation was that @StefanKarpinski thought that Julia should probably just detect these cases early on and show an error message.
It's not crashing; it's causing the newline to be skipped so the next line becomes part of the comment. Agreed it should be an error.
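A minimal sketch of the kind of early check being discussed, assuming a hypothetical helper named `check_utf8_source` (it is not part of Julia). It only uses `eachline` and `isvalid`, reads the file line by line, and raises an error naming the first line that is not valid UTF-8:

```julia
# Hypothetical helper, not part of Julia: reject a source file that contains
# invalid UTF-8 before it is handed to `include`.
function check_utf8_source(path::AbstractString)
    for (lineno, line) in enumerate(eachline(path))
        # `isvalid(::String)` is false when the string holds malformed UTF-8.
        if !isvalid(line)
            error("invalid UTF-8 sequence in $(path), line $(lineno)")
        end
    end
    return true
end
```

Called on the `foo.jl` generated above, this should throw and point at line 2 instead of letting the file run partially.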
"Julia crashes/exits"
That's not good, but what's the alternative? I like 8-bit encodings such as ISO8859-1, because every byte sequence is defined. Is that what we want (e.g. "html" below), or do we want "fatal", which is pretty much what we do now, except: do we really want include to throw an exception?
change illegal-UTF-8 handling to Unicode "best practice"
https://unicode-org.atlassian.net/browse/ICU-13311
"Since Unicode 6, the standard "recommends" a "best practice" (unusual & weak words in this context) that treats illegal UTF-8 sequences like sequences in other MBCS character sets, as a state machine would do.
This has been enshrined in the W3C Encoding Standard.
I predict that more and more code will be pushed to conform to that, or else code that's different will be avoided or worked around."
https://www.w3.org/TR/encoding/#utf-8-decoder
https://www.w3.org/TR/encoding/#error
Otherwise, if _result_ is error, switch on _mode_ and run the associated steps:
"replacement": Push U+FFFD to output.
"html": Prepend U+0026, U+0023, followed by the shortest sequence of ASCII digits representing result’s code point in base ten, followed by U+003B to input.
"fatal": Return error.
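To make the difference between the "replacement" and "fatal" modes concrete, here is a rough Julia sketch. The function `decode_utf8` and its `mode` keyword are hypothetical names for illustration only. The "html" mode is left out because, as far as I understand the Encoding Standard, it applies to encoders rather than decoders. The sketch replaces each malformed `Char` as Julia iterates it, so the number of U+FFFD characters may differ from what the W3C algorithm would emit for some inputs:

```julia
# Hypothetical sketch, not an existing Julia API.
function decode_utf8(bytes::Vector{UInt8}; mode::Symbol = :replacement)
    # Copy first: String(::Vector{UInt8}) takes ownership of the buffer.
    s = String(copy(bytes))
    isvalid(s) && return s
    mode === :fatal && throw(ArgumentError("invalid UTF-8 sequence"))
    mode === :replacement || throw(ArgumentError("unknown mode: $(mode)"))
    # Swap every malformed character for U+FFFD REPLACEMENT CHARACTER.
    return map(c -> isvalid(c) ? c : '\ufffd', s)
end
```

With the bytes from the original report, `decode_utf8(UInt8[0x61, 0xf5, 0x62])` yields a string with U+FFFD in the middle, while `mode = :fatal` throws. For source files, the "fatal" behaviour is the one that avoids silently running something different from what the file appears to say.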
[For us, since we do have a choice, the "html" error handling seems more appropriate than their default "replacement" (or we could possibly interpret illegal bytes as ISO8859-1, or even better Windows-1252?)]
Any invalid UTF-8 sequence should result in an immediate and clear error. The issue here is not that Julia should be more lenient, but that it should crash harder and more clearly.
Right; replacement characters are sometimes appropriate, but not here, since that would mean the program could silently do something different from what it looks like it does. I can understand wanting to support source files in ISO8859-1, but that's a different issue. Given that we're using UTF-8, invalid sequences are just a no-go.