Current handling of invalid UTF-8 strings is inconsistent between different string-handling functions in Base.
I have seen similar discussions in the past, but they related to earlier implementations of strings, so I summarize here the kinds of problems I currently see (apologies for the noise if this was already decided earlier).
Here is an example with comments:
```julia
x = Vector{UInt8}("∀")
y1 = String(x[1:2])  # invalid string
y2 = y1 * "1"        # invalid string with a valid character appended

length(y1)     # returns 1
length(y2)     # returns 2
endof(y1)      # returns 1
endof(y2)      # returns 3
next(y1, 1)    # returns ('�', 2)
next(y2, 1)    # returns ('↱', 4) and skips index 3, where '1' is located
nextind(y1, 1) # returns 3
nextind(y2, 1) # returns 3, where '1' is located
collect(y1)    # throws an error
collect(y2)    # returns a two-element array whose second entry is UNDEFINED (usually 0)
for c in y1 println(c) end # throws an error
for c in y2 println(c) end # prints one character, '↱'
[c for c in y1] # throws an error
[c for c in y2] # returns a two-element array whose second entry is UNDEFINED (usually 0)
```
The reasons for those inconsistencies are:
- `endof`, `length` and `nextind` use `is_valid_continuation` to check for valid indices (i.e. they assume that anything that is not `is_valid_continuation` starts a new character);
- `next` behaves differently (and, in particular, inconsistently with `nextind`): in `slow_utf8_next` it loads the whole width of the character as indicated by `utf8_trailing`, without checking whether the following bytes are valid continuations.

This shows why `collect(y2)` produces garbage in its second element: `length(y2)` indicates that there are two characters, but `next(y2, 1)` jumps to index 4, beyond the end of `y2`.
I assume that there is no canonical way to handle invalid UTF-8 (since it is, after all, invalid), but I would recommend using one consistent rule in all string-handling functions. The least breaking change that would make things consistent would be to change `next` so that it advances to the same point as `nextind` and, for an invalid sequence, always returns `'\ufffd'` (currently it may return garbage, like `next(y2, 1)[1]`).
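A minimal sketch of that rule, assuming the 0.6-era string internals discussed above (this is not the code from #24423; `next_fixed` and `is_continuation` are illustrative names):

```julia
# Advance exactly as far as `nextind` does, and never let garbage
# escape: if the bytes at `i` do not form a valid character, return
# the replacement character '\ufffd' instead.
is_continuation(b::UInt8) = b & 0xc0 == 0x80

function next_fixed(s::String, i::Int)
    n = sizeof(s)
    b = codeunit(s, i)
    b < 0x80 && return (Char(b), i + 1)       # ASCII fast path
    j = i + 1
    while j <= n && is_continuation(codeunit(s, j))
        j += 1                                # skip continuation bytes,
    end                                       # exactly like `nextind`
    chunk = String([codeunit(s, k) for k in i:j-1])
    c = isvalid(chunk) ? first(chunk) : '\ufffd'
    return (c, j)
end
```

With this, `next_fixed(y2, 1)` returns `('\ufffd', 3)`, agreeing with `nextind(y2, 1)`.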
Great analysis, thanks. That change to `next` sounds OK, but we'll have to be careful about performance.
I have made a PR in #24423 to show what the changed `next` could look like.
Additionally, I have identified a small problem with `nextind`. If `s` is a string where `s[1]` is a continuation byte, then currently `nextind(s, 0)` returns `1`, which violates the contract "Get the next valid string index after i." specified in the documentation. Similarly, `start(s)` returns `1`, which is an invalid iteration state.
I do not know how much we want to care about invalid strings, but in practice they do happen.
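A small reproduction of this contract violation (assuming the 0.6-era behavior described above; `s` here is just an illustrative two-byte example):

```julia
s = String([0x88, 0x31])  # s[1] is a continuation byte
nextind(s, 0)             # returns 1, yet no character starts there
start(s)                  # also returns 1 - an invalid iteration state
```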
So I have a plan for this that I've been meaning to implement for quite some time. The basic idea is to allow bytes that are not part of a valid character to be represented as special `Char` values. It would still be an error to index into a valid character sequence anywhere except at its leading index, but indexing at a position that is not part of a valid UTF-8 character sequence would return an invalid `Char` value representing that byte (not a valid code point). You could then even write such an invalid `Char` value back out to a stream and get the original byte back. This has a number of benefits:

- Being able to round-trip arbitrary binary data via reading and writing `Char` values.
- The result of writing an invalid `String` to an I/O stream matches the result of iterating it as `Char` values and then writing those to the I/O stream.
- Not throwing errors when dealing with invalid `String` data, which occurs quite often in the real world.

Converting an invalid `Char` value to an integer type would still be an error, since that requires a valid code point to return and an invalid `Char` by definition does not correspond to one. But as long as you don't need the code point, it is perfectly fine to work with an invalid `Char`.
It seems like it's time to give this approach another try.
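For concreteness, a sketch of what this behavior would look like (it matches what eventually shipped in Julia 1.x, but take the exact calls as illustrative):

```julia
s = "\xe2\x88"         # an invalid String written with \x escapes
c = first(s)           # an invalid Char carrying the raw bytes
isvalid(c)             # false - it is not a valid code point
sprint(print, c) == s  # true - writing it back reproduces the bytes
UInt32(c)              # throws, since no code point exists for it
```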
@StefanKarpinski It would be great to have this for 1.0. Is there a place where the design is laid out? (Given that I have already spent some time on strings recently, I would gladly help work on it.)
The key question is how the boundaries of an invalid `Char` would be established.
What I mean by this is that between valid characters you can have almost anything of any length. For example, would the sequence
`0xF0 0x82 0x82 0xAC`
be a single invalid character (this is an example of an overlong encoding) or four invalid characters? (I am inclined to think it should be four invalid characters.)
> The key question is how the boundaries of an invalid `Char` would be established.

The most straightforward approach is individual bytes. This is what many implementations already do – e.g. when you see mojibake in browsers, it is for individual invalid bytes. The most recent Unicode standard has some comment on this, which I believe says something at odds with this – I'll have to dig it up.
I can't find the reference to Unicode changing its recommendation on invalid encoding handling :(
The standard http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf states on page 129:

> Whenever an unconvertible offset is reached during conversion of a code unit sequence:
> - The maximal subpart at that offset should be replaced by a single U+FFFD.
> - The conversion should proceed at the offset immediately after the maximal subpart.

where well-formed UTF-8 byte sequences are given by Table 3-7 there.
I would recommend following the standard, as in my view it is important to be able to say that Julia 1.0 conforms 100% to it.
This would mean:
UInt32
is always set to 0 in well formed characters);The consequences are that:
next
; it will be a bit slower than current next
but I believe that conformance is worth the cost;nextind
produces exactly the same traversal as next
;prevind
will be even more costly (as it will have to do tracebacks in corner cases);Char
interpretation (and conversion) and string printing has to be redefined;isvalid_unsafe
, endof
, length
, reverseind
, isvalid
, getindex
, search
for String
;Char
if they would probably handle invalid characters;Char
, next
, nextind
, prevind
, thisind
(the last if it is accepted from #24414); make sure that the contracts are consistent between String
and AbstractString
and the respective implementations; this might imply some additional cleanup (as recommended by @nalimilan in #24255) as current behavior of prevind
and nextind
is not 100% consistent between String
and AbstractString
on string boundaries;The bad thing is that (maybe except documentation - but this part is simple) all should be done probably in one big PR as the changes are interconnected.
And actually #24414 will be even more important if this proposal is implemented, as `isvalid(s::String, i)` should then probably be defined as something like `i > 0 && thisind(s, i) == i`.
The reason is that it will no longer be possible to check whether something is a valid index into a string simply by using `is_valid_continuation` (and `thisind` will handle the case where a byte is a valid continuation but should nevertheless be treated as a valid index because it is part of some invalid byte sequence).
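To make the maximal-subpart rule concrete, here is a rough sketch of a scanner following Table 3-7 (the function name and structure are illustrative, not proposed Base code). It returns the index just past the valid sequence or maximal subpart starting at byte `i`, so a decoder can emit one `U+FFFD` per invalid subpart and continue from the returned index:

```julia
function maximal_subpart_end(b::Vector{UInt8}, i::Int)
    n, lead = length(b), b[i]
    lead < 0x80 && return i + 1                     # ASCII byte
    # (number of continuation bytes, valid range of the *first*
    # continuation byte) per Table 3-7; `nothing` = invalid lead
    spec = lead in 0xc2:0xdf ? (1, 0x80:0xbf) :
           lead == 0xe0      ? (2, 0xa0:0xbf) :
           lead in 0xe1:0xec ? (2, 0x80:0xbf) :
           lead == 0xed      ? (2, 0x80:0x9f) :
           lead in 0xee:0xef ? (2, 0x80:0xbf) :
           lead == 0xf0      ? (3, 0x90:0xbf) :
           lead in 0xf1:0xf3 ? (3, 0x80:0xbf) :
           lead == 0xf4      ? (3, 0x80:0x8f) : nothing
    spec === nothing && return i + 1                # lone invalid byte
    trail, first_range = spec
    j = i + 1
    (j <= n && b[j] in first_range) || return j     # stop before a bad byte
    j += 1
    for _ in 2:trail                                # rest must be 0x80-0xbf
        (j <= n && b[j] in 0x80:0xbf) || return j
        j += 1
    end
    return j                                        # complete valid sequence
end
```

For `[0xF0, 0x82, 0x82, 0xAC]` this yields four one-byte subparts (four `U+FFFD`, matching the earlier intuition about the overlong encoding), while for `[0xF1, 0x82, 0x82, 0x70]` it yields one three-byte subpart followed by the valid `'p'`.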
Ah, thanks for finding that! I'm a bit ambivalent about the Unicode 10 recommendation here since, e.g., when processing data that is a mix of UTF-8 and Latin-1 (which is fairly common), it will emit replacement code points that stand for varying numbers of Latin-1 characters. I don't think the reasons they give for having a standard way to do this apply here, since this decoding will not interact with non-Julia systems. When writing an invalid character out, we should output the exact bytes that came in. This is at odds with "invalid characters are interpreted as U+FFFD in string processing": for comparison with other code points we should use value equality; for output to a stream we should just output the original bytes. In some sense, I'm not entirely sure that the subject of that section – i.e. "Best Practices for Using U+FFFD" – actually applies here. We're not actually replacing invalid data with `'\uFFFD'`; we're allowing invalid data to pass through character-level processing without replacing it. If some recipient of that data chooses to interpret or print invalid sequences as `'\uFFFD'`, that's their business.
I also meant not to change everything invalid to `'\uFFFD'` – just to convert invalid characters to this value, e.g. when printing a string. In the output stream the original bytes should be passed through.
Given your proposal, the question is how you would convert a string such as `String([0xF1, 0x82, 0x82, 0x70])` to characters.
My understanding is that according to the Unicode standard the first three bytes should be treated as a single invalid character and the last byte is a valid `'p'`.
Given my earlier proposal to set the highest byte to indicate an invalid character, it would be encoded as two characters: `0x80F18282` and `0x00000070`. This is OK, as any invalid character sequence will have no more than 3 bytes.
Then, on conversion back to bytes, we can check any `Char` against the `0x80000000` mask to see whether it is valid. If it is valid, we decode it to UTF-8. If it is not, we can simply read back the original invalid bytes (in this case 3 bytes, but in general it is easy to check whether it should be 1, 2 or 3 bytes, as `0x00` is well-formed UTF-8 and so will not appear in invalid sequences).
And e.g. when printing characters, we check the `Char` against the `0x80000000` mask, and if it is invalid we print `'\uFFFD'`.
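A sketch of this tagging scheme (helper names are mine, purely illustrative):

```julia
# Up to 3 invalid bytes are packed into a UInt32 whose top bit is set;
# well formed code points never exceed 0x0010FFFF, so a nonzero top
# byte unambiguously marks an invalid Char.
tag_invalid(bytes::Vector{UInt8}) =
    foldl((acc, b) -> (acc << 8) | b, bytes; init = UInt32(0)) | 0x80000000

is_tagged_invalid(u::UInt32) = u & 0x80000000 != 0

function untag(u::UInt32)
    u &= 0x7fffffff
    out = UInt8[]
    while u != 0                    # 0x00 never occurs inside an invalid
        pushfirst!(out, u % UInt8)  # sequence, so zero bytes reliably
        u >>= 8                     # mark unused slots
    end
    return out
end

# tag_invalid([0xF1, 0x82, 0x82]) == 0x80f18282
# untag(0x80f18282)               == [0xF1, 0x82, 0x82]
```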
How would encoding of the invalid UTF-8 byte sequence `String([0xF1, 0x82, 0x82, 0x70])` work under your proposal? I understand it would be 4 characters, but what would their representation be?
> I also meant not to change everything invalid to `'\uFFFD'` – just to convert invalid characters to this value, e.g. when printing a string. In the output stream the original bytes should be passed through.

So you mean when printing a quoted version of the string? In that situation it seems better to print it using `\x` escapes in a way that reproduces the same invalid string, e.g. `"\xf1\x82\x82\x70"`, to use your example of an invalid string. Otherwise, printing and outputting to a stream are one and the same.
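A quick illustration (the exact outputs are what Julia 1.x produces today, shown here as an assumption about the proposal):

```julia
s = "\xf1\x82\x82\x70"
repr(s)   # the escaped literal "\xf1\x82\x82p" - parsing it
          # reconstructs the same invalid string
print(s)  # writes the original four bytes to the stream, unescaped
```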
> Given your proposal, the question is how you would convert a string such as `String([0xF1, 0x82, 0x82, 0x70])` to characters.

As three separate invalid characters followed by `'p'` – one character for each byte.
> My understanding is that according to the Unicode standard the first three bytes should be treated as a single invalid character and the last byte is a valid `'p'`.

The standard specifies how to replace invalid encodings with `'\ufffd'`, the replacement character. But that's not really what we're doing here, so it's not entirely clear whether the standard applies. If you consider the proposed invalid `Char` values to be a kind of data-preserving proxy for `'\ufffd'`, then perhaps it does; if you don't, then it doesn't really apply. The reason they give for standardizing on one approach is "To promote interoperability in the implementation of conversion processes, the Unicode Standard recommends a particular best practice." However, I don't think interoperability really applies here, since the invalid `Char` values will never be visible outside of Julia itself. I suppose, however, that derived quantities could be visible – e.g. string length. If you count characters, an implementation following the standard and counting each maximal subpart as a single `'\ufffd'` would get one character count, whereas if we go byte by byte you'd get a different one. Perhaps that's a good enough reason to follow the standard.
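The divergence is easy to state concretely:

```julia
s = "\xf1\x82\x82"  # a single three-byte maximal subpart
# Unicode 10 maximal-subpart counting: one '\ufffd'  => length(s) == 1
# byte-per-Char counting:              three Chars   => length(s) == 3
```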
> Given my earlier proposal to set the highest byte to indicate an invalid character, it would be encoded as two characters: `0x80F18282` and `0x00000070`. This is OK, as any invalid character sequence will have no more than 3 bytes.
> Then, on conversion back to bytes, we can check any `Char` against the `0x80000000` mask to see whether it is valid. If it is valid, we decode it to UTF-8. If it is not, we can simply read back the original invalid bytes (in this case 3 bytes, but in general it is easy to check whether it should be 1, 2 or 3 bytes, as `0x00` is well-formed UTF-8 and so will not appear in invalid sequences).
Sure, that's certainly an option. We may want to consider whether we can represent invalid data in other encodings as well, but perhaps that's a step too far. Invalid UTF-16 code units are easy enough, since such a unit can only be a single pair of bytes, but invalid UTF-32 code units are a little trickier, since they can be arbitrarily large.
> And e.g. when printing characters, we check the `Char` against the `0x80000000` mask, and if it is invalid we print `'\uFFFD'`.
See above. I'm not sure what you consider "printing" to be here.
> How would encoding of the invalid UTF-8 byte sequence `String([0xF1, 0x82, 0x82, 0x70])` work under your proposal? I understand it would be 4 characters, but what would their representation be?
The representation I'm considering is zero-padded UTF-8 data in 32-bit chunks. That means that you always just print 1 low byte and 2-3 additional non-zero bytes. Decoding becomes entirely a matter of deciding where to break the input stream into chunks.
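A hedged illustration of that layout (these values can be observed in Julia 1.x, where this representation eventually landed):

```julia
reinterpret(UInt32, 'a')  # 0x61000000 - UTF-8 bytes, high-aligned,
reinterpret(UInt32, '∀')  # 0xe2888000   zero-padded on the right
reinterpret(UInt32, first("\xe2\x88"))  # 0xe2880000 - an invalid Char
```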
Amusingly, I just hit a case where the Unicode 10 behavior would be much easier to implement 😂
I like your approach, and I think it is better than mine.
If I understand it correctly, a `Char` would hold UTF-8-encoded values, and they would be decoded to a Unicode code point only when needed. Yes?
The question of how to treat invalid UTF-8 is orthogonal – i.e. no matter which approach is chosen (the Unicode 10 approach or the single-byte approach), both can be safely implemented using the new `Char` model you have proposed.
My thinking is that the Unicode 10 approach should be preferred unless it introduces a significant performance penalty. But I believe it might even be cheaper: with the single-byte approach, if you are running `next` and find that the 3rd byte is invalid, you have to go back to the first byte, and on the next run of `next` you start with the 2nd byte; whereas with the Unicode 10 approach you put everything you have read so far into a single `Char`, and there is no need to trace back.
Here's some work in progress on my proposed `Char` representation: https://github.com/JuliaLang/julia/pull/24439.
Agreed that the handling of invalid UTF-8 is orthogonal. And yes, the Unicode 10 approach only requires a single byte of lookahead, which is considerably easier to deal with.
@StefanKarpinski When I run my benchmark given above, I get the following problem (reduced to what is relevant):

```julia
x = Vector{UInt8}("∀")
y1 = String(x[1:2])  # invalid string
y2 = y1 * "1"        # invalid string with a valid character appended

next(y2, 1)    # returns ('\xe2\x88', 3), as expected
nextind(y2, 1) # returns 4, which is inconsistent
```

Is this intended? (I suppose not.)
Yup, that's a bug – thanks for testing and catching it! I've got a fix coming. Next week, after the feature freeze, I'm going to write a comprehensive test suite for all these string functions (and write faster versions of multi-step `nextind` and `prevind` for `String`); I didn't have time to get the test suite done before the freeze. We have lots of tests for valid UTF-8 strings but almost none for invalid ones, since we previously just barfed on them.
closing this as it is fixed now.