Although the docs say "Char" or "String"
Returns true if the given value is valid for that type. Types currently can be either Char or String.
SubStrings should be accepted as well, but they aren't.
isvalid(lstrip(" blabla"))
ERROR: MethodError: no method matching isvalid(::SubString{String})
Closest candidates are:
isvalid(::SubString, ::Integer) at strings/types.jl:70
isvalid(::AbstractString, ::Integer) at strings/basic.jl:228
isvalid(::Char) at strings/utf8proc.jl:56
Looks pretty easy to extend the current code to handle this. Just widen the signature of byte_string_classify(s::String) to
byte_string_classify(s::Union{String,SubString{String},Vector{UInt8}}) =
ccall(:u8_isvalid, Int32, (Ptr{UInt8}, Int), s, sizeof(s))
deleting the redundant byte_string_classify(data::Vector{UInt8}) method,
and then widen the signature of isvalid(::Type{String}, s::Union{Vector{UInt8},String}) to allow SubString{String}.
If you add it, please also add a test.
In general, it looks like substrings are pretty broken for invalid strings:
julia> s = String(UInt8[0xfe, 0x80, 0x80, 0x80, 0x80, 0x80])
6-byte String of invalid UTF-8 data:
0xfe
0x80
0x80
0x80
0x80
0x80
julia> ss = SubString(s, 1, 2)
"\U80000000"
julia> String(ss)
6-byte String of invalid UTF-8 data:
0xfe
0x80
0x80
0x80
0x80
0x80
@bkamins ^
@stevengj Are you testing this on latest? The result you give is present on 0.6.
Now it should be fixed and ss = SubString(s, 1, 2) should throw BoundsError (if it does not please let me know as this is the expected behavior).
Indeed, it's great that your recent changes fixed this kind of bug:
julia> s = String(UInt8[0xfe, 0x80, 0x80, 0x80, 0x80, 0x80])
6-byte String of invalid UTF-8 data:
0xfe
0x80
0x80
0x80
0x80
0x80
julia> ss = SubString(s, 1, 2)
ERROR: BoundsError: attempt to access 6-byte String of invalid UTF-8 data:
0xfe
0x80
0x80
0x80
0x80
0x80
at index [2]
Stacktrace:
[1] SubString{String}(::String, ::Int64, ::Int64) at ./strings/types.jl:33
[2] SubString(::String, ::Int64, ::Int64) at ./strings/types.jl:38
Bump
You're welcome to work on it, but given that it's a small improvement which can be added in any 1.x release, it may not get attention from core developers in the short term.
But isn't this fixed by #24132? It only needs to be merged, I think.
Ah, you mean https://github.com/JuliaLang/julia/pull/25285? I've rebased it, let's see how it goes. FWIW, better bump the PR than the issue next time.
Is this issue still open? I would like to work on this issue.
@raghav9-97 You could start from https://github.com/JuliaLang/julia/pull/25285 and check what's needed to fix the failures.
Thanks for the reference @nalimilan
Most helpful comment
Looks pretty easy to extend the current code to handle this. Just widen the signature of
byte_string_classify(s::String)todeleting the redundant
byte_string_classify(data::Vector{UInt8})method,and then widen the signature of
isvalid(::Type{String}, s::Union{Vector{UInt8},String})to allowSubString{String}.