ASCIIString, UTF8String and ByteString into String, #16058
[x] Compat: add String, https://github.com/JuliaLang/Compat.jl/pull/192
[x] Base: replace utf8 and bytestring with String, #16453, #16469
[x] Base: replace s = ascii(s) with s = String(s); isascii(s) || error(...), #16396
[x] Base: cleanup conversion mess to and from string types #16470, #16713, #16731
Cwstring #16975, #16974
[x] Base: remove UTF16String and UTF32String #16590
[x] figure out LineEdit test change, https://github.com/JuliaLang/julia/pull/16058#discussion_r61192341, fix: https://github.com/JuliaLang/julia/pull/16198
String inner constructor, https://github.com/JuliaLang/julia/pull/16058#discussion_r61228065
String inner constructor, https://github.com/JuliaLang/julia/pull/16058#issuecomment-217059003
jl_is_utf8_string and jl_is_byte_string with jl_is_string, https://github.com/JuliaLang/julia/pull/16058#discussion_r61232464
ascii in uv_getaddrinfo to error on non-ASCII domain names, https://github.com/JuliaLang/julia/pull/16058#discussion_r61252081
ascii in SuiteSparse to error on non-ASCII inputs
String type, https://github.com/JuliaLang/julia/commit/5de52cf9c9343cfcf50be4c7c736290d3f985961#commitcomment-17373792, https://github.com/JuliaLang/julia/pull/16221
readdlm's ignore_invalid_chars option, https://github.com/JuliaLang/julia/pull/16058#discussion_r61190549, e95f5f28292a805997603e4f351da9989066c4f8
readdlm's ignore_invalid_chars option 969d61b2e6d393aad3328aa137174ac3f99ae65f
connect in docs, doc/manual/interacting-with-julia.rst (with doctests), https://github.com/JuliaLang/julia/pull/16058#discussion_r61525930
Char representation (allow lossless string processing of any data)
RepString (moved to LegacyStrings)
RevString (move to package?)
[ ] Base: merge SubString and String (add offset field to String)
[ ] make prevind("ll", 5) and such errors, https://github.com/JuliaLang/julia/pull/16058#issuecomment-216957933
https://github.com/JuliaLang/julia/pull/16058#discussion_r61231943
isspace implementation, https://github.com/JuliaLang/julia/pull/16058#discussion_r61231737
convert method breaks bootstrap, https://github.com/JuliaLang/julia/pull/16058#discussion_r61231162
takebuf API, https://github.com/JuliaLang/julia/pull/16058#discussion_r61229502, https://github.com/JuliaLang/julia/pull/19088
Great list.
What about windows APIs that use utf-16?
> What about windows APIs that use utf-16?
Already taken care of: https://github.com/JuliaLang/julia/pull/15033.
Two more potential rounds:
RepString and RevString
SubString and String
Might not get to those until the next release though.
What about:
I definitely think there's a lot to play with around merging SubString and String, but it certainly feels like 0.6 material.
It was just added, but now that readstring(io) is just String(read(io)) it maybe doesn't even need a separate name forever
just for my curiosity: Why should this be part of 0.5? The title of the release had something to do with Arrays ... not Strings?
+1 for including as much of this plan as possible into 0.5. Breakage better happen soon.
As regards moving ASCIIString, etc. to a package (round 3, bullet 2), see discussion on an implementation for any encoding in StringEncodings.jl here.
As regards changing Char's underlying representation (round 4, bullet 1), another step would also be to introduce AbstractChar and use it in method signatures to allow e.g. the ASCIIString replacement to implement ASCIIChar <: AbstractChar as a UInt8. This would allow people working with ASCII data to actually enjoy higher performance than before.
EDIT: Finally, I'd like to see a discussion regarding the opportunity of ensuring that String only holds valid UTF-8 data or not, and how to handle file paths (special string type or not). But I don't want to derail this already rich thread, so maybe better open a separate issue for that?
> just for my curiosity: Why should this be part of 0.5? The title of the release had something to do with Arrays ... not Strings?
Because I've been working on this for months and it's ready to go. It won't hold up the release anyway.
I somehow agree, it will not hold the release of the language. But more syntax changes in 0.5 give some impact in porting packages to 0.5; maybe i'm just wrong that this causes effort ...
> I somehow agree, it will not hold the release of the language. But more syntax changes in 0.5 give some impact in porting packages to 0.5; maybe i'm just wrong that this causes effort ...
And more syntax changes in 0.6 will cause effort for even more packages. And given the growth rate of the package ecosystem...
True – added to the roadmap.
Actually, question here: do people think we should keep ascii and have it convert strings to standard String type and error if the content is not plain ASCII? It's kind of a useful function to have.
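For reference, the checklist pattern `s = String(s); isascii(s) || error(...)` can be sketched as a small helper. The name `require_ascii` is hypothetical and used only for illustration, and the sketch assumes a recent Julia release where `String` is the single built-in string type.

```julia
# Hypothetical helper (not part of Base): convert to String, then reject non-ASCII data.
function require_ascii(s::AbstractString)
    str = String(s)   # normalize any AbstractString to String first
    isascii(str) || throw(ArgumentError("non-ASCII data in string: $(repr(str))"))
    return str
end
```

Throwing ArgumentError here mirrors what the thread later says the ascii function throws; branching on isascii beforehand avoids the exception entirely.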
Another question about behavior. String and string don't actually behave in the same manner always:
julia> String(UInt8[97,98,99])
"abc"
julia> string(UInt8[97,98,99])
"UInt8[97,98,99]"
Any thoughts on resolving this? Currently utf8 and ascii behave like String not like string.
Maybe String! if it takes ownership of the array, i.e. is "in-place"?
There is also the bytestring function, whose name seems like a holdover from ByteString, but whose function is essential.
I'm pretty sure that usage of ! does not have the @JeffBezanson seal of approval.
String¡, then.
This case doesn't bother me that much --- unlike utf8 and ascii, string is not an encoding.
The point is, we need some function to replace bytestring(::Vector{UInt8}) and bytestring(::Ptr{UInt8}, [len]) (makes a copy of bytes), but also UTF8String(::Vector{UInt8}) and pointer_to_string (doesn't make a copy).
It makes sense to me to name them all the same thing, with ! for the "in-place" versions. But I don't know what that name should be. Maybe bytestring and bytestring!, where byte refers to the encoding? Or utf8 and utf8! to be more explicit about the encoding, with ascii doing an additional assert?
Can we just keep bytestring?
@JeffBezanson, bytestring is fine if a bit vague, but it always makes a copy. Would you be okay with bytestring! for the non-copying version?
I'm not prepared to make this the _first ever_ case where ! means something other than mutation.
Then what do we call the non-copying version(s) of bytestring?
I've probably missed some of the discussion here, but can String have the same constructors that UTF8String had? Plus one with a Ptr argument?
@JeffBezanson, the problem is that then it can't replace string.
That's fine with me. Somehow we need one function that wraps a UInt8 vector as a string, and another that gives you the output of print as a string. We could rename string to something like sprint, except that's taken.
It's a little weird if String and string do almost all the same things – except if called on a byte array.
Namewise, yes it's confusing, but functionally they don't seem all that similar to me:
julia> String("a", "b", "c")
ERROR: MethodError: no method matching String(::String, ::String, ::String)
Closest candidates are:
String{T}(::Any)
@StefanKarpinski, I thought @JeffBezanson's proposal was that String should _only_ construct a String from a byte array (and maybe from another AbstractString), and that if you want an arbitrary object's string representation you should continue to call string. Hence they will do entirely different things.
Yes.
bytestring still seems to be the odd man out. The proposal is that: string(x...) makes a string representation of x..., String(a) makes a string out of a byte array (in-place), and bytestring(a) makes a string out of a byte array (out-of-place)?
I feel like this has come up in a few other cases (conversion vs. construction). I think it'd be good to separate the two with different methods here:
String(::Vector{UInt8}) => constructs String from byte vector
String(::Ptr{UInt8}, len, copy::Bool=true) => constructs String from pointer + len, making copy by default
# conversions; using `string` or perhaps a slightly more distinct `tostring`
tostring(::Vector{UInt8}) => string representation of the byte vector
#etc.
@stevengj Yes. That arrangement is relatively non-breaking. But it looks like we could perhaps merge bytestring and pointer_to_string?
string isn't really a conversion --- in the sense of convert, at least. An accurate name would be something like print_to_string, which is just too long.
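A small sketch of that reading, assuming a recent Julia release: `string` returns whatever `print` would write for its arguments, which `sprint` spells out explicitly.

```julia
# `string(xs...)` behaves like "print everything into a fresh string".
xs = (1, "a", :b, 2.5)
via_string = string(xs...)          # concatenated printed forms
via_sprint = sprint(print, xs...)   # the same thing, written out via print
```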
> The proposal is that: string(x...) makes a string representation of x..., String(a) makes a string out of a byte array (in-place), and bytestring(a) makes a string out of a byte array (out-of-place)?
In that case, we could just spell bytestring(a) as String(copy(a)).
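A minimal sketch of the two spellings, assuming a recent Julia release (the exact sharing semantics at the time of this thread may have differed): copying first keeps the byte vector usable, while passing the vector directly hands its buffer to the string.

```julia
bytes = UInt8[0x61, 0x62, 0x63]   # the bytes of "abc"

# Out-of-place: copy first, so `bytes` is untouched by the constructor.
copied = String(copy(bytes))
bytes_after_copy = copy(bytes)    # snapshot showing `bytes` is still intact here

# In-place: the string takes over the buffer. Treat `bytes` as consumed
# after this call and do not rely on its contents.
owned = String(bytes)
```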
That's true, and the remaining cases are pointers that can be handled by pointer_to_string (which seems to be undocumented BTW).
What about when you want to convert an AbstractString of some other type to a String. Do we write that as String(s) or convert(String, s) or what? I think that @quinnj is proposing that pointer_to_string be replaced by String(p::Ptr{UInt8}, len=strlen(p); copy::Bool=true).
I actually left copy as a positional argument rather than keyword, since this might be a performance-sensitive function (i.e. CSV, ODBC, SQLite getting data from other C libraries as pointers).
Sure. That seems ok. The next function improvement we need seems to be making keywords faster.
For pointers, I would prefer just having a String(p::Ptr{UInt8}, len=cstrlen(p)) method that makes a copy, analogous to bytestring(p, len) now. bytestring(ptr) is extremely common (probably more common than bytestring(array), because of its utility in calling C functions), so we should try to replace it with something simple and efficient like String(ptr) rather than the slower and more cumbersome String(copy(pointer_to_string(ptr))).
I implemented pointer_to_string for some internal stuff and left it undocumented out of caution (maybe I shouldn't have exported it?), because it is somewhat "unsafe" (analogous to pointer_to_array). Maybe we should leave it as-is, as it parallels pointer_to_array, and decide whether to document it. I don't think we should encourage casual use by including it in the String constructor.
Summary:
string(args...) ⟶ string(args...)
bytestring(array) ⟶ String(copy(array))
bytestring(ptr, [len]) ⟶ String(ptr, [len])
pointer_to_string ⟶ pointer_to_string, analogous to pointer_to_array
I think String(s::AbstractString) should work; no reason to force the user to call convert here.
Ok, it seems like we have a plan here:
String(a::Vector{UInt8}) takes "ownership" of a
String(s::String) is just the identity?
String(s::AbstractString) converts strings of other types to String
String(p::Ptr{UInt8}, len::Integer=strlen(p)) from a byte pointer, copies data
string(x...) stringifies (via print) its arguments and concatenates to a single String
bytestring is deprecated in favor of String, #16453
utf8 is deprecated in favor of String, #16469
ascii remains and converts to String, erroring if the data is non-ASCII, #16396
The only thing I'm wondering is what about a function for concatenating to a specific type of string (not String) and/or concatenating strings of the same type to get a string of that same type? This API doesn't seem to include anything for that. I guess we could define string to support that?
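The non-pointer parts of this plan can be sketched as follows, assuming a recent Julia release; the variable names are illustrative only.

```julia
sub = SubString("hello world", 1, 5)   # an AbstractString that is not a String

converted = String(sub)                # String(s::AbstractString): converts to String
same      = String(converted)          # String(s::String): returns its argument
joined    = string("x = ", 1, ", y = ", 2.5)   # string(x...): print and concatenate
checked   = ascii("plain")             # ascii: returns a String, errors on non-ASCII data
```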
I would really discourage you from adding a copy argument to the String constructor (in which case you would also need an owns argument to fully replicate pointer_to_string); people should _not_ use this feature casually. It is on the same footing, safety-wise, as pointer_to_array, which we carefully segregate into the "unsafe pointer games" section of the manual. We aren't merging pointer_to_array into the Array constructor, after all.
That's probably a good call. There was talk once of doing an Unsafe module; where did that end up? Worth giving a shot?
So no methods of String to construct strings from pointers?
I think we would just have String(p::Ptr{UInt8}, len::Integer=strlen(p)), that would always make a copy and pointer_to_string that didn't make a copy.
> I think we would just have String(p::Ptr{UInt8}, len::Integer=strlen(p)), that would always make a copy and pointer_to_string that didn't make a copy.
If we keep pointer_to_string(), then let's move all pointer-related operations to it. I don't like the idea of mixing these low-level operations with the standard String constructor.
Regarding string(), I'm afraid we'll have to merge it with String() as well for consistency with Symbol. Indeed, with the recent changes, Symbol(1, 2) parallels string(1, 2), not String(1, 2). But then I don't know how to handle String(a::Vector{UInt8}); maybe via convert?
@nalimilan, bytestring(ptr, [len]) (which makes a copy) has been used without problems for years now and is extremely common in ccall-using code; if the bytestring function goes away, it makes sense to keep this functionality easily accessible (e.g.) in the String constructor. pointer_to_string is in a completely different category — like pointer_to_array, it is extremely unsafe if used casually, so it makes sense to segregate it into the "unsafe" section of the manual. The two cases aren't comparable at all.
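To illustrate the copying pointer case: on current Julia releases this operation is spelled `unsafe_string(ptr, [len])` rather than `String(ptr, [len])`. That name does not come from this thread, so read the following as a sketch of the semantics being discussed rather than of the API proposed here. A Julia byte vector stands in for C-owned memory.

```julia
buf = UInt8[0x68, 0x69, 0x00]   # "hi" plus a NUL terminator

# GC.@preserve keeps `buf` alive while its raw pointer is in use.
whole  = GC.@preserve buf unsafe_string(pointer(buf))      # reads up to the NUL, copies
prefix = GC.@preserve buf unsafe_string(pointer(buf), 1)   # explicit length, copies

buf[1] = 0x48   # later changes to the source memory do not affect the copies
```

Because the bytes are copied, the resulting strings stay valid even if the C library frees or overwrites its buffer afterwards, which is what makes this the safe default for ccall-heavy code.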
One option would be to make string(...) -> String(...), and use either bytestring or utf8 to construct a String from Array{UInt8} or Ptr{UInt8}, on the theory that the latter cases require you to explicitly specify an encoding.
> Symbol(1, 2) parallels string(1, 2), not String(1, 2).
It seems a little weird to me that we have this method at all.
> It seems a little weird to me that we have this method at all.
@StefanKarpinski I think the most common use of it is Symbol("s", i) with i an integer (see the diff at https://github.com/JuliaLang/julia/pull/16154). It can easily be replaced with Symbol(string("s", i)), though it's a bit more verbose.
@stevengj If we indeed keep string() and use String only to construct a String from a series of bytes or from another string, then I'm OK with it accepting pointers too.
Or Symbol("$s$i") or Symbol(s*i) or if https://github.com/JuliaLang/julia/issues/9945 happens, you could even write :"$s$i", assuming we decide to support interpolation in that kind of symbol literal.
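A quick sketch comparing these spellings, assuming a recent Julia release and an integer `i`:

```julia
i = 3
direct       = Symbol("s", i)           # multi-argument method, parallels string("s", i)
via_string   = Symbol(string("s", i))   # the more verbose replacement
interpolated = Symbol("s$i")            # string interpolation
```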
Hi @StefanKarpinski! How are these changes going to affect string indexing?
There are some formats (i.e. PDB) where some values are at determined indexes (i.e. chain identifier in the column 22 and residue name in 18:20). Would it be safe to do line[22] in the new String type?
The current work doesn't change anything really – it just essentially renames UTF8String to String and gets rid of ASCIIString.
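Since String keeps the UTF-8 indexing behavior, column-based access like line[22] lines up with character positions only when the data is ASCII. A sketch of a guarded accessor, assuming a recent Julia release; `fixed_field` is a hypothetical helper, not a Base function.

```julia
# Hypothetical helper for fixed-width records: only index by column once the
# line is known to be ASCII, where byte index and character position coincide.
function fixed_field(line::String, cols::UnitRange{Int})
    isascii(line) || throw(ArgumentError("fixed-width record contains non-ASCII data"))
    return line[cols]
end

field = fixed_field("abcdef", 2:4)

# With non-ASCII data, some byte indices fall inside a character.
s = "h\u00e9llo"                       # the second character occupies two bytes
second_is_valid = isvalid(s, 2)        # start of that two-byte character
third_is_valid  = isvalid(s, 3)        # continuation byte, not a character start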
I raised this in #16396 but didn't get a response there, want to make sure we decide something. It's related to a few of the checkboxes.
For invalid ascii (and later, invalid utf-8) do we want to throw a more specific exception type than ArgumentError? We have UnicodeError now, but I suspect that will be significantly refactored when UTF16 and UTF32 types get moved out of base.
UnicodeError is going away, so no, this is just an ArgumentError.
That's also the least change since it's what the ascii function threw previously.
Well we are in the midst of changing a lot of other things here. IMO it would be worth being more granular with the exception types, so the reporting will be more specific. This isn't one that you're likely to want to catch since it would be better to branch on isascii, but you may still want to deal with it when ascii throws deep inside some API you don't have control over.
To be perfectly honest, worrying about more granular exception types with all this other stuff to deal with in this series of changes is more than I care to think about. If you have a coherent plan for string exception types, feel free to write it up.
TODO: Cstring and Cwstring should ensure that string data is NUL terminated (as well as NUL free).
Ref https://github.com/JuliaLang/julia/issues/16499. Also need a way to express conversion of String to non-NUL-terminated UTF-16 data with known length. Perhaps convert(Vector{UInt16}, s)?
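On current Julia releases, conversion to UTF-16 code units with a known length is available through `transcode`. This is not the convert spelling floated above, so read it as a sketch of the requested capability rather than of the API under discussion.

```julia
s = "h\u00e9llo"
units = transcode(UInt16, s)           # Vector{UInt16} of UTF-16 code units
roundtrip = transcode(String, units)   # back to a UTF-8 String
```

The length of `units` is carried by the vector itself, so callers passing it to a length-taking API do not need to depend on a terminator.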
@StefanKarpinski, in previous incarnations, any UInt8 array allocated by Julia was automatically NUL-terminated internally; is this no longer the case? And UTF16String and UTF32String were NUL-terminated, so convert(Cwstring, string) was also.
Which items are slated for 0.5? Through round 3, IIRC?
Yes, that's correct. Tomorrow/Wednesday I need to create a LegacyStrings package, put all the Unicode stuff in it and then merge my PR that removes all of that stuff with deprecations that point at it.
PR is up to remove RepString; it has already been added to LegacyStrings. RevString is used by some Base functions so we might want to leave it for now. Anything else here planned for 0.6?
While you're doing string stuff, it's probably not too hard to just actually do utf-8 reversal on strings instead of using the RevString type – that would advance the highlander agenda just a bit further. (If you feel like it and have some spare time while waiting for type revamp test to run or something.)
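A minimal sketch of eager UTF-8 reversal, assuming a recent Julia release. `reverse_utf8` is a hypothetical name, and this walks characters from the end rather than reproducing whatever Base does internally.

```julia
# Hypothetical helper: build a new String with the characters in reverse order.
function reverse_utf8(s::String)
    io = IOBuffer()
    i = lastindex(s)          # index of the last character (0 for an empty string)
    while i >= 1
        write(io, s[i])       # writes the character's UTF-8 encoding
        i = prevind(s, i)     # step back one character, not one byte
    end
    return String(take!(io))
end
```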
Other than removing RevString everything that's likely to be done is already done.