ASCIIString, UTF8String and ByteString into String, #16058
[x] Compat: add String, https://github.com/JuliaLang/Compat.jl/pull/192
[x] Base: replace utf8 and bytestring with String, #16453, #16469
[x] Base: replace s = ascii(s) with s = String(s); isascii(s) || error(...), #16396
[x] Base: cleanup conversion mess to and from string types, #16470, #16713, #16731
Cwstring, #16975, #16974
[x] Base: remove UTF16String and UTF32String, #16590
[x] figure out LineEdit test change, https://github.com/JuliaLang/julia/pull/16058#discussion_r61192341, fix: https://github.com/JuliaLang/julia/pull/16198
String inner constructor, https://github.com/JuliaLang/julia/pull/16058#discussion_r61228065
String inner constructor, https://github.com/JuliaLang/julia/pull/16058#issuecomment-217059003
jl_is_utf8_string and jl_is_byte_string with jl_is_string, https://github.com/JuliaLang/julia/pull/16058#discussion_r61232464
ascii in uv_getaddrinfo to error on non-ASCII domain names, https://github.com/JuliaLang/julia/pull/16058#discussion_r61252081
ascii in SuiteSparse to error on non-ASCII inputs
String type, https://github.com/JuliaLang/julia/commit/5de52cf9c9343cfcf50be4c7c736290d3f985961#commitcomment-17373792, https://github.com/JuliaLang/julia/pull/16221
readdlm's ignore_invalid_chars option, https://github.com/JuliaLang/julia/pull/16058#discussion_r61190549, e95f5f28292a805997603e4f351da9989066c4f8
readdlm's ignore_invalid_chars option, 969d61b2e6d393aad3328aa137174ac3f99ae65f
connect in docs, doc/manual/interacting-with-julia.rst (with doctests), https://github.com/JuliaLang/julia/pull/16058#discussion_r61525930
Char representation (allow lossless string processing of any data)
RepString (moved to LegacyStrings)
RevString (move to package?)
[ ] Base: merge SubString and String (add offset field to String)
[ ] make prevind("ll", 5) and such errors, https://github.com/JuliaLang/julia/pull/16058#issuecomment-216957933
https://github.com/JuliaLang/julia/pull/16058#discussion_r61231943
isspace implementation, https://github.com/JuliaLang/julia/pull/16058#discussion_r61231737
convert method breaks bootstrap, https://github.com/JuliaLang/julia/pull/16058#discussion_r61231162
takebuf API, https://github.com/JuliaLang/julia/pull/16058#discussion_r61229502, https://github.com/JuliaLang/julia/pull/19088
Great list.
What about windows APIs that use utf-16?
Already taken care of: https://github.com/JuliaLang/julia/pull/15033.
Two more potential rounds:
RepString and RevString
SubString and String
Might not get to those until the next release though.
What about:
I definitely think there's a lot to play with around merging SubString and String, but it certainly feels like 0.6 material.
It was just added, but now that readstring(io) is just String(read(io)), it maybe doesn't even need a separate name forever.
just for my curiosity: Why should this be part of 0.5? The title of the release had something to do with Arrays ... not Strings?
+1 for including as much of this plan as possible into 0.5. Breakage better happen soon.
As regards moving ASCIIString, etc. to a package (round 3, bullet 2), see the discussion in StringEncodings.jl on an implementation for any encoding.
As regards changing Char's underlying representation (round 4, bullet 1), another step would also be to introduce AbstractChar and use it in method signatures to allow e.g. the ASCIIString replacement to implement ASCIIChar <: AbstractChar as a UInt8. This would allow people working with ASCII data to actually enjoy higher performance than before.
EDIT: Finally, I'd like to see a discussion of whether String should be guaranteed to hold only valid UTF-8 data, and of how to handle file paths (special string type or not). But I don't want to derail this already rich thread, so maybe better to open a separate issue for that?
just for my curiosity: Why should this be part of 0.5? The title of the release had something to do with Arrays ... not Strings?
Because I've been working on this for months and it's ready to go. It won't hold up the release anyway.
I somewhat agree: it will not hold up the release of the language. But more syntax changes in 0.5 add to the work of porting packages to 0.5; maybe I'm just wrong that this causes effort ...
And more syntax changes in 0.6 will cause effort for even more packages. And given the growth rate of the package ecosystem...
True – added to the roadmap.
Actually, question here: do people think we should keep ascii and have it convert strings to the standard String type and error if the content is not plain ASCII? It's kind of a useful function to have.
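A minimal sketch of the behavior being asked about, in Julia. The name to_ascii is hypothetical (chosen only so it does not clash with Base.ascii), and throwing ArgumentError is an assumption of the sketch rather than a decision recorded at this point in the thread.

```julia
# Hypothetical helper: convert any string to String, rejecting non-ASCII content.
function to_ascii(s::AbstractString)
    str = String(s)   # normalize to the standard String type
    isascii(str) || throw(ArgumentError("non-ASCII data: $(repr(str))"))
    return str
end

to_ascii(SubString("hello world", 1, 5))   # returns "hello" as a String
```

Callers who expect mixed input can branch on isascii(s) first instead of catching the error.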
Another question about behavior. String and string don't always behave in the same manner:
julia> String(UInt8[97,98,99])
"abc"
julia> string(UInt8[97,98,99])
"UInt8[97,98,99]"
Any thoughts on resolving this? Currently utf8 and ascii behave like String, not like string.
Maybe String! if it takes ownership of the array, i.e. is "in-place"?
There is also the bytestring function, whose name seems like a holdover from ByteString, but whose function is essential.
I'm pretty sure that usage of ! does not have the @JeffBezanson seal of approval.
String¡, then.
This case doesn't bother me that much --- unlike utf8 and ascii, string is not an encoding.
The point is, we need some function to replace bytestring(::Vector{UInt8}) and bytestring(::Ptr{UInt8}, [len]) (makes a copy of bytes), but also UTF8String(::Vector{UInt8}) and pointer_to_string (doesn't make a copy).
It makes sense to me to name them all the same thing, with ! for the "in-place" versions. But I don't know what that name should be. Maybe bytestring and bytestring!, where byte refers to the encoding? Or utf8 and utf8! to be more explicit about the encoding, with ascii doing an additional assert?
Can we just keep bytestring?
@JeffBezanson, bytestring is fine if a bit vague, but it always makes a copy. Would you be okay with bytestring! for the non-copying version?
I'm not prepared to make this the _first ever_ case where ! means something other than mutation.
Then what do we call the non-copying version(s) of bytestring?
I've probably missed some of the discussion here, but can String have the same constructors that UTF8String had? Plus one with a Ptr argument?
@JeffBezanson, the problem is that then it can't replace string.
That's fine with me. Somehow we need one function that wraps a UInt8 vector as a string, and another that gives you the output of print as a string. We could rename string to something like sprint, except that's taken.
It's a little weird if String and string do almost all the same things, except if called on a byte array.
Namewise, yes it's confusing, but functionally they don't seem all that similar to me:
julia> String("a", "b", "c")
ERROR: MethodError: no method matching String(::String, ::String, ::String)
Closest candidates are:
String{T}(::Any)
@StefanKarpinski, I thought @JeffBezanson's proposal was that String should _only_ construct a String from a byte array (and maybe from another AbstractString), and that if you want an arbitrary object's string representation you should continue to call string. Hence they will do entirely different things.
Yes.
bytestring still seems to be the odd man out. The proposal is that: string(x...) makes a string representation of x..., String(a) makes a string out of a byte array (in-place), and bytestring(a) makes a string out of a byte array (out-of-place)?
I feel like this has come up in a few other cases (conversion vs. construction). I think it'd be good to separate the two with different methods here:
String(::Vector{UInt8}) => constructs String from byte vector
String(::Ptr{UInt8}, len, copy::Bool=true) => constructs String from pointer + len, making copy by default
# conversions; using `string` or perhaps a slightly more distinct `tostring`
tostring(::Vector{UInt8}) => string representation of the byte vector
# etc.
@stevengj Yes. That arrangement is relatively non-breaking. But it looks like we could perhaps merge bytestring and pointer_to_string?
string isn't really a conversion --- in the sense of convert, at least. An accurate name would be something like print_to_string, which is just too long.
The proposal is that: string(x...) makes a string representation of x..., String(a) makes a string out of a byte array (in-place), and bytestring(a) makes a string out of a byte array (out-of-place)?
In that case, we could just spell bytestring(a) as String(copy(a)).
That's true, and the remaining cases are pointers that can be handled by pointer_to_string (which seems to be undocumented BTW).
What about when you want to convert an AbstractString of some other type to a String? Do we write that as String(s) or convert(String, s) or what? I think that @quinnj is proposing that pointer_to_string be replaced by String(p::Ptr{UInt8}, len=strlen(p); copy::Bool=true).
I actually left copy as a positional argument rather than a keyword, since this might be a performance-sensitive function (i.e. CSV, ODBC, SQLite getting data from other C libraries as pointers).
Sure. That seems ok. The next function improvement we need seems to be making keywords faster.
For pointers, I would prefer just having a String(p::Ptr{UInt8}, len=cstrlen(p)) method that makes a copy, analogous to bytestring(p, len) now. bytestring(ptr) is extremely common (probably more common than bytestring(array)) because of its utility in calling C functions, so we should try to replace it with something simple and efficient like String(ptr) rather than the slower and more cumbersome String(copy(pointer_to_string(ptr))).
I implemented pointer_to_string for some internal stuff and left it undocumented out of caution (maybe I shouldn't have exported it?), because it is somewhat "unsafe" (analogous to pointer_to_array). Maybe we should leave it as-is, as it parallels pointer_to_array, and decide whether to document it. I don't think we should encourage casual use by including it in the String constructor.
Summary:
string(args...) ⟶ string(args...)
bytestring(array) ⟶ String(copy(array))
bytestring(ptr, [len]) ⟶ String(ptr, [len])
pointer_to_string ⟶ pointer_to_string, analogous to pointer_to_array
I think String(s::AbstractString) should work; no reason to force the user to call convert here.
Ok, it seems like we have a plan here:
String(a::Vector{UInt8}) takes "ownership" of a
String(s::String) is just the identity?
String(s::AbstractString) converts strings of other types to String
String(p::Ptr{UInt8}, len::Integer=strlen(p)) from a byte pointer, copies data
string(x...) stringifies (via print) its arguments and concatenates to a single String
bytestring is deprecated in favor of String, #16453
utf8 is deprecated in favor of String, #16469
ascii remains and converts to String, erroring if the data is non-ASCII, #16396
The only thing I'm wondering is what about a function for concatenating to a specific type of string (not String) and/or concatenating strings of the same type to get a string of that same type? This API doesn't seem to include anything for that. I guess we could define string to support that?
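The plan above can be sketched roughly as follows. This is written against released Julia 1.x rather than the in-development API under discussion, so one spelling is an assumption relative to the thread: the copying pointer constructor is written unsafe_string(ptr) there, not String(ptr).

```julia
a = UInt8[0x61, 0x62, 0x63]

s1 = String(copy(a))    # out-of-place: `a` is untouched (the old bytestring(a))
s2 = String(a)          # takes ownership of the bytes; do not reuse `a` afterwards

s3 = String(SubString("hello", 1, 3))   # converts another AbstractString to String
s4 = string(1, "a", 2.5)                # prints each argument and concatenates

# Copy NUL-terminated bytes from a pointer; GC.@preserve keeps `buf` alive
# while the raw pointer is in use.
buf = UInt8[0x68, 0x69, 0x00]
s5 = GC.@preserve buf unsafe_string(pointer(buf))
```

The separation matches the discussion: String wraps or converts string data, while string builds a printed representation of arbitrary arguments.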
I would really discourage you from adding a copy argument to the String constructor (in which case you would also need an owns argument to fully replicate pointer_to_string); people should _not_ use this feature casually. It is on the same footing, safety-wise, as pointer_to_array, which we carefully segregate into the "unsafe pointer games" section of the manual. We aren't merging pointer_to_array into the Array constructor, after all.
That's probably a good call. There was talk once of doing an Unsafe module; where did that end up? Worth giving a shot?
So no methods of String to construct strings from pointers?
I think we would just have String(p::Ptr{UInt8}, len::Integer=strlen(p)), that would always make a copy and pointer_to_string that didn't make a copy.
I think we would just have String(p::Ptr{UInt8}, len::Integer=strlen(p)), that would always make a copy and pointer_to_string that didn't make a copy.
If we keep pointer_to_string(), then let's move all pointer-related operations to it. I don't like the idea of mixing these low-level operations with the standard String constructor.
Regarding string(), I'm afraid we'll have to merge it with String() as well for consistency with Symbol. Indeed, with the recent changes, Symbol(1, 2) parallels string(1, 2), not String(1, 2). But then I don't know how to handle String(a::Vector{UInt8}); maybe via convert?
@nalimilan, bytestring(ptr, [len]) (which makes a copy) has been used without problems for years now and is extremely common in ccall-using code; if the bytestring function goes away, it makes sense to keep this functionality easily accessible, e.g. in the String constructor. pointer_to_string is in a completely different category: like pointer_to_array, it is extremely unsafe if used casually, so it makes sense to segregate it into the "unsafe" section of the manual. The two cases aren't comparable at all.
One option would be to make string(...) -> String(...), and use either bytestring or utf8 to construct a String from Array{UInt8} or Ptr{UInt8}, on the theory that the latter cases require you to explicitly specify an encoding.
Symbol(1, 2) parallels string(1, 2), not String(1, 2).
It seems a little weird to me that we have this method at all.
@StefanKarpinski I think the most common use of it is Symbol("s", i) with i an integer (see the diff at https://github.com/JuliaLang/julia/pull/16154). It can easily be replaced with Symbol(string("s", i)), though it's a bit more verbose.
@stevengj If we indeed keep string() and use String only to construct a String from a series of bytes or from another string, then I'm OK with it accepting pointers too.
Or Symbol("$s$i") or Symbol(s*i) or, if https://github.com/JuliaLang/julia/issues/9945 happens, you could even write :"$s$i", assuming we decide to support interpolation in that kind of symbol literal.
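For comparison, the spellings mentioned in the last few comments produce the same symbol. A small sketch, with s and i as placeholder values; note that Symbol(s*i) is left out because * only concatenates when i is itself a string.

```julia
s, i = "s", 3

sym1 = Symbol(s, i)            # multi-argument form discussed above
sym2 = Symbol(string(s, i))    # explicit stringification first
sym3 = Symbol("$s$i")          # string interpolation
```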
Hi @StefanKarpinski! How are these changes going to affect string indexing?
There are some formats (e.g. PDB) where some values are at fixed indexes (e.g. the chain identifier in column 22 and the residue name in 18:20). Would it be safe to do line[22] with the new String type?
The current work doesn't change anything really: it just essentially renames UTF8String to String and gets rid of ASCIIString.
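To make the fixed-column case concrete, here is a small sketch. It assumes released Julia 1.x semantics, where String indices are byte offsets, and the record is a hypothetical one built programmatically so that the columns from the question (18:20 and 22) line up.

```julia
# Hypothetical fixed-width record: residue name in columns 18:20, chain ID in column 22.
line = "ATOM" * " "^6 * "1" * " "^2 * "N" * " "^3 * "ALA" * " " * "A"

if isascii(line)
    # For pure-ASCII data, byte index and character position coincide.
    resname = line[18:20]
    chain   = line[22]
end

# With non-ASCII data, not every index starts a character.
t = "αβγ"
isvalid(t, 2)        # false: index 2 falls inside the two-byte encoding of 'α'
collect(t)[2]        # 'β': index by character after collecting
```

Checking isascii(line) once per record is a cheap way to confirm that column arithmetic is valid before indexing.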
I raised this in #16396 but didn't get a response there, want to make sure we decide something. It's related to a few of the checkboxes.
For invalid ascii (and later, invalid utf-8) do we want to throw a more specific exception type than ArgumentError? We have UnicodeError now, but I suspect that will be significantly refactored when UTF16 and UTF32 types get moved out of base.
UnicodeError is going away, so no, this is just an ArgumentError.
That's also the least change since it's what the ascii function threw previously.
Well, we are in the midst of changing a lot of other things here. IMO it would be worth being more granular with the exception types, so the reporting will be more specific. This isn't one that you're likely to want to catch since it would be better to branch on isascii, but you may still want to deal with it when ascii throws deep inside some API you don't have control over.
To be perfectly honest, worrying about more granular exception types with all this other stuff to deal with in this series of changes is more than I care to think about. If you have a coherent plan for string exception types, feel free to write it up.
TODO: Cstring and Cwstring should ensure that string data is NUL terminated (as well as NUL free).
Ref https://github.com/JuliaLang/julia/issues/16499. Also need a way to express conversion of String to non-NUL-terminated UTF-16 data with known length. Perhaps convert(Vector{UInt16}, s)?
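A rough sketch of both points, assuming released Julia 1.x and a C library that exposes strlen to ccall. The UTF-16 conversion is spelled transcode(UInt16, s) in those versions, which is an assumption relative to the convert(Vector{UInt16}, s) spelling floated above.

```julia
# Passing a String as Cstring rejects embedded NULs before the C call happens.
n = ccall(:strlen, Csize_t, (Cstring,), "abc")

rejected = try
    ccall(:strlen, Csize_t, (Cstring,), "a\0b")
    false
catch err
    err isa ArgumentError
end

# UTF-16 code units with a known length and no NUL terminator.
units = transcode(UInt16, "héllo")
```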
@StefanKarpinski, in previous incarnations, any UInt8 array allocated by Julia was automatically NUL-terminated internally; is this no longer the case? And UTF16String and UTF32String were NUL-terminated, so convert(Cwstring, string) was also.
Which items are slated for 0.5? Through round 3, IIRC?
Yes, that's correct. Tomorrow/Wednesday I need to create a LegacyStrings package, put all the Unicode stuff in it and then merge my PR that removes all of that stuff with deprecations that point at it.
PR is up to remove RepString; it has already been added to LegacyStrings. RevString is used by some Base functions so we might want to leave it for now. Anything else here planned for 0.6?
While you're doing string stuff, it's probably not too hard to just actually do UTF-8 reversal on strings instead of using the RevString type; that would advance the highlander agenda just a bit further. (If you feel like it and have some spare time while waiting for the type revamp tests to run or something.)
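As an illustration of what eager reversal means here (producing a new string instead of a lazy RevString wrapper), a minimal sketch follows. The name utf8_reverse is hypothetical and this is not how Base implements reverse; it only shows that reversing by characters keeps multi-byte sequences intact.

```julia
# Hypothetical helper: reverse by characters, not bytes, so multi-byte
# UTF-8 sequences are never split.
utf8_reverse(s::AbstractString) = join(reverse(collect(s)))

utf8_reverse("αβγ")   # "γβα"
```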
Other than removing RevString, everything that's likely to be done is already done.