Julia: Stringapalooza

Created on 29 Apr 2016 · 64 Comments · Source: JuliaLang/julia

0.5 Major tasks

Round 1

[x] Base: merge ASCIIString, UTF8String and ByteString into String, #16058
[x] Compat: add String, https://github.com/JuliaLang/Compat.jl/pull/192

Round 2
[x] Base: replace utf8, bytestring ~~and string~~ with String, #16453, #16469
[x] Base: replace s = ascii(s) with s = String(s); isascii(s) || error(...), #16396

Round 3
[x] Base: cleanup conversion mess to and from string types #16470, #16713, #16731
[x] Base: provide a way to interact with Windows APIs, via Cwstring #16975, #16974
[x] make package with ASCII, Latin-1, UTF-8, UTF-16 and UTF-32 string types (LegacyStrings).
[x] Base: remove UTF16String and UTF32String #16590

Cleanup tasks
[x] figure out LineEdit test change, https://github.com/JuliaLang/julia/pull/16058#discussion_r61192341, fix: https://github.com/JuliaLang/julia/pull/16198
[x] delete redundant String inner constructor, https://github.com/JuliaLang/julia/pull/16058#discussion_r61228065
[x] undelete "redundant" String inner constructor, https://github.com/JuliaLang/julia/pull/16058#issuecomment-217059003
[x] replace jl_is_utf8_string and jl_is_byte_string with jl_is_string, https://github.com/JuliaLang/julia/pull/16058#discussion_r61232464
[x] figure out code coverage drop, https://github.com/JuliaLang/julia/pull/16058#issuecomment-217042928
[x] use ascii in uv_getaddrinfo to error on non-ASCII domain names, https://github.com/JuliaLang/julia/pull/16058#discussion_r61252081
[x] use ascii in SuiteSparse to error on non-ASCII inputs, https://github.com/JuliaLang/julia/pull/16058#discussion_r61252299
[x] remove special case printing for String type, https://github.com/JuliaLang/julia/commit/5de52cf9c9343cfcf50be4c7c736290d3f985961#commitcomment-17373792, https://github.com/JuliaLang/julia/pull/16221
[x] make sure tests pass with inlining off, https://github.com/JuliaLang/julia/pull/16058#issuecomment-217181576
[x] add deprecation for readdlm's ignore_invalid_chars option, https://github.com/JuliaLang/julia/pull/16058#discussion_r61190549, e95f5f28292a805997603e4f351da9989066c4f8
[x] remove documentation for readdlm's ignore_invalid_chars option 969d61b2e6d393aad3328aa137174ac3f99ae65f
[x] NEWS.md entries
[x] fix signature of connect in docs,
[x] update doc/manual/interacting-with-julia.rst (with doctests), https://github.com/JuliaLang/julia/pull/16058#discussion_r61525930
[x] https://github.com/JuliaArchive/LegacyStrings.jl/issues/4

0.6 Major tasks

Round 4

[ ] Base: change Char representation (allow lossless string processing of any data)
[x] Base: remove RepString (moved to LegacyStrings)
[ ] Base: remove RevString (move to package?)
[ ] Base: merge SubString and String (add offset field to String)

Cleanup tasks
[ ] make prevind("ll", 5) and such errors, https://github.com/JuliaLang/julia/pull/16058#issuecomment-216957933, https://github.com/JuliaLang/julia/pull/16058#discussion_r61231943
[ ] improve isspace implementation, https://github.com/JuliaLang/julia/pull/16058#discussion_r61231737
[ ] figure out why removing seemingly redundant convert method breaks bootstrap, https://github.com/JuliaLang/julia/pull/16058#discussion_r61231162
[x] simplify takebuf API, https://github.com/JuliaLang/julia/pull/16058#discussion_r61229502, https://github.com/JuliaLang/julia/pull/19088
[x] move docstrings inline

Labels: excision, needs docs, strings, unicode

StefanKarpinski

👍14

All 64 comments

Great list.

What about Windows APIs that use UTF-16?

JeffBezanson on 29 Apr 2016

> What about Windows APIs that use UTF-16?

Already taken care of: https://github.com/JuliaLang/julia/pull/15033.

Two more potential rounds:

removing RepString and RevString
merging SubString and String

Might not get to those until the next release though.

StefanKarpinski on 29 Apr 2016

👍1

What about:

String iteration: making this faster, based on your experiments here in the past
Decide on String indexing: #9297

I definitely think there's a lot to play with around merging SubString and String, but it certainly feels like 0.6 material.

quinnj on 29 Apr 2016

It was only just added, but now that readstring(io) is just String(read(io)), it may not need a separate name forever.

tkelman on 29 Apr 2016

👍1

Just out of curiosity: why should this be part of 0.5? The title of the release had something to do with Arrays, not Strings.

lobingera on 29 Apr 2016

❤1

+1 for including as much of this plan as possible in 0.5. Breakage had better happen soon.

As regards moving ASCIIString, etc. to a package (round 3, bullet 2), see discussion on an implementation for any encoding in StringEncodings.jl here.

As regards changing Char's underlying representation (round 4, bullet 1), another step would be to introduce AbstractChar and use it in method signatures, so that e.g. the ASCIIString replacement could implement ASCIIChar <: AbstractChar as a UInt8. This would allow people working with ASCII data to actually enjoy higher performance than before.

EDIT: Finally, I'd like to see a discussion of whether String should be guaranteed to hold only valid UTF-8 data, and of how to handle file paths (special string type or not). But I don't want to derail this already rich thread, so maybe better to open a separate issue for that?

nalimilan on 29 Apr 2016

> Just out of curiosity: why should this be part of 0.5? The title of the release had something to do with Arrays, not Strings.

Because I've been working on this for months and it's ready to go. It won't hold up the release anyway.

StefanKarpinski on 29 Apr 2016

I somewhat agree; it will not hold up the release of the language. But more syntax changes in 0.5 make porting packages to 0.5 harder; maybe I'm just wrong that this causes effort...

lobingera on 29 Apr 2016

> I somewhat agree; it will not hold up the release of the language. But more syntax changes in 0.5 make porting packages to 0.5 harder; maybe I'm just wrong that this causes effort...

And more syntax changes in 0.6 will cause effort for even more packages. And given the growth rate of the package ecosystem...

nalimilan on 29 Apr 2016

#15033 didn't fully provide a path for external packages to call Windows APIs without calling internal functions.

stevengj on 29 Apr 2016

True – added to the roadmap.

StefanKarpinski on 29 Apr 2016

Actually, question here: do people think we should keep ascii and have it convert strings to standard String type and error if the content is not plain ASCII? It's kind of a useful function to have.

StefanKarpinski on 6 May 2016

👍3

Another question about behavior. String and string don't always behave the same way:

julia> String(UInt8[97,98,99])
"abc"

julia> string(UInt8[97,98,99])
"UInt8[97,98,99]"

Any thoughts on resolving this? Currently utf8 and ascii behave like String not like string.

StefanKarpinski on 6 May 2016

Maybe String! if it takes ownership of the array, i.e. is "in-place"?

There is also the bytestring function, whose name seems like a holdover from ByteString, but whose function is essential.

stevengj on 6 May 2016

I'm pretty sure that usage of ! does not have the @JeffBezanson seal of approval.

StefanKarpinski on 6 May 2016

String¡, then.

stevengj on 6 May 2016

😄1

This case doesn't bother me that much --- unlike utf8 and ascii, string is not an encoding.

JeffBezanson on 6 May 2016

The point is, we need some function to replace bytestring(::Vector{UInt8}) and bytestring(::Ptr{UInt8}, [len]) (makes a copy of bytes), but also UTF8String(::Vector{UInt8}) and pointer_to_string (doesn't make a copy).

It makes sense to me to name them all the same thing, with ! for the "in-place" versions. But I don't know what that name should be. Maybe bytestring and bytestring!, where byte refers to the encoding? Or utf8 and utf8! to be more explicit about the encoding, with ascii doing an additional assert?

stevengj on 6 May 2016

Can we just keep bytestring?

JeffBezanson on 6 May 2016

@JeffBezanson, bytestring is fine if a bit vague, but it always makes a copy. Would you be okay with bytestring! for the non-copying version?

stevengj on 6 May 2016

I'm not prepared to make this the _first ever_ case where ! means something other than mutation.

JeffBezanson on 6 May 2016

Then what do we call the non-copying version(s) of bytestring?

stevengj on 6 May 2016

I've probably missed some of the discussion here, but can String have the same constructors that UTF8String had? Plus one with a Ptr argument?

JeffBezanson on 6 May 2016

@JeffBezanson, the problem is that then it can't replace string.

stevengj on 6 May 2016

That's fine with me. Somehow we need one function that wraps a UInt8 vector as a string, and another that gives you the output of print as a string. We could rename string to something like sprint, except that's taken.

JeffBezanson on 6 May 2016

It's a little weird if String and string do almost all the same things – except if called on a byte array.

StefanKarpinski on 6 May 2016

Namewise, yes it's confusing, but functionally they don't seem all that similar to me:

julia> String("a", "b", "c")
ERROR: MethodError: no method matching String(::String, ::String, ::String)
Closest candidates are:
  String{T}(::Any)

JeffBezanson on 6 May 2016

@StefanKarpinski, I thought @JeffBezanson's proposal was that String should _only_ construct a String from a byte array (and maybe from another AbstractString), and that if you want an arbitrary object's string representation you should continue to call string. Hence they will do entirely different things.

stevengj on 6 May 2016

Yes.

JeffBezanson on 6 May 2016

bytestring still seems to be the odd man out. The proposal is that: string(x...) makes a string representation of x..., String(a) makes a string out of a byte array (in-place), and bytestring(a) makes a string out of a byte array (out-of-place)?

stevengj on 6 May 2016

I feel like this has come up in a few other cases (conversion vs. construction). I think it'd be good to separate the two with different methods here:

String(::Vector{UInt8}) => constructs String from byte vector
String(::Ptr{UInt8}, len, copy::Bool=true) => constructs String from pointer + len, making copy by default

# conversions; using `string` or perhaps a slightly more distinct `tostring`
tostring(::Vector{UInt8}) => string representation of the byte vector
#etc.

quinnj on 6 May 2016

@stevengj Yes. That arrangement is relatively non-breaking. But it looks like we could perhaps merge bytestring and pointer_to_string?

string isn't really a conversion --- in the sense of convert, at least. An accurate name would be something like print_to_string, which is just too long.

JeffBezanson on 6 May 2016

> The proposal is that: string(x...) makes a string representation of x..., String(a) makes a string out of a byte array (in-place), and bytestring(a) makes a string out of a byte array (out-of-place)?

In that case, we could just spell bytestring(a) as String(copy(a)).

StefanKarpinski on 6 May 2016

That's true, and the remaining cases are pointers that can be handled by pointer_to_string (which seems to be undocumented BTW).

JeffBezanson on 6 May 2016

What about when you want to convert an AbstractString of some other type to a String. Do we write that as String(s) or convert(String, s) or what? I think that @quinnj is proposing that pointer_to_string be replaced by String(p::Ptr{UInt8}, len=strlen(p); copy::Bool=true).

StefanKarpinski on 6 May 2016

I actually left copy as a positional argument rather than keyword, since this might be a performance-sensitive function (i.e. CSV, ODBC, SQLite getting data from other C libraries as pointers).

quinnj on 6 May 2016

Sure. That seems ok. The next function improvement we need seems to be making keywords faster.

StefanKarpinski on 6 May 2016

For pointers, I would prefer just having a String(p::Ptr{UInt8}, len=cstrlen(p)) method that makes a copy, analogous to bytestring(p, len) now. bytestring(ptr) is extremely common (probably more common than bytestring(array), because of its utility in calling C functions), so we should try to replace it with something simple and efficient like String(ptr) rather than the slower and more cumbersome String(copy(pointer_to_string(ptr))).

I implemented pointer_to_string for some internal stuff and left it undocumented out of caution (maybe I shouldn't have exported it?), because it is somewhat "unsafe" (analogous to pointer_to_array). Maybe we should leave it as-is, as it parallels pointer_to_array, and decide whether to document it. I don't think we should encourage casual use by including it in the String constructor.

Summary:

string(args...) ⟶ string(args...)
bytestring(array) ⟶ String(copy(array))
bytestring(ptr, [len]) ⟶ String(ptr, [len])
pointer_to_string ⟶ pointer_to_string, analogous to pointer_to_array

stevengj on 6 May 2016

I think String(s::AbstractString) should work; no reason to force the user to call convert here.

stevengj on 6 May 2016

Ok, it seems like we have a plan here:

[x] String(a::Vector{UInt8}) takes "ownership" of a
[x] String(s::String) is just the identity?
[x] String(s::AbstractString) converts strings of other types to String
[x] String(p::Ptr{UInt8}, len::Integer=strlen(p)) from a byte pointer, copies data
[x] string(x...) stringifies (via print) its arguments and concatenates to a single String
[x] bytestring is deprecated in favor of String, #16453
[x] utf8 is deprecated in favor of String, #16469
[x] ascii remains and converts to String, erroring if the data is non-ASCII, #16396

The only thing I'm wondering about is a function for concatenating to a specific type of string (not String) and/or concatenating strings of the same type to get a string of that same type. This API doesn't seem to include anything for that. I guess we could define string to support that?

StefanKarpinski on 6 May 2016
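
The plan above can be sketched roughly as follows. This is only an illustration and it assumes a current Julia 1.x release rather than the 0.5 development branch discussed in this thread. In particular, the pointer case uses `unsafe_string` and `GC.@preserve`, which is an assumption about where that functionality lives today and not something settled in the discussion so far.

```julia
# Sketch only: behavior shown is for Julia 1.x, not the 0.5-dev branch.

bytes = UInt8[0x61, 0x62, 0x63]

# Copy first if the byte vector is still needed afterwards, because
# String(::Vector{UInt8}) may take over the vector's memory.
s1 = String(copy(bytes))

# Converting another AbstractString type to String.
sub = SubString("hello world", 1, 5)
s2 = String(sub)

# string(x...) prints its arguments and concatenates the results.
s3 = string("x = ", 1, ", y = ", 2.5)

# ascii returns a String and throws on non-ASCII data.
s4 = ascii("abc")
threw = try
    ascii("caf\u00e9")
    false
catch err
    err isa ArgumentError
end

# Copying a string out of a raw pointer (the role bytestring(ptr, len) played).
src = "pointer data"
s5 = GC.@preserve src unsafe_string(pointer(src), 7)
```

The `copy` in the first step is the `String(copy(a))` spelling of `bytestring(a)` suggested earlier in the thread.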

I would really discourage you from adding a copy argument to the String constructor (in which case you would also need an owns argument to fully replicate pointer_to_string); people should _not_ use this feature casually. It is on the same footing, safety-wise, as pointer_to_array, which we carefully segregate into the "unsafe pointer games" section of the manual. We aren't merging pointer_to_array into the Array constructor, after all.

stevengj on 6 May 2016

👍2

That's probably a good call. There was talk once of doing an Unsafe module; where did that end up? Worth giving a shot?

quinnj on 6 May 2016

So no methods of String to construct strings from pointers?

StefanKarpinski on 6 May 2016

I think we would just have String(p::Ptr{UInt8}, len::Integer=strlen(p)), that would always make a copy and pointer_to_string that didn't make a copy.

quinnj on 6 May 2016

👍1

> I think we would just have String(p::Ptr{UInt8}, len::Integer=strlen(p)), that would always make a copy and pointer_to_string that didn't make a copy.

If we keep pointer_to_string(), then let's move all pointer-related operations to it. I don't like the idea of mixing these low-level operations with the standard String constructor.

Regarding string(), I'm afraid we'll have to merge it with String() as well for consistency with Symbol. Indeed, with the recent changes, Symbol(1, 2) parallels string(1, 2), not String(1, 2). But then I don't know how to handle String(a::Vector{UInt8}); maybe via convert?

nalimilan on 8 May 2016

@nalimilan, bytestring(ptr, [len]) (which makes a copy) has been used without problems for years now and is extremely common in ccall-using code; if the bytestring function goes away, it makes sense to keep this functionality easily accessible (e.g.) in the String constructor. pointer_to_string is in a completely different category — like pointer_to_array, it is extremely unsafe if used casually, so it makes sense to segregate it into the "unsafe" section of the manual. The two cases aren't comparable at all.

One option would be to make string(...) -> String(...), and use either bytestring or utf8 to construct a String from Array{UInt8} or Ptr{UInt8}, on the theory that the latter cases require you to explicitly specify an encoding.

stevengj on 9 May 2016

> Symbol(1, 2) parallels string(1, 2), not String(1, 2).

It seems a little weird to me that we have this method at all.

StefanKarpinski on 9 May 2016

> It seems a little weird to me that we have this method at all.

@StefanKarpinski I think the most common use of it is Symbol("s", i) with i an integer (see the diff at https://github.com/JuliaLang/julia/pull/16154). It can easily be replaced with Symbol(string("s", i)), though it's a bit more verbose.

@stevengj If we indeed keep string() and use String only to construct a String from a series of bytes or from another string, then I'm OK with it accepting pointers too.

nalimilan on 9 May 2016

Or Symbol("$s$i") or Symbol(s*i) or if https://github.com/JuliaLang/julia/issues/9945 happens, you could even write :"$s$i", assuming we decide to support interpolation in that kind of symbol literal.

StefanKarpinski on 9 May 2016
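
For reference, the spellings mentioned above can be compared side by side. This is a small sketch assuming Julia 1.x; the variable name `i` and the prefix `"s"` are just illustrative. Note that `*` only concatenates strings, so with an integer `i` the concatenation form needs an explicit `string(i)`.

```julia
i = 3

a = Symbol("s", i)            # multi-argument Symbol, parallels string("s", i)
b = Symbol(string("s", i))    # explicit stringification first
c = Symbol("s$i")             # string interpolation
d = Symbol("s" * string(i))   # concatenation; "s" * i would be a MethodError
```

All four produce the same symbol, so the multi-argument method is a convenience rather than a necessity.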

👍2

Hi @StefanKarpinski! How are these changes going to affect string indexing?
There are some formats (e.g. PDB) where some values are at fixed indexes (e.g. the chain identifier in column 22 and the residue name in columns 18:20). Would it be safe to do line[22] in the new String type?

diegozea on 10 May 2016

The current work doesn't change anything really – it just essentially renames UTF8String to String and gets rid of ASCIIString.

StefanKarpinski on 10 May 2016
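
To make the fixed-column question concrete, here is a small sketch, assuming Julia 1.x indexing semantics (indices are byte offsets into UTF-8 data). The sample record is made up for illustration and is not taken from a real PDB file.

```julia
# A made-up, ASCII-only PDB-style record, assembled field by field so the
# column positions are explicit.
line = string(rpad("ATOM", 6),   # columns 1-6: record name
              lpad("1", 5),      # columns 7-11: atom serial number
              " ",               # column 12
              rpad(" N", 4),     # columns 13-16: atom name
              " ",               # column 17
              "MET",             # columns 18-20: residue name
              " ",               # column 21
              "A")               # column 22: chain identifier

# For ASCII-only data, byte index and character position coincide.
if isascii(line)
    chain = line[22]
    residue = line[18:20]
end

# With non-ASCII data they can differ: the second "word" character below
# after "na" occupies two bytes.
word = "na\u00efve"
fourth_char = collect(word)[4]          # character-based access
byte4_is_char_start = isvalid(word, 4)  # byte 4 is inside a character
```

Checking `isascii(line)` once per record is one way to keep plain `line[22]` indexing safe for formats that are specified as ASCII.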

I raised this in #16396 but didn't get a response there, want to make sure we decide something. It's related to a few of the checkboxes.

For invalid ASCII (and later, invalid UTF-8) do we want to throw a more specific exception type than ArgumentError? We have UnicodeError now, but I suspect that will be significantly refactored when the UTF16 and UTF32 types get moved out of Base.

tkelman on 18 May 2016

UnicodeError is going away, so no, this is just an ArgumentError.

StefanKarpinski on 18 May 2016

That's also the least change since it's what the ascii function threw previously.

StefanKarpinski on 18 May 2016

Well, we are in the midst of changing a lot of other things here. IMO it would be worth being more granular with the exception types, so the reporting will be more specific. This isn't one that you're likely to want to catch, since it would be better to branch on isascii, but you may still want to deal with it when ascii throws deep inside some API you don't have control over.

tkelman on 18 May 2016
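
The two approaches mentioned above (branching on isascii versus catching what ascii throws) might look like this. It is a sketch assuming Julia 1.x, where ascii throws an ArgumentError; the helper names `ascii_or_nothing` and `describe_failure` are hypothetical and exist only for this example.

```julia
# Hypothetical helper: returns a String for ASCII input, `nothing` otherwise.
ascii_or_nothing(s::AbstractString) = isascii(s) ? String(s) : nothing

# Hypothetical wrapper around a call that may use ascii() internally.
function describe_failure(f)
    try
        f()
        return "ok"
    catch err
        # Only handle the error type ascii is known to throw; pass on the rest.
        err isa ArgumentError || rethrow()
        return "invalid input: " * sprint(showerror, err)
    end
end

ok = ascii_or_nothing("plain")
bad = ascii_or_nothing("na\u00efve")
msg = describe_failure(() -> ascii("na\u00efve"))
```

Because ArgumentError is a broad type, the catch branch cannot distinguish an ASCII failure from any other bad argument, which is the granularity concern raised here.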

To be perfectly honest, worrying about more granular exception types with all this other stuff to deal with in this series of changes is more than I care to think about. If you have a coherent plan for string exception types, feel free to write it up.

StefanKarpinski on 18 May 2016

TODO: Cstring and Cwstring should ensure that string data is NUL terminated (as well as NUL free).

StefanKarpinski on 2 Jun 2016

Ref https://github.com/JuliaLang/julia/issues/16499. Also need a way to express conversion of String to non-NUL-terminated UTF-16 data with known length. Perhaps convert(Vector{UInt16}, s)?

StefanKarpinski on 2 Jun 2016
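
As an illustration of the two needs above, the sketch below shows the NUL check on conversion to Cstring and a conversion to UTF-16 code units of known length. It assumes Julia 1.x: `transcode` and the documented `Base.cconvert`/`Base.unsafe_convert` pair are what current releases offer, not the `convert(Vector{UInt16}, s)` spelling floated here.

```julia
# Embedded NULs are rejected when a String is converted for a Cstring argument.
rejected = try
    Base.unsafe_convert(Cstring, Base.cconvert(Cstring, "a\0b"))
    false
catch err
    err isa ArgumentError
end

# UTF-16 code units with a known length and no trailing NUL.
utf16 = transcode(UInt16, "h\u00e9llo")

# Converting the code units back to a String.
roundtrip = transcode(String, utf16)
```

The resulting `Vector{UInt16}` carries its own length, so the NUL terminator is not needed to find the end of the data.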

@StefanKarpinski, in previous incarnations, any UInt8 array allocated by Julia was automatically NUL-terminated internally; is this no longer the case? And UTF16String and UTF32String were NUL-terminated, so convert(Cwstring, string) was also.

stevengj on 2 Jun 2016

Which items are slated for 0.5? Through round 3, IIRC?

JeffBezanson on 28 Jun 2016

Yes, that's correct. Tomorrow/Wednesday I need to create a LegacyStrings package, put all the Unicode stuff in it and then merge my PR that removes all of that stuff with deprecations that point at it.

StefanKarpinski on 28 Jun 2016

PR is up to remove RepString; it has already been added to LegacyStrings. RevString is used by some Base functions so we might want to leave it for now. Anything else here planned for 0.6?

JeffBezanson on 4 Jan 2017

While you're doing string stuff, it's probably not too hard to just actually do UTF-8 reversal on strings instead of using the RevString type – that would advance the highlander agenda just a bit further. (If you feel like it and have some spare time while waiting for the type revamp tests to run or something.)

StefanKarpinski on 6 Jan 2017
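
One way to do an eager, character-wise reversal without a wrapper type is sketched below. It assumes Julia 1.x (`lastindex`, `prevind`, `take!`), and the function name `reverse_chars` is hypothetical; it is not the implementation that ended up in Base.

```julia
# Hypothetical helper: builds a reversed copy by walking the valid character
# indices from the end of the string to the start.
function reverse_chars(s::AbstractString)
    io = IOBuffer()
    i = lastindex(s)
    while i >= firstindex(s)
        print(io, s[i])      # writes the whole (possibly multi-byte) character
        i = prevind(s, i)    # steps to the start of the previous character
    end
    return String(take!(io))
end

reversed = reverse_chars("a\u00f1b")
```

Stepping with `prevind` instead of `i - 1` is what keeps multi-byte characters intact.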

Other than removing RevString everything that's likely to be done is already done.

StefanKarpinski on 20 Jul 2017

🎉3
