Julia: Stringapalooza

Created on 29 Apr 2016 · 64 Comments · Source: JuliaLang/julia

0.5 Major tasks

Round 1


0.6 Major tasks

Round 4

Labels: excision, needs docs, strings, unicode

All 64 comments

Great list.

What about windows APIs that use utf-16?

> What about windows APIs that use utf-16?

Already taken care of: https://github.com/JuliaLang/julia/pull/15033.

Two more potential rounds:

  • removing RepString and RevString
  • merging SubString and String

Might not get to those until the next release though.

What about:

  • String iteration: making this faster, based on your experiments here in the past
  • Decide on String indexing: #9297

I definitely think there's a lot to play with around merging SubString and String, but it certainly feels like 0.6 material.

It was just added, but now that readstring(io) is just String(read(io)), maybe it doesn't need to keep a separate name forever.

just for my curiosity: Why should this be part of 0.5? The title of the release had something to do with Arrays ... not Strings?

+1 for including as much of this plan as possible into 0.5. Breakage better happen soon.

As regards moving ASCIIString, etc. to a package (round 3, bullet 2), see discussion on an implementation for any encoding in StringEncodings.jl here.

As regards changing Char's underlying representation (round 4, bullet 1), another step would be to introduce AbstractChar and use it in method signatures, to allow e.g. the ASCIIString replacement to implement ASCIIChar <: AbstractChar as a UInt8. This would allow people working with ASCII data to actually enjoy higher performance than before.

EDIT: Finally, I'd like to see a discussion of whether String should be guaranteed to hold only valid UTF-8 data, and of how to handle file paths (special string type or not). But I don't want to derail this already rich thread, so maybe better open a separate issue for that?

> just for my curiosity: Why should this be part of 0.5? The title of the release had something to do with Arrays ... not Strings?

Because I've been working on this for months and it's ready to go. It won't hold up the release anyway.

I somehow agree, it will not hold the release of the language. But more syntax changes in 0.5 give some impact in porting packages to 0.5; maybe I'm just wrong that this causes effort ...

> I somehow agree, it will not hold the release of the language. But more syntax changes in 0.5 give some impact in porting packages to 0.5; maybe I'm just wrong that this causes effort ...

And more syntax changes in 0.6 will cause effort for even more packages. And given the growth rate of the package ecosystem...

#15033 didn't fully provide a path for external packages to call Windows APIs without calling internal functions.

True – added to the roadmap.

Actually, question here: do people think we should keep ascii and have it convert strings to standard String type and error if the content is not plain ASCII? It's kind of a useful function to have.

Another question about behavior. String and string don't actually behave in the same manner always:

julia> String(UInt8[97,98,99])
"abc"

julia> string(UInt8[97,98,99])
"UInt8[97,98,99]"

Any thoughts on resolving this? Currently utf8 and ascii behave like String not like string.

Maybe String! if it takes ownership of the array, i.e. is "in-place"?

There is also the bytestring function, whose name seems like a holdover from ByteString, but whose function is essential.

I'm pretty sure that usage of ! does not have the @JeffBezanson seal of approval.

String¡, then.

This case doesn't bother me that much --- unlike utf8 and ascii, string is not an encoding.

The point is, we need some function to replace bytestring(::Vector{UInt8}) and bytestring(::Ptr{UInt8}, [len]) (makes a copy of bytes), but also UTF8String(::Vector{UInt8}) and pointer_to_string (doesn't make a copy).

It makes sense to me to name them all the same thing, with ! for the "in-place" versions. But I don't know what that name should be. Maybe bytestring and bytestring!, where byte refers to the encoding? Or utf8 and utf8! to be more explicit about the encoding, with ascii doing an additional assert?

Can we just keep bytestring?

@JeffBezanson, bytestring is fine if a bit vague, but it always makes a copy. Would you be okay with bytestring! for the non-copying version?

I'm not prepared to make this the _first ever_ case where ! means something other than mutation.

Then what do we call the non-copying version(s) of bytestring?

I've probably missed some of the discussion here, but can String have the same constructors that UTF8String had? Plus one with a Ptr argument?

@JeffBezanson, the problem is that then it can't replace string.

That's fine with me. Somehow we need one function that wraps a UInt8 vector as a string, and another that gives you the output of print as a string. We could rename string to something like sprint, except that's taken.

It's a little weird if String and string do almost all the same things – except if called on a byte array.

Namewise, yes it's confusing, but functionally they don't seem all that similar to me:

julia> String("a", "b", "c")
ERROR: MethodError: no method matching String(::String, ::String, ::String)
Closest candidates are:
  String{T}(::Any)

@StefanKarpinski, I thought @JeffBezanson's proposal was that String should _only_ construct a String from a byte array (and maybe from another AbstractString), and that if you want an arbitrary object's string representation you should continue to call string. Hence they will do entirely different things.

Yes.

bytestring still seems to be the odd man out. The proposal is that: string(x...) makes a string representation of x..., String(a) makes a string out of a byte array (in-place), and bytestring(a) makes a string out of a byte array (out-of-place)?

I feel like this has come up in a few other cases (conversion vs. construction). I think it'd be good to separate the two with different methods here:

String(::Vector{UInt8}) => constructs String from byte vector
String(::Ptr{UInt8}, len, copy::Bool=true) => constructs String from pointer + len, making copy by default

# conversions; using `string` or perhaps a slightly more distinct `tostring`
tostring(::Vector{UInt8}) => string representation of the byte vector
#etc.

@stevengj Yes. That arrangement is relatively non-breaking. But it looks like we could perhaps merge bytestring and pointer_to_string?

string isn't really a conversion --- in the sense of convert, at least. An accurate name would be something like print_to_string, which is just too long.

> The proposal is that: string(x...) makes a string representation of x..., String(a) makes a string out of a byte array (in-place), and bytestring(a) makes a string out of a byte array (out-of-place)?

In that case, we could just spell bytestring(a) as String(copy(a)).
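A minimal sketch of that equivalence, assuming a Julia version in which String(::Vector{UInt8}) exists as proposed above; the variable names are illustrative only. Copying first means the original array stays intact whether or not the constructor takes ownership of its argument.

```julia
# Byte data we want to keep using after building the string.
bytes = UInt8[0x61, 0x62, 0x63]      # "abc" in ASCII/UTF-8

# Out-of-place construction: copy first, then hand the copy to String.
s = String(copy(bytes))

# The original vector is still ours to modify; the string is unaffected.
bytes[1] = 0x7a                      # 'z'
```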

That's true, and the remaining cases are pointers that can be handled by pointer_to_string (which seems to be undocumented BTW).

What about when you want to convert an AbstractString of some other type to a String. Do we write that as String(s) or convert(String, s) or what? I think that @quinnj is proposing that pointer_to_string be replaced by String(p::Ptr{UInt8}, len=strlen(p); copy::Bool=true).

I actually left copy as a positional argument rather than keyword, since this might be a performance-sensitive function (i.e. CSV, ODBC, SQLite getting data from other C libraries as pointers).

Sure. That seems ok. The next function improvement we need seems to be making keywords faster.

For pointers, I would prefer just having a String(p::Ptr{UInt8}, len=cstrlen(p)) method that makes a copy, analogous to bytestring(p, len) now. bytestring(ptr) is extremely common (probably more common than bytestring(array), because of its utility in calling C functions), so we should try to replace it with something simple and efficient like String(ptr) rather than the slower and more cumbersome String(copy(pointer_to_string(ptr))).

I implemented pointer_to_string for some internal stuff and left it undocumented out of caution (maybe I shouldn't have exported it?), because it is somewhat "unsafe" (analogous to pointer_to_array). Maybe we should leave it as-is, as it parallels pointer_to_array, and decide whether to document it. I don't think we should encourage casual use by including it in the String constructor.

Summary:

  • string(args...) ⟶ string(args...)
  • bytestring(array) ⟶ String(copy(array))
  • bytestring(ptr, [len]) ⟶ String(ptr, [len])
  • pointer_to_string ⟶ pointer_to_string, analogous to pointer_to_array

I think String(s::AbstractString) should work; no reason to force the user to call convert here.

Ok, it seems like we have a plan here:

  • [x] String(a::Vector{UInt8}) takes "ownership" of a
  • [x] String(s::String) is just the identity?
  • [x] String(s::AbstractString) converts strings of other types to String
  • [x] String(p::Ptr{UInt8}, len::Integer=strlen(p)) from a byte pointer, copies data
  • [x] string(x...) stringifies (via print) its arguments and concatenates to a single String
  • [x] bytestring is deprecated in favor of String, #16453
  • [x] utf8 is deprecated in favor of String, #16469
  • [x] ascii remains and converts to String, erroring if the data is non-ASCII, #16396
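The non-pointer items in this checklist can be sketched as follows. This is only an illustration of the intended behaviour, assuming a Julia version where these methods exist as listed; the variable names are made up.

```julia
# String(::AbstractString) converts another string type to String.
sub = SubString("hello world", 1, 5)
converted = String(sub)

# string(x...) prints each argument and concatenates the results.
joined = string("x = ", 1, ", y = ", 2.5)

# ascii accepts pure-ASCII data and returns it as a String.
plain = ascii("abc")
```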

The only thing I'm wondering is what about a function for concatenating to a specific type of string (not String) and/or concatenating strings of the same type to get a string of that same type? This API doesn't seem to include anything for that. I guess we could define string to support that?

I would really discourage you from adding a copy argument to the String constructor (in which case you would also need an owns argument to fully replicate pointer_to_string); people should _not_ use this feature casually. It is on the same footing, safety-wise, as pointer_to_array, which we carefully segregate into the "unsafe pointer games" section of the manual. We aren't merging pointer_to_array into the Array constructor, after all.

That's probably a good call. There was talk once of doing an Unsafe module; where did that end up? Worth giving a shot?

So no methods of String to construct strings from pointers?

I think we would just have String(p::Ptr{UInt8}, len::Integer=strlen(p)), that would always make a copy and pointer_to_string that didn't make a copy.

> I think we would just have String(p::Ptr{UInt8}, len::Integer=strlen(p)), that would always make a copy and pointer_to_string that didn't make a copy.

If we keep pointer_to_string(), then let's move all pointer-related operations to it. I don't like the idea of mixing these low-level operations with the standard String constructor.

Regarding string(), I'm afraid we'll have to merge it with String() as well for consistency with Symbol. Indeed, with the recent changes, Symbol(1, 2) parallels string(1, 2), not String(1, 2). But then I don't know how to handle String(a::Vector{UInt8}); maybe via convert?

@nalimilan, bytestring(ptr, [len]) (which makes a copy) has been used without problems for years now and is extremely common in ccall-using code; if the bytestring function goes away, it makes sense to keep this functionality easily accessible (e.g.) in the String constructor. pointer_to_string is in a completely different category — like pointer_to_array, it is extremely unsafe if used casually, so it makes sense to segregate it into the "unsafe" section of the manual. The two cases aren't comparable at all.
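A rough sketch of the copying case described here. Note the assumptions: in released Julia versions the copying pointer conversion is spelled unsafe_string rather than String(ptr), a name that does not appear in this thread and is used only so the sketch runs on a current version; GC.@preserve likewise assumes Julia 0.7 or later. The buffer is a stand-in for memory handed back by a C library.

```julia
# A NUL-terminated buffer standing in for data returned by a C library.
buf = UInt8[0x68, 0x69, 0x00]        # "hi\0"

# Copy the bytes up to the NUL into a new String. GC.@preserve keeps `buf`
# alive while its raw pointer is in use.
copied = GC.@preserve buf unsafe_string(pointer(buf))

# Because the data was copied, later writes to the buffer do not affect it.
buf[1] = 0x48                        # 'H'
```

The non-copying variant is deliberately left out: it is the case the comment above calls unsafe for casual use.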

One option would be to make string(...) -> String(...), and use either bytestring or utf8 to construct a String from Array{UInt8} or Ptr{UInt8}, on the theory that the latter cases require you to explicitly specify an encoding.

> Symbol(1, 2) parallels string(1, 2), not String(1, 2).

It seems a little weird to me that we have this method at all.

> It seems a little weird to me that we have this method at all.

@StefanKarpinski I think the most common use of it is Symbol("s", i) with i an integer (see the diff at https://github.com/JuliaLang/julia/pull/16154). It can easily be replaced with Symbol(string("s", i)), though it's a bit more verbose.

@stevengj If we indeed keep string() and use String only to construct a String from a series of bytes or from another string, then I'm OK with it accepting pointers too.

Or Symbol("$s$i") or Symbol(s*i) or if https://github.com/JuliaLang/julia/issues/9945 happens, you could even write :"$s$i", assuming we decide to support interpolation in that kind of symbol literal.
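For comparison, the spellings mentioned in the last few comments can be put side by side. This is a small sketch with an illustrative prefix s and integer i; it assumes the multi-argument Symbol method is available.

```julia
s = "s"
i = 3

a = Symbol(s, i)              # multi-argument form discussed above
b = Symbol(string(s, i))      # explicit string() call
c = Symbol("$s$i")            # string interpolation
```

All three produce the same symbol. Symbol(s*i) only applies when i is itself a string, since * is not defined between a String and an integer.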

Hi @StefanKarpinski! How are these changes going to affect string indexing?
There are some formats (e.g. PDB) where some values are at fixed indexes (e.g. the chain identifier in column 22 and the residue name in 18:20). Would it be safe to do line[22] with the new String type?

The current work doesn't change anything really – it just essentially renames UTF8String to String and gets rid of ASCIIString.
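A hedged sketch of the fixed-column case asked about above. String indices are byte offsets into UTF-8 data, so column arithmetic is reliable when the line is pure ASCII. The record below is made up for illustration and is not real PDB data.

```julia
# Hypothetical fixed-width record: columns 18:20 hold a residue name and
# column 22 a chain identifier. The contents are invented for this example.
line = rpad("ATOM", 17) * "GLY" * " " * "A"

# Indices are byte offsets, so column positions are only trustworthy when
# every character occupies exactly one byte.
if isascii(line)
    residue = line[18:20]
    chain = line[22]
else
    error("line contains non-ASCII data; column indexing is not safe")
end
```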

I raised this in #16396 but didn't get a response there, want to make sure we decide something. It's related to a few of the checkboxes.

For invalid ascii (and later, invalid utf-8) do we want to throw a more specific exception type than ArgumentError? We have UnicodeError now, but I suspect that will be significantly refactored when UTF16 and UTF32 types get moved out of base.

UnicodeError is going away, so no, this is just an ArgumentError.

That's also the least change since it's what the ascii function threw previously.

Well we are in the midst of changing a lot of other things here. IMO it would be worth being more granular with the exception types, so the reporting will be more specific. This isn't one that you're likely to want to catch since it would be better to branch on isascii, but you may still want to deal with it when ascii throws deep inside some API you don't have control over.
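The two options mentioned here, branching on isascii versus handling the error from ascii, might look like the sketch below. Both helper names are hypothetical, and the ArgumentError type is taken from the comments above.

```julia
# Returns the input as a String if it is pure ASCII, otherwise `nothing`,
# without relying on exception handling. `try_ascii` is a hypothetical helper.
try_ascii(s::AbstractString) = isascii(s) ? ascii(s) : nothing

# When ascii is called somewhere that cannot be changed, the error can still
# be caught. `ascii_or_message` is likewise a hypothetical helper.
function ascii_or_message(s::AbstractString)
    try
        return ascii(s)
    catch err
        # Only handle the error ascii is expected to raise; pass others on.
        err isa ArgumentError || rethrow()
        return "not ASCII"
    end
end
```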

To be perfectly honest, worrying about more granular exception types with all this other stuff to deal with in this series of changes is more than I care to think about. If you have a coherent plan for string exception types, feel free to write it up.

TODO: Cstring and Cwstring should ensure that string data is NUL terminated (as well as NUL free).
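The NUL-free half of this can be observed with a small ccall, sketched here under the assumption of a current Julia version and a C library that exports strlen. It illustrates the check only and is not the implementation referred to in the TODO.

```julia
# Passing a String as Cstring works when it has no embedded NUL bytes.
n = ccall(:strlen, Csize_t, (Cstring,), "abc")

# A string containing NUL is rejected during conversion, before the C
# function is ever called.
rejected = try
    ccall(:strlen, Csize_t, (Cstring,), "a\0b")
    false
catch err
    err isa ArgumentError
end
```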

Ref https://github.com/JuliaLang/julia/issues/16499. Also need a way to express conversion of String to non-NUL-terminated UTF-16 data with known length. Perhaps convert(Vector{UInt16}, s)?

@StefanKarpinski, in previous incarnations, any UInt8 array allocated by Julia was automatically NUL-terminated internally; is this no longer the case? And UTF16String and UTF32String were NUL-terminated, so convert(Cwstring, string) was also.

Which items are slated for 0.5? Through round 3, IIRC?

Yes, that's correct. Tomorrow/Wednesday I need to create a LegacyStrings package, put all the Unicode stuff in it and then merge my PR that removes all of that stuff with deprecations that point at it.

PR is up to remove RepString; it has already been added to LegacyStrings. RevString is used by some Base functions so we might want to leave it for now. Anything else here planned for 0.6?

While you're doing string stuff, it's probably not too hard to just actually do UTF-8 reversal on strings instead of using the RevString type – that would advance the highlander agenda just a bit further. (If you feel like it and have some spare time while waiting for the type revamp tests to run or something.)
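To illustrate what reversing UTF-8 data properly involves, here is a small sketch that reverses by character and contrasts it with a naive byte reversal. It is not the Base implementation, and the sample string is arbitrary.

```julia
s = "añb"                       # 'ñ' takes two bytes in UTF-8

# Reverse character by character and build a new String.
by_chars = join(reverse(collect(s)))

# Reversing the raw bytes instead would split the multi-byte character.
by_bytes = reverse(Vector{UInt8}(s))
```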

Other than removing RevString everything that's likely to be done is already done.
