Julia: Conversion of String to Vector{UInt8}

Created on 29 Oct 2017  路  16Comments  路  Source: JuliaLang/julia

Consider the following code:

s = "123"
v = Vector{UInt8}(s)

Now changing v mutates s.
Since strings should be immutable I would it find it more natural to:

  • define reinterpret(UInt8, s::String) that would perform such in-place conversion (as this is what one would expect from reinterpret);
  • make convert(Vector{UInt8}, s::String) allocate a new vector (so that users do not accidentally mutate strings).

Additionally this would be consistent with the general behavior that Vector{UInt8}(s) for AbstractString other than String would create a copy.

A non-breaking alternative would be to add an information to the documentation that this conversion is in-place.

strings

Most helpful comment

Another idea: take!(buf, String), matching read. That way take! can just be considered a destructive version of read.

All 16 comments

It does seem like observing the mutation of a string should need to involve something unsafe_: all of these are normal and apparently safe operations, which would seem to violate string immutability.

If we want to be on a safe side then String(::Vector{UInt8}) also should make a copy by default. The good thing is that it is documented that it does not do it now (as opposed to Vector{UInt8}(::String)), but changing this would be consistent with the above line of thinking.

Regarding the names - there are probably people with more experience with naming conventions in Base than me. Maybe we can use unsafe_string_to_array and unsafe_array_to_string as the underlying C functions are jl_string_to_array and jl_array_to_string. However, I am open to suggestions.

In summary the proposal would be to:

  • change Vector{UInt8}(::String) so that it performs a copy;
  • change String(::Vector{UInt8}) so that it performs a copy;
  • define two new functions performing unsafe conversions between String and Vector{UInt8}; the proposed names are unsafe_string_to_array and unsafe_array_to_string.

If the above are agreed I can make an appropriate PR.

That does sound good. Another possible name suggestion would be to make this a method of unsafe_wrap

Maybe we should add this to triage, since this would be a potentially breaking change.

This is going to be a performance disaster. The main problem is that String(take!(buf)) is the idiom for converting an IOBuffer (used as a string builder) to a String in-place. reinterpret(String, take!(buf)) is too long and weird to take its place. Changing String( ) requires a solution to this.

Changing String( ) was less of a priority in my opinion as it is well documented - I have added it later to the discussion to have a complete picture.

From my point of view the key problem is to make Vector{UInt8}(s) conversion make a copy. And if only this part would be left, and reinterpret is considered not a very good name for conversion from String to Vector{UInt8}, actually I like adding the unsafe_wrap(::Type{Vector{UInt8}},::String) method as it would not introduce a new name into Base.

As long as some unsafe_ function needs to be called in order to observe string mutability, I think it's fine. The real trouble here is that it's possible to observe string mutation without calling any explicitly unsafe functions.

I think the following would be safe and efficient:

  1. Make Vector{UInt8}(s) copy, as requested here. Internally, it should use a String for the new storage so you can go back to String without yet another copy.
  2. String(vec) can continue to share memory (if the vector is String-backed), but we also set the vector's size to 0 and remove its storage so that it's impossible to mutate the string via the vector from then on.

Being able to "destroy" arrays like that would be generally useful for APIs where a new object "takes control" of the array's contents and anyone else mutating it after that is generally dangerous. And not just for arrays, if we had deterministic finalization a la #7721, being able to make any usage of the finalized object an error after finalization would be ideal.

With this, we might want to export StringVector as well, since using it becomes safer.

Accepted by triage. We can go with the plan in https://github.com/JuliaLang/julia/issues/24388#issuecomment-341244445. No need to export StringVector.

It would be good to know how many uses of Vector{UInt8}(str) there are. There are a fair number in Base. I think most cases just want to access the bytes in the string, which can be done more efficiently using codeunit. The problem is that changing this will silently give worse performance, so we might want to consider deprecating it at least temporarily.

This reminds me of a possible Hermitian! constructor which was evoked at https://github.com/JuliaLang/julia/pull/17367#issuecomment-231796917. The problem was somewhat similar: the constructor would have modified the input (slightly) to have it match the type's requirements. A discussion happened about what constructors are allowed to do to their inputs even if they don't have ! in their names.

If we make String(x::Vector{UInt8}) "destroy" its input, we should make it very clear in the manual that all constructors are allowed to do that. Personally, I'd rather use String! for that, since the vast majority of constructors will treat their inputs as const: to keep the language simple, better state that constructors do not mutate their inputs, and force the exceptions to use !. It's easy to replace String(take!(buf)) with String!(take!(buf)) in existing code to preserve performance, that's even something that Femtocleaner could propose by default.

Another idea: take!(buf, String), matching read. That way take! can just be considered a destructive version of read.

read!!?

Current behaviour of Vector{UInt8}(s) is inconsistent between String and SubString, which is correct? Is the intended behaviour documented somewhere? (there is no mention of strings in ?Vector{UInt8}).

julia> VERSION
v"1.0.1"

julia> s, ss = "FOO", SubString("FOO")
("FOO", "FOO")

julia> v, vv = Vector{UInt8}(s), Vector{UInt8}(ss)
(UInt8[0x46, 0x4f, 0x4f], UInt8[0x46, 0x4f, 0x4f])

julia> v[2], vv[2] = 42, 42
(42, 42)

julia> v, vv
(UInt8[0x46, 0x2a, 0x4f], UInt8[0x46, 0x2a, 0x4f])

julia> s, ss
("F*O", "FOO")

unsafe_wrap behaves as expected:

julia> s, ss = "FOO", SubString("FOO")
("FOO", "FOO")

julia> v, vv = unsafe_wrap(Vector{UInt8}, pointer(s), ncodeunits(s)),
               unsafe_wrap(Vector{UInt8}, pointer(ss), ncodeunits(ss))
(UInt8[0x46, 0x4f, 0x4f], UInt8[0x46, 0x4f, 0x4f])

julia> v[2], vv[2] = 42, 42
(42, 42)

julia> s, ss
("F*O", "F*O")

Woops, good catch!

AFAICT https://github.com/JuliaLang/julia/pull/25241 deprecated Vector{UInt8}(s), but after removing the deprecation in 1.0 the fallback AbstractString method is used. Unfortunately it's defined as unsafe_wrap(Vector{UInt8}, String(s)), which makes a copy for non-String arguments, but not for String arguments... Looks like we forgot to change the behavior in 1.0.

Since strings are supposed to be immutable, I'd say it's OK to change this to make a copy in 1.1. That would qualify as a "minor change".

Was this page helpful?
0 / 5 - 0 ratings