So a discussion came up on Slack that this behaviour

```julia
julia> x = UInt8[73, 74]
2-element Array{UInt8,1}:
 0x49
 0x4a

julia> String(x)
"IJ"

julia> x
0-element Array{UInt8,1}
```

is fairly surprising.
One normally expects function calls to be explicit if they mutate data, and I think constructors like `String` should be held to similar standards. I see there was some discussion in #26093, and the conclusion seemed to be that since there is no type `String!`, naming the constructor `String!` doesn't make sense, especially because (according to those more knowledgeable than I) nearly 100% of uses of `String` want the memory-stealing behaviour.

I think it is worth revisiting this issue. I'd argue that even though one usually wants memory stealing, it's best to always be explicit when that happens, either through a keyword argument, e.g. `String(x, steal=true)`, or through a different constructor name, `String!`.
The current behaviour seems like a real footgun to me, even if it's convenient when you know what you're doing.
There is the issue that changing the current stealing behavior is technically breaking. It may be that no one is relying on this behavior so that it could be considered a "minor change" but we'd have to test that out. I'm guessing that the most common case is that after a byte vector has been "stolen" it is never used again so the code won't care if it's truncated or not.
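In the meantime, code that cannot afford to lose its vector can defend with an explicit `copy` before constructing the string; a minimal sketch of the workaround on current Julia versions:

```julia
x = UInt8[0x49, 0x4a]   # bytes for "IJ"

# Copying first means only the copy's memory is stolen by the
# constructor; the original vector is left intact.
s = String(copy(x))

# s == "IJ", and x still holds its 2 bytes
```

This trades the zero-copy performance for predictability, which is exactly the tension discussed below.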
`String(take!(io))` is the de facto performant way of getting a string out of an IO object (cf. the docs for `String`):

> When possible, the memory of v will be used without copying when the String object is created. This is guaranteed to be the case for byte vectors returned by `take!` on a writable IOBuffer and by calls to `read(io, nb)`. This allows zero-copy conversion of I/O data to strings.
Breaking this guarantee is to me "breaking" even if the code still (eventually) produces the same answer.
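For reference, the zero-copy path the docs guarantee is the usual string-builder pattern; a small sketch:

```julia
io = IOBuffer()            # writable, so take! hands back its buffer
print(io, "zero-copy ")
print(io, "builder")

bytes = take!(io)          # Vector{UInt8} owning the buffer's memory
s = String(bytes)          # reuses that memory without copying

# s == "zero-copy builder"; bytes has been truncated to length 0
```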
I wouldn't be surprised or even necessarily upset if this was considered too big a change for a minor release (though, if possible, I'd still prefer it in a minor release). However, I think these are confusing semantics that go against standard naming conventions, and this should at minimum be considered for 2.0.
Telling people that mutating functions are marked by a `!` and then not following that convention in seemingly arbitrary cases is worse than not having the convention at all, since it lures one into a false sense of security when reading code.
It's worth remembering what we did before, which was to share data (if possible) but not truncate the vector. After all, truncating the vector is not necessary. Then you just need to tell people not to mutate the vector later. Frankly, I liked that a bit better, since `String(x)` was indeed non-mutating, and mutating a vector after turning it into a string is really rare. Of course, the objection was that if you did that, it would cause hard-to-find bugs. That was "solved" by making the behavior predictably bad every time, so the potential problem is easier to catch. I get the reasoning, but that's still not a philosophy I fully agree with.
There was also an intermediate time when this would sometimes truncate the byte array and sometimes not, depending on details of how the byte array was constructed. For 2.0, this would be a great use case for @keno's freeze/thaw stuff, since you could just say that `String` requires a frozen vector and have the array and the string safely share memory. If someone wanted to mutate the byte array afterwards, they'd have to thaw it, which would at that point create a copy.
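Under that proposal the call sites might look roughly like this (purely hypothetical sketch; `freeze` and `thaw` are the verbs discussed in #31630, not an existing API):

```julia
bytes = take!(io)       # mutable byte vector
frozen = freeze(bytes)  # hypothetical: make the array immutable
s = String(frozen)      # memory can be shared safely: frozen never changes

w = thaw(frozen)        # hypothetical: back to mutable, copying at this point
push!(w, 0x21)          # mutating w cannot affect s
```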
How much of the surprise factor here would be cured if the function were called `String!` instead?
This exact behavior once cost me an hour of debugging before I finally figured out what was going on :smile:. I would much prefer if `String(::Vector{UInt8})` made a copy instead, with an explicit option to move and truncate the underlying bytes. This would technically be "merely" a performance regression, though I acknowledge that it may be big enough in practice to be considered breaking.
Would it be crazy to just deprecate (and eventually delete) the `String(::Array)` constructor, and have `String!` be the _only_ string constructor for arrays of chars? This would continue the current behavior (never copying) and just amount to a _rename_ to make the behavior clearer.
That's definitely a breaking change, so unfortunately, yes, it's "crazy" for a 1.x release. We could, however, leave `String`, change it to never truncate, and introduce `String!` with the current behavior of `String`. That has the issue that it makes `String(take!(io))` copying and therefore slow.
> That's definitely a breaking change, so unfortunately, yes, it's "crazy" for a 1.x release.
Ah, yes, sorry, I was predicating that on saving this for 2.0, per @MasonProtter's point.
> That has the issue that it makes `String(take!(io))` copying and therefore slow
Yeah, that's what I was hoping to avoid by forcing everyone to do the replacement from `String` to `String!` for those cases.
If we're willing to wait until 2.0 to do this, then yes, we can deprecate `String` in favor of `String!` and then later reintroduce `String` to always make a copy. But for 2.0, as I said above, we may well have the ability to freeze and thaw arrays and require passing a frozen byte array to `String`.
One thing to keep in mind in this discussion is that people often automatically generate constructors via a `where T` and then do `T(data)`.
On one hand, we don't want to suddenly make that pattern less performant when applied to strings, but on the other hand, I think it's fair to say that I would be pretty baffled if I wrote something like `T(data)` and my data suddenly disappeared. So the freeze-thaw thing (#31630) does kinda sound like the best of both worlds.
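To illustrate the generic-constructor concern, here is a hedged sketch (the helper `frombytes` is made up for this example):

```julia
# A generic builder: the call site T(data) gives no hint of mutation.
frombytes(::Type{T}, data::Vector{UInt8}) where {T} = T(data)

x = UInt8[0x68, 0x69]     # bytes for "hi"
s = frombytes(String, x)  # String steals x's memory here

# s == "hi", but x is now empty, which is baffling in generic code
```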
> requires a frozen vector
In Keno's proposal you get that by making a copy of the vector. So we could also just use the existing `copy` function, if that's what we want. In the future (independent of #31630), in some cases we could implement a copy-elision pass. It doesn't make much difference to the compiler whether you call it copy-elision or freeze-elision; in this case they're probably much the same computation, since the result escapes (i.e., is returned from a function). In the string-builder case, the compiler already has a really hard time figuring out types (which are always immutable). Realistically, I'm not confident the compiler will ever be able to elide this particular copy in the cases we most care about.
Can this be marked for the 2.0 milestone so that it is not missed accidentally?