So a discussion came up on Slack that this behaviour

```julia
julia> x = UInt8[73, 74]
2-element Array{UInt8,1}:
 0x49
 0x4a

julia> String(x)
"IJ"

julia> x
0-element Array{UInt8,1}
```

is fairly surprising.
One normally expects function calls to be explicit if they mutate data, and I think constructors like `String` should be held to similar standards. I see there was some discussion in #26093, and the conclusion seemed to be that since there is no type `String!`, naming the constructor `String!` doesn't make sense, especially because (according to those more knowledgeable than I) nearly 100% of uses of `String` want the memory-stealing behaviour.

I think it is worth revisiting this issue. I'd argue that even though one usually wants memory stealing, it's best to always be explicit when that happens, either through a keyword argument, e.g. `String(x, steal=true)`, or through a different constructor name, `String!`.
The current behaviour seems like a real footgun to me, even if it's convenient when you know what you're doing.
There is the issue that changing the current stealing behavior is technically breaking. It may be that no one is relying on this behavior so that it could be considered a "minor change" but we'd have to test that out. I'm guessing that the most common case is that after a byte vector has been "stolen" it is never used again so the code won't care if it's truncated or not.
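In the meantime, code that cannot afford to lose its vector can defend with an explicit `copy` before constructing the string; a minimal sketch of the workaround on current Julia versions:

```julia
x = UInt8[0x49, 0x4a]   # bytes for "IJ"

# Copying first means only the copy's memory is stolen by the
# constructor; the original vector is left intact.
s = String(copy(x))

# s == "IJ", and x still holds its 2 bytes
```

This trades the zero-copy performance for predictability, which is exactly the tension discussed below.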
`String(take!(io))` is the de facto performant way of getting a string out of an IO object (cf. the docs for `String`):

> When possible, the memory of v will be used without copying when the String object is created. This is guaranteed to be the case for byte vectors returned by `take!` on a writable IOBuffer and by calls to `read(io, nb)`. This allows zero-copy conversion of I/O data to strings.
Breaking this guarantee is to me "breaking" even if the code still (eventually) produces the same answer.
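For reference, the zero-copy path the docs guarantee is the usual string-builder pattern; a small sketch:

```julia
io = IOBuffer()            # writable, so take! hands back its buffer
print(io, "zero-copy ")
print(io, "builder")

bytes = take!(io)          # Vector{UInt8} owning the buffer's memory
s = String(bytes)          # reuses that memory without copying

# s == "zero-copy builder"; bytes has been truncated to length 0
```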
I wouldn't be surprised or even necessarily upset if this was considered too big a change for a minor release (though, if possible, I'd still prefer it in a minor release). However, I think these are confusing semantics that go against standard naming conventions, and this should at minimum be considered for 2.0.
Telling people that mutating functions are marked by a `!` and then not following that convention in seemingly arbitrary cases is worse than not having the convention at all, since it lures one into a false sense of security when reading code.
It's worth remembering what we did before, which was to share data (if possible) but not truncate the vector. After all, truncating the vector is not necessary. Then you just need to tell people not to mutate the vector later. Frankly, I liked that a bit better, since `String(x)` was indeed non-mutating, and mutating a vector after turning it into a string is really rare. Of course, the objection was that if you did that, it would cause hard-to-find bugs. That was "solved" by making the behavior predictably bad every time, so the potential problem is easier to catch. I get the reasoning, but that's still not a philosophy I fully agree with.
There was also an intermediate time when this would sometimes truncate the byte array and sometimes not, depending on details of how the byte array was constructed. For 2.0, this would be a great use case for @keno's freeze/thaw stuff, since you could just say that `String` requires a frozen vector and have the array and the string safely share memory. If someone wanted to mutate the byte array afterwards, they'd have to thaw it, which would at that point create a copy.
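Under that proposal the call sites might look roughly like this (purely hypothetical sketch; `freeze` and `thaw` are the verbs discussed in #31630, not an existing API):

```julia
bytes = take!(io)       # mutable byte vector
frozen = freeze(bytes)  # hypothetical: make the array immutable
s = String(frozen)      # memory can be shared safely: frozen never changes

w = thaw(frozen)        # hypothetical: back to mutable, copying at this point
push!(w, 0x21)          # mutating w cannot affect s
```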
How much of the surprise factor here would be cured if the function were called `String!` instead?
This exact behavior once cost me an hour of debugging before I finally figured out what was going on :smile:. I would much prefer if `String(::Vector{UInt8})` made a copy instead, with an explicit option to move and truncate the underlying bytes. This would technically be "merely" a performance regression, though I acknowledge that it may be big enough in practice to be considered breaking.
Would it be crazy to just deprecate (and eventually delete) the `String(::Array)` constructor, and have `String!` be the _only_ string constructor for arrays of chars? This would continue the current behavior (never copying) and just amount to a _rename_ to make the behavior clearer.
That's definitely a breaking change, so unfortunately, yes, it's "crazy" for a 1.x release. We could, however, leave `String`, change it to never truncate, and introduce `String!` with the current behavior of `String`. That has the issue that it makes `String(take!(io))` copying and therefore slow.
> That's definitely a breaking change, so unfortunately, yes, it's "crazy" for a 1.x release.
Ah, yes, sorry, I was predicating that on saving this for 2.0, per @MasonProtter's point.
> That has the issue that it makes `String(take!(io))` copying and therefore slow
Yeah, that's what I was hoping to avoid by forcing everyone to do the replacement from `String` to `String!` for those cases.
If we're willing to wait until 2.0 to do this, then yes, we can deprecate `String` in favor of `String!` and then later reintroduce `String` to always make a copy. But for 2.0, as I said above, we may well have the ability to freeze and thaw arrays and require passing a frozen byte array to `String`.
One thing to keep in mind in this discussion is that people often automatically generate constructors via a `where T` and then do `T(data)`.
On one hand, we don't want to suddenly make that pattern less performant when applied to strings, but on the other hand, I think it's fair to say that I would be pretty baffled if I wrote something like `T(data)` and my data suddenly disappeared. So the freeze-thaw thing (#31630) does kinda sound like the best of both worlds.
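To illustrate the generic-constructor concern, here is a hedged sketch (the helper `frombytes` is made up for this example):

```julia
# A generic builder: the call site T(data) gives no hint of mutation.
frombytes(::Type{T}, data::Vector{UInt8}) where {T} = T(data)

x = UInt8[0x68, 0x69]     # bytes for "hi"
s = frombytes(String, x)  # String steals x's memory here

# s == "hi", but x is now empty, which is baffling in generic code
```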
> requires a frozen vector
In Keno's proposal you get that by making a copy of the vector. So we could also just use the existing `copy` function, if that's what we want. In the future (independent of #31630), in some cases we could implement a copy-elision pass. It doesn't make much difference to the compiler whether you call it copy-elision or freeze-elision; in this case they're probably much the same computation, since the result escapes (i.e., is returned from a function). In the string-builder case, the compiler already has a really hard time figuring out types (which are always immutable). Realistically, I'm not confident the compiler will ever be able to elide this particular copy in the cases we most care about.
Can this be marked for the 2.0 milestone so that it is not missed accidentally?