Clap: OsStrExt3 transmutes from an &[u8] to a OsStr

Created on 22 Jul 2019 · 11Comments · Source: clap-rs/clap

In this code:

https://github.com/clap-rs/clap/blob/784524f7eb193e35f81082cc69454c8c21b948f7/src/osstringext.rs#L23-L32

the transmute casts the &[u8] to a &OsStr. There are a couple problems with this:

This is not actually a safe thing to do, since &[u8] can be an arbitrary sequence of bytes, where as &OsStr cannot on Windows. On Windows, it internally is WTF-8 and it's not clear what, if anything, goes wrong when it isn't WTF-8. (But if it isn't WTF-8, then it could very well break a perfectly valid internal invariant that leads to UB.) A plausible alternative is to make from_bytes unsafe.
The fact that an &OsStr is internally a &[u8] on Windows that is WTF-8 is an implementation detail, and could actually change, leading to an incorrect transmute.

Is this code still present in clap 3? If so, could someone explain the motivation for this? I'd be happy to try to help brainstorm ways of removing it.

Source

BurntSushi

👍2

Most helpful comment

Yeah, I just don't think transmuting like this is a good idea. WTF-8 is an internal detail, and having an important crate in the ecosystem rely it is just not a good idea. The simplest way around this isn't to use encode_wide/decode_wide (because then you'd have to do everything in UCS-2 space), but rather, to just lossily decode the OsStr to UTF-8. In the vast majority of cases, this would only entail a UTF-8 check on Windows, and in the rare case where you have a OsStr with invalid UTF-16, you incur the allocation.

BurntSushi on 22 Jul 2019

👍2 👎1

All 11 comments

It's still present in v3-master. It's used to handle things like --path=bla and -ofoo.txt, hence trimming -'s and splitting at ='s or at specific positions.

These need redoing to use encode_wide()/decode_wide() I guess.

Out of interest, I just threw together this, which still assumes UTF/WTF-8, but at least tries to uphold the invariants. My editor crashed three times writing it but I'm sure it's fiOH GOD RAPTORS.

Freaky on 22 Jul 2019

BurntSushi on 22 Jul 2019

👍2 👎1

Without encode/decode_wide I don't see any way to handle it both correctly and safely. Lossy decoding should be right out, because it implies corrupting some valid program arguments, so we're just left with panicking on invalid UTF-8 on Windows.

Freaky on 22 Jul 2019

I am operating at a loss here, because I don't understand the context in which these routines are being used. Certainly, what you're saying is not _generally_ true. For example, starts_with can be implemented on Windows by lossily decoded the OsStr to UTF-8. The only case you'd miss out on is when the starts_with parameter contains a WTF-8 encoding of invalid UTF-16. So the question then becomes, in what circumstances does the parameter to starts_with actually contain meaningful WTF-8? Ideally, the answer to that is "never."

Taking a step back and re-reading your comment above, maybe now I'm starting to realize here. Specifically, I guess the OsStr is the thing itself that you need to parse, so even if you did a lossy decode, that wouldn't help you, because you wouldn't be able to easily move back to OsStr when it comes time to hand the arguments back to the caller.

I think I now see the predicament, I think you're right. I see three possible choices:

Push on OsStr to get string-like methods. I think there was an RFC for it, but I'm not sure if there has been any movement on it.
Use encode_wide to get UCS-2 code units and parse those directly. Then use decode_wide to convert the arguments back to OsString.
Use encode_wide, re-implement WTF-8 and translate the UCS-2 code units to WTF-8, and then treat it as you are today. To get back, you then need to re-apply decode_wide to get OsString for every argument.

The latter two imply quite a bit of work, and at least some additional performance overhead. _However_, the performance overhead would only occur when the OsStr couldn't be translated to UTF-8. When it can be converted to UTF-8, then you'd just do that.

cc @SimonSapin - As the architect of WTF-8, what do you think the suggest path here should be? (I still continue to think the internal representation should just be exposed, despite the strong principles against doing so. The fact that we have people transmuting to the internal representation is going to make it de facto exposed eventually anyway.)

BurntSushi on 22 Jul 2019

As an exercise I tried implementing the trait methods safely and came up with this.

On Unix it just deals with byte slices, on Windows it tries converting to a str and falls back to a Vec<u16>, with the methods returning Cow<OsStr>.

Freaky on 23 Jul 2019

That's a lot more code, but on an initial skim, it looks good? It does seem unfortunate that everyone has to pay for the Cow though, although the cost should be pretty small.

BurntSushi on 23 Jul 2019

We have an RFC that changes the memory representation of OsStr on Windows: https://rust-lang.github.io/rfcs/2295-os-str-pattern.html / https://github.com/rust-lang/rust/issues/49802. This RFC is accepted, but not implemented yet.

Officially exposing the byte representation would likely make this kind of change a breaking change. I would much prefer adding string-like methods to OsStr. In fact this is exactly what this RFC does. But indeed there hasn’t been much activity to implement since it was accepted a while ago.

SimonSapin on 23 Jul 2019

Had a go at integrating OsStrOps and came up with that. Can surely be improved, but passes the test suite on Windows and FreeBSD.

Freaky on 1 Aug 2019

❤1

Sorry for the force-pushes, clearing a few leftovers I missed.

Freaky on 1 Aug 2019

If you're interested, I created OsStr Bytes to solve this problem. It allows accessing the bytes of OsStr and OsString safely, without making assumptions about how they're represented.