Rust: Tracking issue for integer types conversion to and from byte arrays

Created on 2 Aug 2018  ·  40 comments  ·  Source: rust-lang/rust

A previous incarnation of this API went through FCP in https://github.com/rust-lang/rust/issues/49792, but its stabilization was reverted while still in Nightly in favor of this API.

Implemented in https://github.com/rust-lang/rust/pull/51919, conversions between all integer types and appropriately-sized arrays of u8, using the specified endianness:

impl $Int {
    pub fn to_ne_bytes(self) -> [u8; mem::size_of::<Self>()] {…}
    pub fn to_le_bytes(self) -> [u8; mem::size_of::<Self>()] {…}
    pub fn to_be_bytes(self) -> [u8; mem::size_of::<Self>()] {…}
    pub fn from_ne_bytes(bytes: [u8; mem::size_of::<Self>()]) -> Self {…}
    pub fn from_le_bytes(bytes: [u8; mem::size_of::<Self>()]) -> Self {…}
    pub fn from_be_bytes(bytes: [u8; mem::size_of::<Self>()]) -> Self {…}
}

ne, le, and be mean native-endian, little-endian, and big-endian respectively. Native-endian means the target platform’s endianness. The ne conversions are implemented as literally a transmute and nothing else (but safe). The other conversions either swap or do not swap the byte order, depending on the target’s endianness.
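For concreteness, here is a small sketch of how the methods above behave (the value is chosen arbitrarily for illustration):

```rust
fn main() {
    let n: u32 = 0x1234_5678;

    // Explicit endianness: the byte order is fixed regardless of platform.
    assert_eq!(n.to_be_bytes(), [0x12, 0x34, 0x56, 0x78]);
    assert_eq!(n.to_le_bytes(), [0x78, 0x56, 0x34, 0x12]);

    // Each from_*_bytes inverts the matching to_*_bytes.
    assert_eq!(u32::from_be_bytes(n.to_be_bytes()), n);
    assert_eq!(u32::from_le_bytes(n.to_le_bytes()), n);

    // Native endianness matches one of the other two, depending on the target.
    if cfg!(target_endian = "little") {
        assert_eq!(n.to_ne_bytes(), n.to_le_bytes());
    } else {
        assert_eq!(n.to_ne_bytes(), n.to_be_bytes());
    }
}
```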

B-unstable C-tracking-issue T-libs disposition-merge finished-final-comment-period

Most helpful comment

ne [...] means native-endian

I keep reading it as "not equal".

All 40 comments

I think this feature is ready for stabilization:

@rfcbot fcp merge

This is very soon after landing in Nightly in its current form, but the same functionality already went through FCP for stabilization in https://github.com/rust-lang/rust/issues/49792 after a couple months in Nightly. The new API contains methods identical to the previous API (only renamed), plus new methods that implement what was previously documented as a typical usage of the previous API.

So there isn’t really anything new here; I expect this final comment period to be a formality, except for a possible last-minute naming bikeshed. (The methods landed with consistent abbreviation; another option that was considered was full expansion, with for example to_native_endian_bytes and from_big_endian_bytes.)

Team member @SimonSapin has proposed to merge this. The next step is review by the rest of the tagged teams:

  • [x] @Kimundi
  • [x] @SimonSapin
  • [x] @alexcrichton
  • [x] @dtolnay
  • [ ] @sfackler
  • [ ] @withoutboats

Concerns:

Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

:bell: This is now entering its final comment period, as per the review above. :bell:

@rfcbot concern missing_impls

I just noticed that we are still missing half the implementation for this.

#51919 only added the changes for the signed integer types, not for the unsigned ones.

Compare:

https://doc.rust-lang.org/nightly/std/primitive.u8.html#method.to_bytes
https://doc.rust-lang.org/nightly/std/primitive.i8.html#method.to_ne_bytes

cc @tbu-

@Kimundi I believe https://github.com/rust-lang/rust/pull/53358 fixes your concern. (Could you review it?)

ne [...] means native-endian

I keep reading it as "not equal".

@shepmaster I agree that’s a real negative point. The other naming pattern that was considered (consistent full expansion) is rather verbose: from_native_endian_bytes, to_little_endian_bytes. Do you think it would still be preferable? Or perhaps some other set of names?

@SimonSapin: Yeah, seems to be all in order now.

@rfcbot resolved missing_impls

:bell: This is now entering its final comment period, as per the review above. :bell:

I'm here from reddit's _This Week in Rust 248_. If we're still doing bikeshedding: has it been considered to just use to_bytes and from_bytes for the native-endian names, _alongside_ the little- & big-endian names?

That would resolve the awkward names for the native endian case while still having the little & big endian versions.

EDIT: It was considered here.

@CasualX As discussed after the comment you point to, native-endian is usually not what you want, so we should probably not give those conversions shorter names that make them look like an appropriate default. At least that’s my opinion.

@SimonSapin I understand but... I can't help but feel a lack of _composition_ of these functions as they combine transmutation and interpretation of endianness. I feel unsure that trying to prevent misuse of the API is enough to overcome the elegance of composition.

I know it's a bit late for this but... I'm not sure if this could have been handled by a lint? Calling to/from bytes without immediately using one of the endianness interpretation functions to trigger a warning lint. Composition and nudging people to the right solution at the same time.

In my opinion we could just not bother with the native endianness versions and let people use things like the following if they really need native endianness for whatever reason.

let bytes = if cfg!(target_endian = "little") {
    x.to_le_bytes()
} else {
    x.to_be_bytes()
};

That way it's very clear something platform specific is happening.

I think Rust is still a low-level language, and re-interpreting an integer as bytes (converting an integer to native endian) is something it needs to be able to do. I think it would be better to keep {to,from}_ne_bytes in some form (names aside).

@CasualX I agree that this elegance of composition is tempting. It is why we landed this functionality with only native endian at first: https://github.com/rust-lang/rust/issues/49792. However in this case I find the elegance to be ruined by to_be, from_be (similarly _le) being just plain weird, with their input and output types being the same. Why do they both exist with different names (and slightly different interfaces) when their implementations are identical? I can use them in ways that are completely meaningless: u16::from_be(u16::from_be(1000).to_be().to_be().to_be()). When I want to use them "correctly", I need to pause and think carefully about the direction of the conversion in order to pick one. (Even though, again, they do the same thing.)
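To illustrate the point above, here is a small sketch contrasting the two APIs. (The meaningless chain below is a variant of the example in the comment; note that an even number of conditional swaps cancels out, so it even happens to round-trip on any target.)

```rust
fn main() {
    let x: u16 = 1000;

    // Old API: to_be and from_be both have type u16 -> u16, so this
    // completely meaningless chain type-checks without complaint.
    // (Four conditional swaps cancel out, so it happens to round-trip.)
    let nonsense = u16::from_be(x.to_be().to_be().to_be());
    assert_eq!(nonsense, x);

    // New API: the types force exactly one correct composition.
    let bytes: [u8; 2] = x.to_be_bytes();      // u16 -> [u8; 2]
    let back: u16 = u16::from_be_bytes(bytes); // [u8; 2] -> u16
    assert_eq!(back, x);
    // x.to_be_bytes().to_be_bytes(); // error: no such method on [u8; 2]
}
```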

If an API requires the crutch of a lint to not be a footgun, is it really that elegant? And for the few cases where native-endian is appropriate, requiring users to #[allow(…)] a lint is not great.

@ollie27 With some inlining, your code example becomes:

let bytes: [u8; $N] = if cfg!(target_endian = "little") {
    transmute::<_, [u8; _]>(
        #[cfg(target_endian = "little")]
        {
            x
        }
        #[cfg(not(target_endian = "little"))]
        {
            x.swap_bytes()
        }
    )
} else {
    transmute::<_, [u8; _]>(
        #[cfg(target_endian = "big")]
        {
            x
        }
        #[cfg(not(target_endian = "big"))]
        {
            x.swap_bytes()
        }
    )
};

… which is a rather convoluted way to do transmute(x). I agree with @tbu- that providing a direct safe wrapper for this specific transmute is desirable.

@SimonSapin

I have the impression that this is criticism of the endianness conversion functions in isolation, and I agree those can be tricky to get right. However, when composed with a conversion to and from bytes, the ambiguity goes away; there's only one way to compose these functions together, and that is the correct one:

// Given to_be/to_le and to_bytes, only one way to put them together:
let _ = word.to_be().to_bytes();
let _ = word.to_le().to_bytes();
// Given from_be/from_le and from_bytes, only one way to put them together:
let _ = u32::from_be(u32::from_bytes(bytes));
let _ = u32::from_le(u32::from_bytes(bytes));
// Cheeky :P
let _ = (u32::from_be . u32::from_bytes)(bytes);

You cannot get the le/be choice incorrect as that _is_ what you're trying to do and you cannot get the order between byte conversion and endian conversion wrong as the types don't line up. This can further be documented with examples in the API docs or lints.

I'm... not convinced linting is a crutch? Some fairly core aspects of Rust are encoded in lints, eg #[must_use].

Reading the previous conversations I take away the following conclusions:

  • Do not want to encourage incorrect usage of byte conversion without specifying endianness.
  • The unified functions are slightly shorter than composing two functions.
  • Loss of composition is a fairly heavy price to pay as it introduces 3x the number of functions.
  • Little endian has been ossified basically everywhere (low level network protocols being a notable exception).

I'm... not sure where I'm going with this, I feel the plight to encourage the correct answer, but the current API feels too forced in a world where the assumption of little endian is everywhere.

I do not have much skin in this game; the only place where I can use these APIs is in a crate reading immediates and offsets from a byte slice representing x86 instructions, which I now realize probably should have used explicit conversion from little endian...

in a world where the assumption of little endian is everywhere

I’m not convinced this is an assumption we can make when designing this API. Yes many (most ?) of the machines you will encounter use little-endian as their native endianness, but the code you write with this API might be working with a specific protocol or file format that specifies an endianness regardless of what is native on the machine.

In the previous world, you'd use to_be, to_le, from_be, from_le, to_bytes, from_bytes (6 functions).

Now you can use to_be_bytes, to_ne_bytes, to_le_bytes, from_be_bytes, from_ne_bytes, from_le_bytes (6 functions). I recognize that {to,from}_{be,le} are not going away, but they're part of the old world. :)

The only reason why to_be etc. exist, is because integers are sometimes viewed as byte containers in system programming languages. If we get rid of this notion (e.g. look at Python, where integers don't have a fixed size), then it becomes clear that the function to_be doesn't really work anymore. What would it mean to put an integer into 'little endian', if viewed as integer, and not as byte container.

The final comment period, with a disposition to merge, as per the review above, is now complete.

@tbu-

My calculation is slightly different:

You have two categories of functions: to_be | to_le composed with to_bytes, and from_bytes composed with from_be | from_le. It's not simply '6' functions, but a deeper hierarchy of composable functions.

Because the functions are multiplied for every integer type, I get a total of 2 extra functions per integer type as opposed to adding 6 extra functions. This _really_ adds up for a total of 12 integer types...

The bigger point you're making (I think) is that you consider the old functions fundamentally 'meaningless' by comparison with Python, and I agree: in the context of Python these functions could not be defined, as they would be meaningless there. But this is Rust, not Python, and the distinct choice of _having_ fixed-width integer types for all sizes necessarily implies that endianness is a thing that exists and shouldn't be shoved under the rug for the sake of abstraction.

And here I think I have a different view about when things should be abstracted away; coming from a C/C++ background, endianness has always been an intrinsic part of the integer types (granted, one you can pretend doesn't exist most of the time), but that doesn't mean it deserves to be hidden away.

One final example is a technique where the contents of a byte stream are cast to a #[repr(C)] struct, whose members are then read/written; in this scenario the separate endianness conversion functions are needed. Note that this case is entirely absent from Python (and languages like it), which require a different abstraction to deal with it and where only the combination of endianness conversion and byte conversion makes sense, but this is less so the case in Rust.

@SimonSapin

I feel like at this point in the discussion we've reached a point where we actually mostly agree with each other and the difference in our positions is just a small shift in the weights we assigned to the importance of some factors in the argumentation which lead to different conclusions.

I'm... not sure how to continue here? I've brought up the arguments that I find important so I'll leave it at that.

I'm... not sure how to continue here?

At this point a form of this feature is implemented in Nightly, FCP to stabilize was proposed, a quorum of team members approved, and the official final comment period has elapsed. My opinion of the discussion here is that it doesn’t really make new arguments or another proposal, or provide new information that hasn’t already been discussed in https://github.com/rust-lang/rust/issues/49792 or https://github.com/rust-lang/rust/pull/51919. As such I feel that there isn’t a large enough concern with this API that we should block it further, and that the way to continue here is a stabilization PR.

Does this sound fair? Or does someone feel more strongly than I realized that we should make further changes to this API?

That sounds fair, thanks for indulging my commentary.

I didn't see it mentioned in this thread, so just a small comment about the naming of ne – before reading the description, I immediately thought of this as "network endianness" (which is basically synonymous with be), only to learn later that it was meant to be "native endianness". Perhaps we can use he for "host endianness" to avoid ambiguity? This also plays nicely with htons- and ntohs-like names. Although h and b are quite similar, so this choice has its flaws.

Ok, this is the second (different) misinterpretation of ne (the other being "not equal") reported in this thread, so maybe we should do something about it? As mentioned previously, I think that this meaning of "host" only makes sense in the context of networking, but these APIs are not necessarily networking-related. Other than that, maybe he is not too bad.

@rust-lang/libs, any thoughts?

Host (he) will give the wrong association if interpreted in the cross compilation context (host vs target). Making it he might therefore also be confusing. I think this was mentioned in some previous discussion?

If the name is being changed, it should preferably be to something that will resolve all ambiguities, not trade some ambiguities for different ones. Due to the abbreviation, it is already expected that users will have to look up the function in the docs the first time they use it. Because of this, less idiomatic alternatives might be OK. In these situations there are also arguments for the idiosyncratic alternative being better, as it makes it difficult to assume things, forcing users to look it up in the docs.

What about:

  • me - machine endianness (might look too similar to ne?)
  • ae - architecture endianness
  • te - target endianness (not common to speak of target unless cross compiling as target==host otherwise)
  • pe - processor/platform endianness
  • cpue - quite descriptive, but doesn't visually look like be or ne. It might make sense that the longest alternative is the one you're least likely to use (most prone to bugs due to platform-specific behavior)

"Host" makes the wrong kind of sense in cross-compilation context, since this API is using the target machine’s endianness with #[cfg(not(target_endian = "…"))]

"Host" makes the wrong kind of sense in cross-compilation context, since this API is using the target machine’s endianness with #[cfg(not(target_endian = "…"))]

Yes this was my exact concern. That's why i suggested to change it to something other than he if a change was going to happen. I see now that my comment might be interpreted the other way as well, I'll do a minor edit to fix that.

If there's confusion about the short names, then I'd personally err on the side of wordier names and expand he out to host_endian (and similarly for the other names)

aww yis. Excited for this one. It makes some clean looking code.

    let magic_bytes = header.magic.to_ne_bytes();
    if magic_bytes != ['h', 'p', 'k', 'r'] {

@kallisti5 Did you mean to write byte character literals instead of chars? Assuming they're meant to be byte literal characters you can go even further and do it like this ;)

    let magic_bytes = header.magic.to_ne_bytes();
    if magic_bytes != *b"hpkr" {

Not sure if it's too late, but is there a reason the from routines take in an array rather than a slice? Taking a random chunk out of a slice and converting it to a number would be rather clunky, as it would have to be copied to an intermediate fixed-size array for holding.

In terms of performance, a slice version would only have a runtime cost if used with an actual slice; since it's inlined, the bounds check would be omitted when called with a fixed-size array.

I ran into this today and it was a bit annoying.

take in an array rather than a slice

By taking in an array, you are statically guaranteed that there's enough data to make the transformation — you don't have to even allow for the error case.

Taking a random chunk out of a slice and converting it to a number would be rather clunky as it would have to be copied to an intermediate fixed size array for holding.

With TryFrom / TryInto, it's not that bad:

#![feature(int_to_from_bytes, try_from)]

use std::convert::TryInto;

fn example(bytes: &[u8]) -> i32 {
    let bytes: &[_; 4] = bytes.try_into().expect("Wrong number of bytes");
    i32::from_be_bytes(*bytes)
}

Yeah, I agree, but it seems that a slice is just a superset of the abilities of an array.

While I agree it's cleaner and makes more sense in a standalone implementation, I don't know of many situations where something I need to convert into a number did not come from a bigger vector or slice. At least for me, I don't ever see using this functionality without a wrapper along the lines of the snippet you posted.

I guess I'm not sure how often you are dealing with packed data and are working with fixed sized arrays as the source of this data is almost always network or file streams. The only place I work with fixed sized arrays when working with files/network is when I'm casting to a structure, and at that point I'm just going to directly cast it to the u64 and skip the middle step.

The TryInto/TryFrom implementation you mention is exactly what I expose as a basic interface in most of my tiny applications that don't warrant my safecast library. In this case the new from_*_bytes() interfaces provide no improvement in usability compared to what I already have, as I would have to write the example wrapper you posted above, which is of the same complexity and lesser portability than my existing version (lesser portability due to relying on another experimental feature).

I have a library https://github.com/gamozolabs/safecast which allows me to safely cast or copy things with runtime checks if needed. I pretty much must use this library in all of my Rust projects as almost everything I do is working with network/file/system level packed data.

To me this just happens to be a very sensitive issue as operating on binary data from safe Rust is incredibly clunky, and is the last major issue I have personally with the language. While crates like byteorder exist I do not believe a systems language should require a third party piece of code to be able to manipulate binary data in a safe way.

No hard feelings, just trying to chime in about the usability of this API from my use cases (kernel development / fuzzer development). Many of my peers (other security researchers) who also work on developing code to parse/mutate/extract information from binary packed data have had similar gripes and ultimately a few have refused to use the language for these purposes as it's not feasible for codebases that largely operate on raw binary data.

I likely will propose an RFC along the lines of my safecast library which allows for safe copying and dynamically-checked casting of PoD types that would just be exposed as another derive for structures/enums.

-B

As shepmaster suggested, what does an API taking a slice do if the input is too short? Too long? Does it panic, or ignore extra items, or do we change the return type to a Result? Taking an array avoids this question, and the concerns of how to obtain that array can be separated and left to other APIs like TryFrom.

In my opinion the main goal of this API is to replace some unsafe calls to transmute with safe method calls. It doesn’t by itself need to replace byteorder or safecast entirely, and I think it’s fine to both have this and later add some higher-level convenience APIs to the standard library.
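As a sketch of how such a higher-level convenience could be layered on top of the array-taking methods using TryInto (the helper name read_u32_be is hypothetical, not part of any proposed API):

```rust
use std::convert::TryInto;

// Hypothetical helper: read a big-endian u32 from the front of a slice,
// returning None instead of panicking when the slice is too short.
fn read_u32_be(bytes: &[u8]) -> Option<u32> {
    let head: [u8; 4] = bytes.get(..4)?.try_into().ok()?;
    Some(u32::from_be_bytes(head))
}

fn main() {
    assert_eq!(read_u32_be(&[0x12, 0x34, 0x56, 0x78, 0xff]), Some(0x1234_5678));
    assert_eq!(read_u32_be(&[0x12, 0x34]), None); // too short: explicit error case
}
```

This keeps the "how to obtain the array" question (and its error handling) out of the core conversion methods themselves, as argued above.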

I'm not familiar with exactly what #[inline] means in the compiler, but I notice that byteorder also uses it, and yet I think I've seen cases where I still ended up paying the cost of bounds checks.

#[inline] is documented at https://doc.rust-lang.org/reference/attributes.html#inline-attribute

Right, I meant to say, I'm not sure what makes the compiler decide when to follow the hint and when not to. I thought I'd seen an issue with byteorder::LittleEndian::read_u64 not eliding the bounds checks when it could've, but now that I look more closely at my own code, it looks like it's actually read_u64_into (which copies a bunch of ints at once) that's somehow failing to elide. So maybe not relevant to our discussion here.

FCP to merge completed a while ago, but there have been a number of additional comments since. However, I believe they do not give enough reason to change the API. If you disagree, now is (past) the time to say so. Here is a stabilization PR: https://github.com/rust-lang/rust/pull/56207

As shepmaster suggested, what does an API taking a slice do if the input is too short? Too long? Does it panic, or ignore extra items, or do we change the return type to a Result? Taking an array avoids this question, and the concerns of how to obtain that array can be separated and left to other APIs like TryFrom.

Note that the same question came up with the API for char::encode_utf8, which was originally char::encode_utf8(&mut [u8; 4]) -> usize and was later changed to its current char::encode_utf8(&mut [u8]) -> &str (panicking if the buffer is not large enough) for convenience reasons.

Not that I have any preference either way.

https://github.com/rust-lang/rust/pull/56216 will hopefully make it easier to use these APIs with slices.
