Several issues have been filed about surprising behavior of NaNs.
- `0.0 / 0.0` changed depending on whether the right-hand side came from a function argument or a literal.
- `f32::from_bits(x).to_bits()` was not always equal to `x`.

The root cause of these issues is that LLVM does not guarantee that NaN payload bits are preserved. Empirically, this applies to the signaling/quiet bit as well as (surprisingly) the sign bit. At least one LLVM developer seems open to changing this, although doing so may not be easy.
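To make the first symptom concrete, here is a minimal sketch (my own, not code from the filed issues; the helper names and the `black_box` barrier are illustrative) of how the same `0.0 / 0.0` can take two different paths:

```rust
use std::hint::black_box;

// Forced through a runtime division; black_box keeps LLVM from folding it.
fn runtime_div(a: f32, b: f32) -> f32 {
    black_box(a) / black_box(b)
}

// Likely const-folded by LLVM at compile time.
fn folded_div() -> f32 {
    0.0f32 / 0.0
}

// Both results are NaN, but nothing guarantees their payload or sign bits
// match: one pattern comes from LLVM's constant folder, the other from the CPU.
fn nan_bits_pair() -> (u32, u32) {
    (folded_div().to_bits(), runtime_div(0.0, 0.0).to_bits())
}
```

Printing the two halves of `nan_bits_pair()` on an affected target is exactly the discrepancy the issues describe.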
Unless we are prepared to guarantee more, we should do a better job of documenting that, besides having all 1s in the exponent and a non-zero significand, the bitwise value of a NaN is unspecified and may change at any point during program execution. In particular, the from_bits method on f32 and f64 types currently states:
This is currently identical to transmute::<u32, f32>(v) on all platforms.
and
this implementation favors preserving the exact bits. This means that any payloads encoded in NaNs will be preserved
These statements are misleading and should be changed.
We may also want to add documentation to {f32,f64}::NAN to this effect, see https://github.com/rust-lang/rust/issues/52897#issuecomment-496672336.
cc #10186?
This also affects the documentation for the methods in #72568.
@ecstatic-morse wrote elsewhere
Indeed. The underlying cause is clear. I wonder what we should do here, though? Does Rust currently guarantee that extended precision is not used for operations on f64? If so, this is technically a miscompilation. However, I don't know whether it's worth fixing. Maybe we should just document the status quo and move on?
I don't think we can easily just "move on" -- as mentioned here, what LLVM currently does seems incoherent and is likely just plain unsound (but miscompilations are hard to trigger). In that sense this is similar to https://github.com/rust-lang/rust/issues/28728: LLVM in its current state makes it impossible to build a safe language on top of it with reasonable effort, which means fixing this will be a lot of work, but from a Rust perspective that's nevertheless a critical soundness bug.
Cc @rust-lang/lang
That issue does not involve NaN, and that comment is not applicable here.
Fair. But I feel https://github.com/rust-lang/rust/issues/72327 is related in the broader sense of "our FP semantics are a mess". Looks like we actually have two problems here:
I created https://github.com/rust-lang/unsafe-code-guidelines/issues/237 to collect FP issues. That's indeed off-topic here, sorry for that.
Related LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=45152
Unless we are prepared to guarantee more, ... the bitwise value of a NaN is unspecified and may change at any point during program execution
This seems... way too conservative. I know it's trying to make the best of a bad situation, and I'm sympathetic here, but please realize how hard overly broad unspecified behavior like this makes it to write robust code. (As a user of Rust who came to it from C, this feels like the same kind of undefined behavior you see in the C standard in cases where all supported platforms disagree.)
So, my biggest concern is non-Wasm platforms. I think it would really be a huge blow to working with floats in Rust to have effectively zero guarantees around NaN. I don't really know a good solution here, but even just marking it as an LLVM bug on the problematic platforms (rather than deciding that this isn't a thing that Rust code gets to rely on, ever) would be much better.
Just as an example, if NaN payload is totally unspecified and may change at any point, implementing any ordering stronger than PartialEq for floats is impossible (including https://github.com/rust-lang/rust/issues/72599), as you cannot count on NaN bitwise values to be stable across two calls of to_bits() on the same float.
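The kind of total order at stake can be sketched via to_bits() (this is essentially the approach proposed in #72599; the exact implementation below is my own). It is correct only if two to_bits() calls on the same float are guaranteed to agree:

```rust
use std::cmp::Ordering;

// A sketch of an IEEE 754 totalOrder comparison built on to_bits(). Its
// correctness hinges on NaN bits being stable: if a NaN's payload could
// change between two to_bits() calls, this would not even be reflexive.
fn total_cmp(a: f32, b: f32) -> Ordering {
    let mut x = a.to_bits() as i32;
    let mut y = b.to_bits() as i32;
    // Map the sign-magnitude float encoding onto a monotonic integer order:
    // for negative values, flip all bits below the sign bit.
    x ^= (((x >> 31) as u32) >> 1) as i32;
    y ^= (((y >> 31) as u32) >> 1) as i32;
    x.cmp(&y)
}
```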
Same goes for things that stash an f32 in a u32 and then expect to get it out again the same (for example, I implemented an AtomicF32 at one point on top of AtomicU32 + from_bits/to_bits). If I can't rely on stable bit values going from float => u32, things like compare_exchange loops are no longer guaranteed to ever terminate.
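For reference, a minimal sketch of that AtomicF32 pattern (the type and method names here are mine, not from any particular crate). The CAS loop only terminates if to_bits() is stable, because compare_exchange matches on the exact bits:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Sketch of an AtomicF32 over AtomicU32 via from_bits/to_bits. If a NaN's
// bits could change between the load and the compare_exchange, `current`
// would never match the stored value and the loop might never terminate.
pub struct AtomicF32(AtomicU32);

impl AtomicF32 {
    pub fn new(v: f32) -> Self {
        AtomicF32(AtomicU32::new(v.to_bits()))
    }

    pub fn load(&self, order: Ordering) -> f32 {
        f32::from_bits(self.0.load(order))
    }

    pub fn fetch_add(&self, val: f32, order: Ordering) -> f32 {
        let mut current = self.0.load(Ordering::Relaxed);
        loop {
            let new = (f32::from_bits(current) + val).to_bits();
            match self
                .0
                .compare_exchange_weak(current, new, order, Ordering::Relaxed)
            {
                Ok(prev) => return f32::from_bits(prev),
                Err(observed) => current = observed,
            }
        }
    }
}
```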
That said, I also think "totally unspecified behavior" is too conservative on Wasm, too — I've done a bit of poking and it seems like the behavior is a lot more sane than suggested, although it does violate IEEE 754 and is probably not 100% intentional.
Basically: LLVM's behavior here is inherited from the wasm/js runtime, which canonicalizes NaNs whenever going from bits => float, as it wants to be able to guarantee certain things about which bit patterns are possibly in the float — certain NaNs are off limits.
That means:
- f32::from_bits(x).to_bits() round-trip failure

This is non-ideal but is still way easier to reason about and build on top of than arbitrary unspecified behavior.
Yeah, that's the basic gist of my thoughts. Changing the documented guarantees of from_bits/to_bits globally like that would totally neuter those APIs. I'm sympathetic to the position you're in and to not having great choices, but that kind of change feels very much like the wrong call, and making the call be this kind of unspecified behavior feels really bad on any platform...
P.S. I accidentally posted an incomplete version of this comment by hitting ctrl+enter in the github text box, sorry if you saw that — really should just do these in a text editor first.
I am open to better suggestions. I know hardly anything about floating point semantics, so "totally unspecified" is an easy and obviously "correct" choice for me to reach for. If someone with more in-depth knowledge can produce a spec that is consistent with LLVM behavior, I am sure this can be improved upon.
However, the core spec of Rust must be platform-independent, so unless we consider this a platform bug (which I think is what we do with the x87-induced issues on i686), whatever the spec is has to encompass all platforms.
In principle, certain platforms can decide to guarantee more than others, but that is a dangerous game as it risks code inadvertently becoming non-portable in the worst possible way -- usually "non-portable" means "fails to build on other platforms", whereas here it would silently change behavior. Maybe we can handle this in a way similar to endianness, although the situation feels different.
And all of this is assuming that we can get LLVM to commit to preserving NaN payloads on these platforms. You are saying that this issue only affects wasm(-like) targets, but is there a document where LLVM otherwise makes stronger guarantees? The fact that issues have only been observed on these platforms does not help; we need an explicit statement from LLVM to establish and maintain this guarantee in the future.
Just as an example, if NaN payload is totally unspecified and may change at any point, implementing any ordering stronger than PartialEq for floats is impossible (including #72599), as you cannot count on NaN bitwise values to be stable across two calls of to_bits() on the same float.
So if I understand correctly, on wasm, the float => bit cast that is inherent in such a total order would canonicalize NaNs. This on its own is not a problem as this is a stable canonicalization, and that's why you think "unstable NaNs" are too broad. Is that accurate?
However, when you combine that with LLVM optimizing away "bit => float => bit" roundtrips (does it do that?), then this already brings us into an unstable situation. Some of the comparisons might have that optimization applied to them, and others not, so suddenly the same float (obtained via a bit => float cast) can compare in two different ways.
It is easy to make a target language spec such as wasm self-consistent, but to do the same on a heavily optimized IR like LLVM's or surface language like Rust is much harder.
So if I understand correctly, on wasm, the float => bit cast that is inherent in such a total order would canonicalize NaNs.
No, float => bit should always* be stable, it's bit => float that canonicalizes. This means it's possible to implement a robust totalOrder without issues on Wasm (just not if all nan payloads are unspecified values which may change at any time).
My point with that paragraph was not that the LLVM behavior is bad (although I am not a fan), but that changing Rust's guarantees to: "the bitwise value of a NaN is unspecified and may change at any point during program execution" is both
* (always... except for what I say in my next response)
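As a model of the bit => float canonicalization being described (this is my own sketch, not code from the Wasm spec; the canonical quiet-NaN pattern 0x7FC0_0000 is an assumption of the model):

```rust
// Sketch of the described Wasm behavior: reinterpreting bits as a float
// canonicalizes NaNs (sign and payload discarded), while float => bits
// is stable. Every NaN collapses to one canonical quiet NaN.
fn wasm_like_from_bits(bits: u32) -> f32 {
    let f = f32::from_bits(bits);
    if f.is_nan() {
        f32::from_bits(0x7FC0_0000) // the assumed canonical NaN pattern
    } else {
        f
    }
}
```

Under this model, from_bits(x).to_bits() fails to round-trip for non-canonical NaNs, but two to_bits() calls on the same float still agree, which is why a totalOrder would remain implementable.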
However, when you combine that with LLVM optimizing away "bit => float => bit" round-trips (does it do that?)
I don't know if it does it on Wasm, but it's obviously free to do this on non-Wasm platforms (and I think I've seen it there, but it's hard to say and I don't have code I'm thinking of on hand).
I'd hope it wouldn't do this on Wasm, and would argue that if it does optimize that away it's an LLVM bug for that platform, but... yeah. Possible.
unless we consider this a platform bug (which I think is what we do with the x87-induced issues on i686)
Honestly that seems like the sanest decision to me, since the alternative is essentially saying that Rust code can't expect IEEE754-compliant floats anymore. And so, I think x87 is a good example because it's also an example of non-IEEE754 compliance, although probably a less annoying one in practice.
Concretely, I wouldn't have complained about this at all if it were listed as a platform bug.
Instead, my issue is entirely with all compliant Rust code losing the ability to reason about float binary layout, which has been extremely useful in stuff like scientific computing, game development, programming language runtimes, math libraries, ... All things Rust is well suited to do, by design.
This wouldn't cripple those by any means, but it would make things worse for several of them.
Admittedly, in practice, unless it's flat out UB, I suspect people will just code to their target and not to the spec, which isn't great either, but honestly to me it feels like it might be better than Rust genuinely inheriting this limitation from the web platform.
(Ironically, this would also prevent writing a runtime in Rust that does the optimization which is the reason Wasm and JS runtimes want to canonicalize their NaNs. Although that optimization was already fairly unportable anyway)
No, float => bit should always* be stable, it's bit => float that canonicalizes.
Oh I see... but that is not observable until you cast back? Or does wasm permit transmutation, like writing a float into memory and reading it back as an int without doing an explicit cast? (IIRC their memory is int-only so you'd have to cast before writing, but I might misremember.)
I don't know if it does it on Wasm, but it's obviously free to do this on non-Wasm platforms (and I think I've seen it there, but it's hard to say and I don't have code I'm thinking of on hand).
I'd hope it wouldn't do this on Wasm, and would argue that if it does optimize that away it's an LLVM bug for that platform, but... yeah. Possible.
Whether it can do that or not depends solely on the semantics of LLVM IR, which (as far as I know) are not affected by whether you are compiling to Wasm or not. That is the entire point of having a single uniform IR.
There is no good way to make optimizations in a highly optimized language like Rust or LLVM IR depend on target behavior -- given how they interact with all the other optimizations, that is basically guaranteed to introduce contradicting assumptions.
Also, I don't think there is much point in discussing what we wish LLVM would do. We first need to figure out what it is doing.
(Ironically, this would also prevent writing a runtime in Rust that does the optimization which is the reason Wasm and JS runtimes want to canonicalize their NaNs. Although that optimization was already fairly unportable anyway)
Ah, but this is getting to the heart of the problem -- what if you implement a wasm runtime in Rust which uses this optimization, and compile that to wasm? Clearly that cannot work as the host wasm is already "using those bits". So, it is fundamentally impossible to have a semantics that achieves all of
Instead, my issue is entirely with all compliant Rust code losing the ability to reason about float binary layout, which has been extremely useful in stuff like scientific computing, game development, programming language runtimes, math libraries, ... All things Rust is well suited to do, by design.
I do feel like it is slightly exaggerated to say that all these use cases rely on stable NaN payloads. That said, there seems to be a fundamental conflict here between having a good cross-platform story (consistent semantics everywhere) and supporting low-level floating point manipulation. FP behavior is just not consistent enough across platforms.
However, note that not just wasm has strange NaN behavior. We also have some bugs affecting x86_64: https://github.com/rust-lang/rust/issues/55131, https://github.com/rust-lang/rust/issues/69532. Both (I think) stem from the LLVM constant propagator (in one case its port to Rust) producing different NaN payloads than real CPUs. This means that if we guarantee stable NaN payloads in x86_64, we have to stop const-propagating unless all CPUs have consistent NaN payload (and then the const propagator needs to be fixed to match that).
So until LLVM commits to preserving NaN payloads on some targets, there is little we can do. It seems people already rely on that when compiling wasm runtimes in LLVM that use the NaN optimization, so maybe it would not be too hard to convince LLVM to commit to that?
That is the entire point of having a single uniform IR.
This isn't really right though, is it? LLVM-IR includes tons of platform-specific information. The fact that making LLVM-IR cross-platform was non-viable was even part of the motivation behind Wasm's current design.
From the other issue:
A less drastic alternative is to say that every single FP operation (arithmetic and intrinsics and whatnot, but not copying), when it returns a NaN, non-deterministically picks any NaN representation.
This would be totally fine with me FWIW — as soon as you do arithmetic on NaN all portability is out the window in practice and in theory. My concern is largely with stuff like:
Stuff like https://searchfox.org/mozilla-central/source/js/rust/src/jsval.rs suddenly breaking — just a file I remember from my last job as doing stuff that depends on this.
APIs like https://doc.rust-lang.org/core/arch/x86_64/fn._mm_cmpeq_ps.html being in a limbo where nothing guarantees that it works... even though it obviously must work or is a compiler bug.
For context here: this API is one of many SIMD intrinsic apis where you have shortlived NaNs in float vectors where the payload is very important.
Specifically this function will return a float vector (yes, float — __m128i would be the type for an int vector) with an all-bits-set f32 for every slot where the comparison succeeded. One of the ways you're intended to use the result is as a bitmask, to find the elements where the comparison succeeded/failed.
Since all-bits-set is a NaN with a specific payload, this requires that the payload be preserved here.
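To make the intrinsic example concrete, here is a sketch (the x86_64 gating and the non-x86_64 stand-in are mine, added so the snippet compiles everywhere):

```rust
// _mm_cmpeq_ps produces an all-bits-set f32 lane (a NaN!) where the
// comparison holds, and zero elsewhere; reading the result back as
// integers relies on those NaN bits surviving inside a float vector.
#[cfg(target_arch = "x86_64")]
fn cmpeq_mask() -> [u32; 4] {
    use std::arch::x86_64::*;
    unsafe {
        let a = _mm_setr_ps(1.0, 2.0, 3.0, 4.0);
        let b = _mm_setr_ps(1.0, 0.0, 3.0, 0.0);
        let mask = _mm_cmpeq_ps(a, b); // lanes 0 and 2 compare equal
        let mut out = [0u32; 4];
        _mm_storeu_ps(out.as_mut_ptr().cast::<f32>(), mask);
        out
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn cmpeq_mask() -> [u32; 4] {
    // Stand-in so this sketch compiles on other targets; mirrors the
    // result the SSE intrinsics produce above.
    [u32::MAX, 0, u32::MAX, 0]
}
```

If a NaN-canonicalizing step were applied anywhere between the compare and the store, the 0xFFFFFFFF mask lanes would be silently rewritten and the bitmask idiom would break.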
So, while I just gave you two examples of very much non-portable code...
(core::arch broken — even if portable SIMD is on the way.)

My big concern still comes back to the notion that these payloads are "unspecified values which may change at any time" according to Rust. The way I interpret that, and the general feeling of this conversation, is that there's no guarantee that target-specific things like these will work reliably even on the target in question.
I do feel like it is slightly exaggerated to say that all these use cases rely on stable NaN payloads
That's why I said "This wouldn't cripple those by any means", although honestly the SIMD stuff would be pretty bad if it were actually broken.
I also fully expect those cases to blindly continue doing things to NaN non-portably (and possibly non-deterministically).
This means that if we guarantee stable NaN payloads in x86_64, we have to stop const-propagating unless all CPUs have consistent NaN payload (and then the const propagator needs to be fixed to match that).
This is surprising, because I thought it was the whole point of LLVM's APFloat code (which even goes as far as to support like the horrible PowerPC long double type...). That said, it's not like I can argue with facts, if those bugs are happening, then they're happening... But are we sure those aren't just normal bugs in LLVM?
That said the only reason I wouldn't be willing to say "I don't care that much about what happens to NaN during const prop" is that you can't know when LLVM will happen to see enough to do more const prop.
That said, it seems totally unreasonable and very fragile to me to rely on things like:
That stuff is totally nonportable (IEEE754 recommends but doesn't require any of it) and unreliable both at compile time and at runtime. Again, my concern is more unexpected fallout here in stuff that expects NaN to go through smoothly.
Just took a peek at https://webassembly.github.io/spec/core/exec/numerics.html (and elsewhere in the spec) and regret not doing so sooner. In particular, there's a lot of mention on when canonicalization can happen, but none of the places are on load/reinterpret.
And so what's in there is pretty close to the suggestion you had earlier (the "less drastic alternative")... and to what I suggested as the things that are totally nonportable.
And, it also definitely contradicts what I said before about when canonicalization happens (which mirrored what happened in asm.js, what I seemed to see in my testing earlier, and would have explained from_bits(x).to_bits() not round-tripping... but maybe all of it could be the "native doubles used in LLVM MC code" bug? Needs more investigation). That said, this would make things a lot more tractable, since it brings Wasm up to par as a compliant IEEE 754 implementation, and (if true) just points the blame at LLVM for messing up...
Which would also (maybe?) explain why the bugs happen on all platforms?
...
Ugh, this is still a bit jumbled, sorry — some of this needs to be unified and reordered, and the discrepancy needs more digging, but I have to run, unfortunately.
This isn't really right though, is it? LLVM-IR includes tons of platform-specific information. The fact that making LLVM-IR cross-platform was non-viable was even part of the motivation behind Wasm's current design.
It makes many platform-specific things such as pointer sizes etc explicit. But that is very different from an implicit change in behavior.
Your proposal would basically require many optimizations to have code like if (wasm) { one_thing; } else { another_thing; }. I do not think such code is common in LLVM today, if it exists at all. It is also very fragile as it is easy to forget to add this in all the right places. In contrast, the explicit reification of layout everywhere is impossible to ignore.
And this would affect many optimizations, as it makes floating-point operations and/or casts non-deterministic, which is a side effect! So everything that treats them as pure operations needs to be adjusted.
From the other issue:
There's like 5 other issues, which one do you mean?^^ You are quoting this comment I think.
This would be totally fine with me FWIW — as soon as you do arithmetic on NaN all portability is out the window in practice and in theory.
(This was for making FP operations pick arbitrary NaNs.)
The problem is that this makes them non-deterministic. So e.g. if you have code like
let f = f1 / f2;
function(f, f);
then you are no longer allowed to "inline" the definition of f in both places, as that would change the function arguments from two values with definitely the same NaN payload to potentially different NaN payloads.
However, maybe we can make it deterministic but unspecified? As in, after each floating-point operation, if the result is NaN, something unspecified happens with the NaN bits, but given the same inputs there will definitely always be the same output?
The main issue with this is that it means that const-prop must exactly reproduce those NaN patterns (or refuse to const-prop if the result is a NaN).
My concern is largely with stuff like:
So is it the case that all that code would be okay with FP operations clobbering NaN bits?
My big concern still comes back to the notion that these payloads are "unspecified values which may change at any time" according to Rust.
Rust will probably just do whatever LLVM does, once they make up their mind and commit to a fixed and precise semantics. I think you are barking up the wrong tree here, I don't like unspecified values any more than you do. ;) I am just trying to come up with a consistent way to describe LLVM's behavior.
I'm a theoretical PL researcher, so that's something I have experience with that I am happy to lend here -- define a semantics that is consistent with optimizations and compilation to lower-level targets. However, not knowing much about floating-point makes this harder for me than it is for other topics. So I am relying on people like you to gather up the constraints to make sure the resulting semantics is not just consistent with LLVM but also useful. ;) It might turn out that that's impossible, in which case we can hopefully convince LLVM to change.
This is surprising, because I thought it was the whole point of LLVM's APFloat code (which even goes as far as to support like the horrible PowerPC long double type...). That said, it's not like I can argue with facts, if those bugs are happening, then they're happening... But are we sure those aren't just normal bugs in LLVM?
They might well be bugs! Since you seem to know a lot about floating-point, it would be great if you could help figure that out. :)
That said the only reason I wouldn't be willing to say "I don't care that much about what happens to NaN during const prop" is that you can't know when LLVM will happen to see enough to do more const prop.
Right, that's exactly the point -- const-prop must not change what the program does. So either it must produce the exact same results as hardware, or else we have to say that the involved operation is non-deterministic.
Just took a peek at https://webassembly.github.io/spec/core/exec/numerics.html (and elsewhere in the spec) and regret not doing so sooner. In particular, there's a lot of mention on when canonicalization can happen, but none of the places are on load/reinterpret.
So what is the executive summary?
A quick glance shows that these operations are definitely non-deterministic. So scratch all I said about this above, this basically forces LLVM to never ever duplicate floating-point instructions. Any proposals for (a) figuring out if they are doing this right and (b) documenting this in the LLVM LangRef to make sure they are aware of the problem?
@ecstatic-morse you listed https://github.com/rust-lang/rust/issues/73288 in the original issue here, but isn't that a different problem? Namely, this issue here is about NaN bits in general, whereas #73288 is specific to i686 and thus seems more related to https://github.com/rust-lang/rust/issues/72327. (I don't think we have a meta-issue for "x87 floating point problems", but maybe we should.)
As an aside, I will note that "Unless we are prepared to guarantee more" was doing a lot of work in the OP. I'd be very happy if we came up with a stricter set of semantics that we can support across tier 1 platforms (possibly exempting 32-bit x86) and implemented them. However, doing so will require a non-trivial amount of work, much of it on the LLVM side. I think that, in the meantime, we should explicitly state where we currently fall short in the documentation of affected APIs, similar to #10184. That's what this issue is about.
Also, look out for my latest crate, AtomicNanCanonicalizingF32, on crates.io.
#72327 affects only i586 targets (x86 without SSE2). This is a tier 2 platform, and the last x86 processor without SSE2 left the plant about 20 years ago, so I would have no problem exempting it from whatever guarantees around NaN payloads we wish to make. However, #73288 affects i686 (the latest 32-bit x86 target) as well, which is tier 1. Obviously, we could (and maybe should) exempt all 32-bit x86 targets from the NaN payload guarantees, but I consider #73288 to be of greater importance than issues only affecting i586.
Wait, so there's x87-specific bugs even when using SSE2? :cry: and here I was thinking that SSE2 solves the i586 mess.
Yes. The x86 calling convention mandates that floating point values are returned on the FPU stack. Values on the FPU stack are extended-precision, so storing them into an 8-byte f64 involves truncation and thus is an "arithmetic operation", which canonicalizes NaNs, according to the x86 manual.
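A sketch of how that can bite (the particular signaling-NaN bit pattern and the `#[inline(never)]` call barrier below are my choices for illustration):

```rust
// On 32-bit x86, an f64 return value travels via the 80-bit x87 stack;
// storing it back to 64 bits is treated as an arithmetic operation,
// which quiets signaling NaNs. On x86_64 the value returns in an SSE
// register and the bits survive unchanged.
#[inline(never)]
fn through_a_call(x: f64) -> f64 {
    x
}

const SNAN_BITS: u64 = 0x7FF0_0000_0000_0001; // a signaling NaN

// On i686 this may come back as 0x7FF8_0000_0000_0001 (quiet bit set).
fn returned_bits() -> u64 {
    through_a_call(f64::from_bits(SNAN_BITS)).to_bits()
}
```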
However, #73288 affects i686 (the latest 32-bit x86 target) as well, which is tier 1.
I said this in Zulip, but it probably belonged here.
I came across https://github.com/WebAssembly/design/blob/master/Rationale.md#nan-bit-pattern-nondeterminism (and also see https://github.com/WebAssembly/design/blob/master/Nondeterminism.md), which is interesting.
IEEE 754-2008 6.2 says that instructions returning a NaN should return one of their input NaNs. In WebAssembly, implementations may do this, however they are not required to. Since IEEE 754-2008 states this as a "should" (as opposed to a "shall"), it isn't a requirement for IEEE 754-2008 conformance.
This answers a lot of questions for https://github.com/rust-lang/rust/issues/73328
Specifically, the way it works is: certain instructions are not guaranteed to preserve the payload bit pattern (in practice you can't rely on this portably, so it seems fine not to guarantee anything). Namely:

- The instructions fsqrt, fceil, ffloor, ftrunc, fnearest, fadd, fsub, fmul, fdiv, fmin, fmax, promote (f32 as f64), and demote (f64 as f32) do not preserve the payload or sign bits for non-canonical NaNs, and do not preserve the sign bit for canonical NaNs (where "do not preserve" means "set to a nondeterministic value").
- The instructions fneg, fabs, and fcopysign (the "sign bit operations" according to IEEE 754) fully preserve the NaN payload, only modify the sign bit as expected for the operation, and introduce no nondeterminism. (This is actually a hard requirement of IEEE 754, so it's not surprising if they're going with the "technically compliant" argument lol.)
- All other operations — copying values around, loading/storing them to memory, round-tripping arbitrary bit patterns (including the patterns of non-canonical NaNs) through float values, using them as args, returning them from functions — should all preserve the sign and payload of NaNs.
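For reference, a small predicate (my own helper, based on the spec's definition of a canonical NaN) for checking which bucket an f32 NaN falls into:

```rust
// Per the Wasm spec's definition, a canonical f32 NaN has only the most
// significant payload bit set: bits 0x7FC0_0000, ignoring the sign bit.
fn is_canonical_nan32(f: f32) -> bool {
    (f.to_bits() & 0x7FFF_FFFF) == 0x7FC0_0000
}
```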
As I mentioned before you can't portably rely on what happens to these NaN payloads if you do math on them, so I don't think what's there is a big deal if this is followed. My big concern was mostly that that last set of things wouldn't work.
A couple of additional notes:
There are probably more LLVM bugs beyond this (we hit a value-changing optimization in the portable SIMD group yesterday...).
This is not very different from platforms that turn on "flush subnormal numbers to zero" by default, (like arm32), although I feel absurdly strongly that we should not adopt that nonsense just because one platform does it.
Did some fiddling with bit patterns and NaN; things look better now on my x86_64 machine, but I haven't exactly turned this kind of thing into a test that runs on all platforms, and LLVM might be cheating by knowing my inputs already (and thus, that I'm watching it):
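(The exact snippet isn't shown above; the following is a sketch of that kind of experiment — the specific bit patterns and the `black_box` barrier are my own choices, with `black_box` addressing the "LLVM knows my inputs" concern.)

```rust
use std::hint::black_box;

// Round-trip a NaN bit pattern through f32; black_box on the input keeps
// LLVM from const-folding the from_bits/to_bits chain away.
fn roundtrip(bits: u32) -> u32 {
    f32::from_bits(black_box(bits)).to_bits()
}

// Patterns worth poking at: canonical quiet NaN, quiet NaN with payload,
// negative quiet NaN with payload, and a signaling NaN.
const PATTERNS: [u32; 4] = [0x7FC0_0000, 0x7FC0_1234, 0xFFC0_0001, 0x7F80_0001];
```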
@thomcc thanks! I have now opened a thread in the LLVM forum asking about the LLVM NaN semantics.