Julia: should min(0.0, NaN) be NaN?

Created on 6 Aug 2014  ·  51Comments  ·  Source: JuliaLang/julia

I know this has come up before and we opted to go with Matlab's behavior, but it strikes me that NaN poisoning really might be safer. After all a NaN could be larger or smaller than any value.

decision

Most helpful comment

This is one of the few cases where the standards body IEEE floating point defaults are wrong (for languages that are not assembly language or its equiv).

All 51 comments

+1 for poisoning. The same should apply to minimum too then.

One subtlety is _which_ NaN should you pick when Both arguments are NaN? I think in that case one valid choice is to use the IEEE specified total order.

x86 floating-point instructions all use the leftmost NaN argument, which seems like a reasonable thing to do. Adding more work to pick an ordering among NaNs seems unnecessary – I can't imagine any situation where which of two NaNs one picks is significant.

We went with standards here, e.g. fmin and, I believe, IEEE. This is not just taken from matlab.

IEEE does no specify min or max, just >, ==, <, <=,>= and friends.

 #include <stdlib.h>
#include <stdio.h>
#include <math.h>

void main(){
  double nann = 0.0/0.0 ;
  printf("%f\n%f\n%f",fmax(nann,1.0),fmax(nann,1.0),fmax(1.20,3.0) );
}

prints

~/D/r/intFloatRegisterFoo $ ./a.out
1.000000
1.000000
3.000000⏎

additionally, according to the c language reference:

  Returns the larger of two floating point arguments, treating NaNs as missing data (between a NaN and a    
  numeric value, the numeric value is chosen)

that seems redundant in the context of language support for explicitly coding missing data.

Section 5.3 of the IEEE fp standard specifies minNum and maxNum which behave this way.

you are correct sir! (though i must admit it bothers me, but it makes sense in the context of C)

section 5.3

― sourceFormat minNum(source, source)
sourceFormat maxNum(source, source)
sourceFormat minNumMag(source, source)
sourceFormat maxNumMag(source, source)

minNum(x, y) is the canonicalized number x if x maxNum(x, y) is the canonicalized number y if x

section 6.2 Operations with NaNs 6.2.0

Two different kinds of NaN, signaling and quiet, shall be supported in all floating-point operations. Signaling NaNs afford representations for uninitialized variables and arithmetic-like enhancements (such as complex-affine infinities or extremely wide range) that are not in the scope of this standard. Quiet NaNs should, by means left to the implementer’s discretion, afford retrospective diagnostic information inherited from invalid or unavailable data and results. To facilitate propagation of diagnostic information contained in NaNs, as much of that information as possible should be preserved in NaN results of operations.
Under default exception handling, any operation signaling an invalid operation exception and for which a floating-point result is to be delivered shall deliver a quiet NaN.
Signaling NaNs shall be reserved operands that, under default exception handling, signal the invalid operation exception (see 7.2) for every general-computational and signaling-computational operation except for the conversions described in 5.12. For non-default treatment, see 8.

Treating NaN as missing data seems incorrect and dangerous, despite the C standard.

I'd like to add a +1 for poisoning. I think if you're wanting to treat something as missing data, you have much better options than using NaN.

i'd favor poisoning too, I'm actually prepping a proposal to change how GHC haskell min and max work on floats to have the poisoning semantics (ie nan propagating).

I agree that not propagating NaNs here doesn't make sense. There's nothing special about min that grants it a definite value on undefined input. It's just that following IEEE is a good default decision.

One can have a vigorous debate about 1^NaN, but for min I'm not sure what the argument would be. Although this does raise the question of min(-Inf,NaN) :)

I would just keep it simple and say that if any argument is NaN, the result is NaN.

How about providing minnum and maxnum for programmers who want the IEEE semantics (or want the associated hardware efficiencies), and do NAN poisoning for min and max? Numpy calls the IEEE version nanmin.

I became curious about the rationale for the IEEE rules. William Kahan ("father of IEEE 754") seems to be the origin of the rules according to committee minutes:

Kahan proposed that it must be commutative & the default shall be to pass numbers over NaNs & he doesn't care about the sign of zero.

Kahan is an expert and so presumably had some motivation (possibly for experts), but I haven't been able to find it. I've sent him an email query.

It seems that the minsd instruction is ifelse(x < y, x, y), which is neither the NaN poisioning nor IEEE definition.

I erred when I wrote "associated hardware efficiencies". I had misinterpreted the description of minsd. The semantics of minsd were designed back in the '90s, well before IEEE 754-2008, to that compilers could optimize the common C idiom ```x

As for the semantics that we seem to favor, the Intel manual coyly states:

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second operand) be returned, the action of MINPD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

I'm puzzling over the emulation that the author had in mind. I don't see how a single comparison suffices, since if one of the operands is a NaN, the information from the comparison does not distinguish which operand is the NaN.

FYI, I inquired with the experts here, and the recommended AVX sequence for NaN-propagating min is:

VMIN R, a, b           // result is b if a or b are NaN, min(a,b) otherwise
                       // so Nan is not propagated only if a is the NaN
VCMPNEQ M, a, a        // M=11…11  if a is NaN, 0 otherwise
VBLENDV Res, R, a, M   // Res = R if M=0 (a not NaN), otherwise Res=a (if a is NaN)

It's only one instruction shorter than the code currently generated from the NaN-propagating definition min{T<:FloatingPoint}(x::T, y::T) = ifelse((x < y) | (x != x), x, y) . If we lose sleep over that extra instruction, we can submit a patch to LLVM. :-)

does that code sequence properly handle signedness of zero? Afaik, the VMIN operations don't distinguish -0 vs 0

The sequence will compute min(-0,+0) as +0. Detecting signedness of zero is problematic for any comparison-based min routine, since IEEE 754-2008 says Comparisons shall ignore the sign of zero (so +0 = −0).

As someone who has only an introductory systems course level understanding of IEEE 754, the one thing that sticks out in my mind about NaNs is that they tend to poison computations, so this change might be intuitive for the average user regardless of what the standard actually says :)

I like the idea of providing separate functions for those who do want the standard-specified behavior if we decide to do this.

one small detail that I think might play nice with future nan interactions, would be that, in the case that both arguments are nans, perhaps the result nan should be the bitwise OR of the two input nans (idea being that those 53 bits otherwise unused in NANs could represent different causes of the original nan error, and that would give a neat trick for saying "heres the set of ways your program borked the math!")

this LLVM thread is relevant http://article.gmane.org/gmane.comp.compilers.llvm.cvs/201804
i'm going to ask them about the nan returning semantics option

William Kahan kindly replied to my enquiry about the IEEE definition of min. His note suggests a _4th_ possible definition of min that propagates NaN except when one of the arguments is -∞, in order to preserve the identity min(-∞,y)==-∞.

Here is his reply in full:

Arch:
The recommendation that  min{ x, NaN}  and  min{NaN, x}  both be  x  instead of  NaN 
arose at a time when graph plotting programs had nowhere to put a  NaN,  and could
abort instead.  In my experience the most common occasions when  min{x, NaN}  arose 
were in plotting graphs of functions whose domains had holes.  For instance,  let  Y(x) 
be the solutions of the equation   x^2 - y^2 = 1 ,  namely  Y(x) = */- sqrt( (x-1)(x+1) ) ;
and plot  Y(x)  vs.  x  over,  say,  -2 < x < 2 .  Then the result should be two branches of 
an hyperbola separated by a gap where  -1 < x < 1 .  To bridge the gap we perform a 
"Windowing"  operation by plotting  max{-2,  min{+2, Y(x)} }  in a square window  that
barely contains  -2 < x < 2   and  -2 < y < 2 ,  putting the  NaNs  on the window's upper
edge where they can be ignored.  This resolution of the problems posed by domains 
with holes is not perfect,  but better than the alternatives available at the time to cope 
with holes whose locations could.not easily be predicted before the attempt to plot.

There must be occasions where  minm(x, NaN}  and  maxm{x, NaN}  should be  NaN 
instead of  x  to serve a computational necessity.  Here I have used different names for 
the minimum and maximum functions;  you might disagree with my choice of names.

However,  maxm(+Infinity, NaN)  must be  +Infinity,  NOT NaN,  to honor the rule that 
if a function  f(x, y)  has the property for some value  X  that  f(X, y)  is independent of 
y ,  be it finite or infinite,  then that  f(X, NaN)  must be the same as  f(X, y) .  This may 
make  maxm  harder to implement than  max . The foregoing rule is crucially necessary. 
If there were no way to get rid of an irrelevant  NaN  then it might as well stop computation 
at its earliest encounter.

I hope that my explanation helps.  If you have a better rationale I will be glad to entertain
it.

With best wishes,
                                                                    Prof. W. Kahan

huh, this is a very interesting and good point about the "Laws" of min and max on the reals

That is a great email, as expected. I think propagating NaN except for +-Inf is a very good definition for min and max.

Honestly I find the plotting justification shockingly weak. There's no reason to assume plotting must be done by applying min and max to points, and nothing further. It would be far better to check for NaN or out-of-range values and simply omit them. If there is no better justification, I would hope this is changed in the next IEEE standard revision, if there is one. They could make it backwards compatible by adding new operations with this behavior.

We should get "I :heart: Kahan" t-shirts made.

"if a function f(x, y) has the property for some value X that f(X, y) is independent of y, be it finite or infinite, then that f(X, NaN) must be the same as f(X, y) ."

Does anyone know the justification for this rule? I'm not necessarily disputing it, I primarily want to understand it.

I have also asked Professor Kahan about this, and one thing he said was that this rule assumes that the environment has floating-point flags which record floating-point exceptions that have occurred, and which can be easily queried. We don't necessarily need the return value of an expression to record that a NaN happened if there's a flag which holds that information that we can test.

However, Julia, and every other programming language I'm aware of besides assembly, doesn't provide easy and reliable access to these flags (including C, even with fenv.h, because in practice compilers like clang and GCC don't implement #pragma STDC FENV_ACCESS). It's not a coincidence that modelling floating-point operations as operations that mutate global state is not something that easily fits into high-level languages.

Given that Julia doesn't expose the floating-point exception flags, it's not clear that this justification applies here.

This rule does not apply e.g. for adding 0.0. 0.0+y == y for all values of y, except if y is a nan. (I hope I'm not stumbling about a special case for -0.0 here.)

This actually inhibits compiler optimizations. With full IEEE compatibility, 0.0+y cannot be optimized to y.

My impression is that the rule is derived from the assumption that NaN represents "could be any real number". But I'm not so sure about the rule if the NaN came from sqrt(-1.0).

0.0+y is "the same as" y when y is NaN. It's just that == with a NaN is not the same as "same as". It's 0.0+(-0.0) that's the thorn in the side for optimizing 0.0+y, since 0.0+(-0.0) is 0.0. According to Muchnick's text the only safe optimization of IEEE arithmetic is replacing x/c by x*(1/c) when c and 1/c can be represented exactly (which means c has to be an integral power of 2).

The mutable global state of IEEE arithmetic was a big mistake in my opinion. It plays badly in highly pipelined processors and parallel programming environments. I once heard a talk by Guy Steele where he suggested it was time to revise IEEE 754 to remove the global state. I recall that he had to shorten the significand by one bit so that he could use the bit for other purposes, thus his proposal was not a backwards-compatible format.

I don't really follow the argument for making max(Infinity, NaN) = Infinity. This debate is very similar to the X-optimism vs. X-pessimism debate in hardware simulators, but there's a distinction, which is that X in a hardware simulator represents some unknown (but representable) value. As Arch points out, NaN can represent all sorts of crazy stuff. Why should max(Infinity, some nonsensical thing) be Infinity?

The infinity / infinity is also Nan.
On Sep 19, 2014 4:04 PM, "Dan Luu" [email protected] wrote:

I don't really follow the argument for making max(Infinity, NaN) =
Infinity. This debate is very similar to the X-optimism vs. X-pessimism
debate in hardware simulators, but there's a distinction, which is that X
in a hardware simulator represents "either 0 or 1", whereas NaN can
represent all sorts of crazy stuff, like Infinity/Infinity. What's max(Infinity,
Infinity/Infinity)?


Reply to this email directly or view it on GitHub
https://github.com/JuliaLang/julia/issues/7866#issuecomment-56246761.

Sorry, I edited the message on github so the email notification is out of
date.
On Sep 19, 2014 9:19 PM, "Carter Tazio Schonwald" [email protected]
wrote:

The infinity / infinity is also Nan.
On Sep 19, 2014 4:04 PM, "Dan Luu" [email protected] wrote:

I don't really follow the argument for making max(Infinity, NaN) =
Infinity. This debate is very similar to the X-optimism vs. X-pessimism
debate in hardware simulators, but there's a distinction, which is that
X
in a hardware simulator represents "either 0 or 1", whereas NaN can
represent all sorts of crazy stuff, like Infinity/Infinity. What's
max(Infinity,
Infinity/Infinity)?

Reply to this email directly or view it on GitHub
https://github.com/JuliaLang/julia/issues/7866#issuecomment-56246761.

Reply to this email directly or view it on GitHub
https://github.com/JuliaLang/julia/issues/7866#issuecomment-56254304.

some nonsensical thing is always NaN

though you do raise a good point indirectly, that the IEEE spec gives language implementors leeway on how they can use / treat / distinguish (different non signalling) nans. This isn't used in many languages, but it is possible!

The sqrt(-1) example shows that NaN might not be real at all, in which case max(Inf,NaN) == Inf would be erroneous. However, we raise an error for that case, rather than returning NaN, so I think that it's arguable that if you have a value of type Float64, you can assume that it's some kind of real value (even if it's not a number). I'm not sure this is super important though. Either max(Inf,NaN) => Inf or max(Inf,NaN) => NaN seem reasonable to me.

I definitely don't have a strong opinion on this, since it's a corner case of a corner case. I especially don't have strong opinion if NaN can't be a stand-in for an unrepresentable value in Julia.

Unless there's something I'm missing, I'd vote for not special casing Inf because that would add an extra branch. The thing I might be missing is, does anyone actually use Inf to squash NaN, as Kahan implies above?

In hardware, knowing whether your simulator uses X-optimism or X-pessimism is critically important, because "regular" values can cause X to be squashed in one of the two cases (e.g., if you have X coming into a mux leg that isn't selected), and you have to adjust your coding style appropriately. This seems pretty different to me, unless people actually code things where they expect NaN to get killed by Inf. That seems odd to me, but I don't have nearly as much experience as Kahan.

but I don't have nearly as much experience as Kahan.

LoL. Who does?

I don't think the cost of branch should determine what the default end user semantics are for min/max. For those who want last mile microoptimizations, having an alternative "whatever the hardware supports most efficiently" version somewhere else but easily found is probably best wrt balancing default UX

Honestly, I just don't think this matters very much – neither option is clearly right. We should just do whatever can be done most efficiently.

We should by default do whatever the standard prescribes. If people want a different option, then we can introduce a new function for this, with a name that indicates that it doesn't adhere to the standard. And if people want performance (I do!), there can be macros or options or other mechanisms to ignore the inconvenient parts of the standard.

If you look at the rest of this issue, the standard doesn't take a stand on this point, which is why @ArchRobison wrote to Kahan about this. But it's still unclear since his reasoning seems to assume properties of languages that just aren't true of any language but assembly.

I believe the IEEE standard (as well as POSIX and OpenCL) do take a stand -- if one argument is a number and the other is nan, the number shall be returned. POSIX and OpenCL call this operation fmin, IEEE calls it minNum, but this corresponds to min for Julia.

This is one of the few cases where the standards body IEEE floating point defaults are wrong (for languages that are not assembly language or its equiv).

the point being, in a high level language, NAN values are used to denote erroneous answers, and having that propagate upwards _visibly_ is good engineering (given the constraints in ease of seeing CPU flags about floating point state)

The consensus in this thread seems to be that the behavior in the existing standard is both surprising and not a good idea. The rationale given for the standard behavior, given upthread, certainly doesn't make sense now.

FWIW, the IEEE standard behavior is non-standard among actual implementations (see table 1). The only modern language I could find that matches the standard is Julia; the other two that do are C and Octave/MATLAB.

The discussion here is going toward #5234

I say we go ahead with this in 0.6.

We should probably have a function that gives the old behaviour for standards-junkies. numpy and matlab use nanmin.

Eh, let's wait and see how much complaining there is first. Can also go in a package.

FYI, this question is also discussed here: []https://stackoverflow.com/questions/49011370/nan-propagation-and-ieee-754-standard/49040225
I agree that a global state does not work well with parallel processing. Now that more and more platforms support vector processing, it is better to rely on NAN propagation than fault trapping because you can have multiple faults simultaneously in one vector. In my opinion, min and max should propagate NAN always.

That's good to hear 👍 – that is what Julia does these days.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

felixrehren picture felixrehren  ·  3Comments

Keno picture Keno  ·  3Comments

sbromberger picture sbromberger  ·  3Comments

iamed2 picture iamed2  ·  3Comments

ararslan picture ararslan  ·  3Comments