Julia: should `in` use `isequal` to test for containment?

Created on 16 Dec 2014 · 138 Comments · Source: JuliaLang/julia

This is how Dicts work, so there's a case to be made. This came up in this discussion:

https://groups.google.com/forum/#!topic/julia-users/XZDm57yHc5M

Motivating examples:

julia> NaN in [1,1.5,2,NaN]
false

julia> NaN in Set([1,1.5,2,NaN])
true

julia> NaN in keys(Dict(1=>"one", 1.5=>"one and a half", 2=>"two", NaN=>"not a number"))
true

It seems like in for arrays is the one that's out of line here.

Labels: breaking, help wanted

Most helpful comment

Revisiting this, I tend to think we should leave it as-is. Currently in has two kinds of behaviors:

  1. For containers with explicit ideas about what they contain (like dicts and sets), the container determines the equality test used.
  2. Everything else uses ==.

This is nice and simple. If we don't want to change rule (2) to use isequal instead of ==, but we want to change something, then we have to move to 3 kinds of behaviors instead of 2. I think that would be bad; for example x in (f(y) for y in itr) should give the same answer as x in [f(y) for y in itr].

All 138 comments

Wouldn't this necessitate two versions of Dicts / Sets, one using isequal and the other using egal (===) for containment? I would find this behavior very confusing. For example, is Set([1,1.0]) a single-element set?

Edit: I see we already have this behavior.

This is another case that can be justified on the basis that the IEEE standard specifies the behavior of ==, but not of in. In other words, the standard doesn't say an in function must use IEEE == to compare elements. Arguably the IEEE NaN behavior should be used as rarely as possible. The case of -0.0 is trickier; 0 in A not finding a -0.0 could be a problem.

@jakebolewski, Set([1,1.0]) _is_ already a one-element set:

julia> Set([1,1.0])
Set([1.0])

This is a pretty straightforward consequence of using Dict as the foundation for Set. If we wanted the egal behavior, we would need an ObjectIdSet type to correspond to ObjectIdDict.
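For illustration, here is a minimal sketch of what such a type could look like, mirroring how Set wraps Dict (ObjectIdSet is hypothetical here; modern Julia spells ObjectIdDict as IdDict and has an IdSet along these lines):

struct ObjectIdSet{T}
    dict::IdDict{T,Nothing}
end
ObjectIdSet{T}() where {T} = ObjectIdSet{T}(IdDict{T,Nothing}())

Base.push!(s::ObjectIdSet, x) = (s.dict[x] = nothing; s)
Base.in(x, s::ObjectIdSet) = haskey(s.dict, x)  # key lookup compares with ===
Base.length(s::ObjectIdSet) = length(s.dict)

s = ObjectIdSet{Float64}(); push!(s, -0.0); push!(s, 0.0)
length(s)  # 2, since -0.0 !== 0.0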

The case of -0.0 is trickier; 0 in A not finding a -0.0 could be a problem.

This kind of sucks, but I think we have to live with it – it's either that or introduce yet another kind of equality, which we can do over my cold, dead body.

We definitely don't need another kind of equality. I think we will opt to keep you alive instead.

I'm relieved.

The other option would be to make -0.0 and 0.0 the same for isequal and hashing, but that seems bad too.

Oh, we can't make isequal(-0.0, 0.0) true, since that would mean that -0.0 wouldn't sort before 0.0.

Is it really required that isless be consistent with isequal? Anyway, with several concepts of equality, you already need to look at the docs to check what its behavior is.

As a data point, MATLAB's isequal and ismember consider -0.0 and 0.0 as equal. R doesn't have isequal, but its identical function allows for either behavior via an argument, and its %in% function considers them as equal. Python's in does the same. So it doesn't look like distinguishing negative and positive zero is a compelling use case: it's fine if it's only supported in special dict/set types. On the contrary, distinguishing them by default has the potential of introducing very subtle bugs for the 99% of users (source: classified survey) who never even considered that zero is not always zero.

Is it really required that isless be consistent with isequal?

It's not strictly necessary, but I think anything else would be kind of a disaster, imo.

Interesting data point:

In [19]: np.nan == np.nan
Out[19]: False

In [20]: np.nan in [np.nan]
Out[20]: True

In [21]: -0.0
Out[21]: -0.0

In [22]: 0.0 in [-0.0]
Out[22]: True

I'm not sure how python's in is handling equality here.

It's not strictly necessary, but I think anything else would be kind of a disaster, imo.

I'm not so sure. -0.0 and 0.0 are sorted next to one another, so considering them as equal does not go against the ordering, it just loses a small bit of information. (I guess there is theory to express that more generally.) In what cases would that be an issue?

One thing worth considering is making isless match the IEEE totalOrder predicate: the main difference is that NaNs with different bit-patterns are distinct, and negatively-signed NaNs precede -Inf: in a certain sense, this is the ordering corresponding to ===.

This could be done via

function totalorder(x::Float64, y::Float64)
    xi = reinterpret(Int64, x)
    yi = reinterpret(Int64, y)
    # `$` is bitwise xor (pre-0.6 syntax; spelled `xor` or `⊻` today): flip the
    # comparison when both sign bits are set, since negative floats order in
    # reverse of their integer bit patterns.
    (xi < yi) $ ((xi < 0) & (yi < 0))
end

(which also seems to be faster than our current isless).

That sort order has always struck me as weird and useless – why would I want some NaNs to come at the beginning and some at the end? If we started doing this, I feel like we'd almost need to start printing NaNs with a sign, which would also be strange. The only advantage I can see to the IEEE ordering is that you can implement it efficiently with integer comparisons.

you can implement it efficiently with integer comparisons.

Yep, that's how to think like a 1980s computer scientist :)

Alright, so I would be ok with making isless a finer grained ordering than isequal, namely:

  1. isequal(-0.0, 0.0) but isless(-0.0, 0.0).
  2. isequal(nan1, nan2) for any two NaN values, but have isless order NaNs by bit pattern.
  3. all other equivalence classes are the same for isequal and isless.

I'm not inclined to have some NaNs negative while others are positive. That's just silly and useless.

Yes, that might be OK. Have to think about it.

By the way, if there are any python experts around I really would like to know what is happening in the example I posted above. It seems in is not calling __eq__ for np.nan. However it doesn't use is either, as this example shows:

In [1]: 1 is 1.0
Out[1]: False

In [2]: 1 == 1.0
Out[2]: True

In [3]: 1 in [1.0]
Out[3]: True

EDIT: Ok, my guess is that in tries is first, and returns true if that succeeds, before calling __eq__.

Oh, Python, you so crazy.

Python's in is implemented by PySequence_Contains, which defaults to using PyObject_RichCompareBool(obj, item, Py_EQ), which calls the tp_richcompare slot of the type object, which for floating-point types turns into float_richcompare (whose comments begin "Comparison is pretty much a nightmare"), which uses the C == for two floating-point variables. _However_, PyObject_RichCompareBool _first_ checks for object identity, in which case it just skips the comparison. (See also the documentation for PyObject_RichCompareBool, which actually mentions this.)

So, Jeff's guess is correct.

(Yes, I've spent far too much time looking at the CPython codebase for someone who never programs in Python.)

Clearly, I used up my "oh, Python, you so crazy" way too early in this thread.

This is an abstract from Python's documentation for comparison operators:

>>> help('in')

...

Comparison of objects of the same type depends on the type:

...

* The values "float('NaN')" and "Decimal('NaN')" are special. They are identical to themselves ("x is x") but are not equal to themselves ("x != x"). Additionally, comparing any value to a not-a-number value will return "False". For example, both "3 < float('NaN')" and "float('NaN') < 3" will return "False".

...

The operators "in" and "not in" test for membership.  "x in s"
evaluates to true if *x* is a member of *s*, and false otherwise.  "x
not in s" returns the negation of "x in s".  All built-in sequences
and set types support this as well as dictionary, for which "in" tests
whether the dictionary has a given key. For container types such as
list, tuple, set, frozenset, dict, or collections.deque, the
expression "x in y" is equivalent to "any(x is e or x == e for e in
y)".

For the string and bytes types, "x in y" is true if and only if *x* is
a substring of *y*.  An equivalent test is "y.find(x) != -1".  Empty
strings are always considered to be a substring of any other string,
so """ in "abc"" will return "True".

For user-defined classes which define the "__contains__()" method, "x
in y" is true if and only if "y.__contains__(x)" is true.

For user-defined classes which do not define "__contains__()" but do
define "__iter__()", "x in y" is true if some value "z" with "x == z"
is produced while iterating over "y".  If an exception is raised
during the iteration, it is as if "in" raised that exception.

Lastly, the old-style iteration protocol is tried: if a class defines
"__getitem__()", "x in y" is true if and only if there is a non-
negative integer index *i* such that "x == y[i]", and all lower
integer indices do not raise "IndexError" exception.  (If any other
exception is raised, it is as if "in" raised that exception).

The operator "not in" is defined to have the inverse true value of
"in".

The operators "is" and "is not" test for object identity: "x is y" is
true if and only if *x* and *y* are the same object.  "x is not y"
yields the inverse truth value.

So even while 1 and 1.0 are equal, they are not the same object in memory:

In [1]: 1 is 1.0
Out[1]: False

In [2]: id(1) == id(1.0)
Out[2]: False

@Ismael-VC, so their documentation is wrong, because it effectively uses x is z or x == z, not x == z.

Actually, the radix sort function in SortingAlgorithms.jl does use integer comparisons (and some shenanigans) for sorting floating-point numbers.

I think -0.0 is more important than NaN: code that tries to compare with NaN is always going to have problems. Finding 0 in an array containing -0.0 would be what most people want. So I think we have to use either "== or isequal" or just ==. Using a specific function is definitely simpler, so I say we keep the current behavior.

How would you implement that efficiently for Sets? I feel that, as containers, arrays should behave like ordered multisets, which to me implies using isequal.


I think it's mostly ok for different containers to use different tests. Not finding 0 in an array with -0.0 could cause serious problems.

We could also make isequal(-0.0, 0.0).

Yes, @StefanKarpinski's plan at https://github.com/JuliaLang/julia/issues/9381#issuecomment-67361516 sounds like the best solution.

Thanks for finding that comment, @nalimilan. To quote it and add to it:

  1. isequal(-0.0, 0.0) but isless(-0.0, 0.0).
  2. isequal(nan1, nan2) for any two NaN values, but have isless order NaNs by bit pattern.
  3. all other equivalence classes are the same for isequal and isless.
  4. use isequal for collection containment checking (arrays and sets).

What I dislike about that is that isequal and isless are supposed to be the _sane_ comparisons. At this point I have half a mind to make NaN==NaN and -0.0==0.0 and remove isequal. Another option is to disallow NaN as a hash key.

Does the proposed change make them insane comparisons?

@StefanKarpinski Pun intended? I think he meant "the same comparisons".

I'm not sure that's really a requirement, though. And wouldn't NaN == NaN go against common expectations?

wouldn't NaN == NaN go against common expectations?

It's a funny way of putting it, I see what you mean but still, "common expectations" might include reflexivity of equality :-). I don't do fp math so I don't know how much people are attached to this part of the IEEE semantics.

Does the proposed change make them insane comparisons?

I would say so, yes. I definitely meant "sane" and not "same". The idea of isequal and isless was to have well-behaved equal and less predicates, obeying as many of the usual mathematical properties as possible.

I think I'd be ok with isequal(-0.0,0.0) and !isless(-0.0,0.0) though. Then isequal only differs for NaN.

In #8463, there was also some discussion of making hash(0.0) == hash(-0.0): this would also make the hash function slightly simpler.

I think I'd be ok with isequal(-0.0,0.0) and !isless(-0.0,0.0) though.

I must be missing something, but wouldn't this cause problems with sorting?

wouldn't this cause problems with sorting?

I suppose you'd get zeros and negative zeros in an undefined order. We could provide an IEEE totalOrder predicate as well. But look, IEEE says -0.0 and 0.0 are equal, so what do they expect?

Looking back at the original motivation of the issue, I think one has to distinguish the case of deliberately searching for NaN using NaN in Y, vs. writing x in y generically, and considering the behavior when x happens to be a NaN. In the first case, you really need to use isnan. In the second case, it doesn't matter what happens --- any program that ignores NaNs is going to get weird answers if NaNs arise somewhere.

It would help to disallow NaNs as dict keys. It would make sense to have a test like

if !(key == key)
    error("key cannot be reliably identified and so cannot be inserted")
end

Then we could really start considering removing isequal.

Whatever IEEE says about -0.0 and 0.0 being ==, having you get a many-to-one relationship when inserting something into a set or dict would not be a good thing, IMO. That's losing information. NaN seems kind of like null; do you allow a null key in a set or dict? Also, all NaNs are not alike.

I do like the above test to give an error. I think it is consistent with the way null should be handled, i.e. a null value is not == to anything else, even another null, so the above test would handle that consistently.

Of all the possible issues here, I find it especially easy to imagine problems caused by distinguishing -0.0 and 0.0. Say an array has a -0.0, then somebody needs to ask "are there any zeros here?" It's definitely surprising if 0 in a doesn't find the zero.

With -0.0, you have to be really careful to propagate signs (for example this makes the code in complex.jl much more...complex). If it didn't exist, you'd have to be really careful computing 1/x where x might be -Inf. To me this is a wash.

OK, then why not just say that -0.0 is converted to 0.0 when it is used as a key? Instead of trying to do something tricky when looking for it (making hashes equal even though the bit patterns are different, etc.)

According to https://github.com/JuliaLang/julia/issues/9381#issuecomment-114568028 it's not any trouble to make -0.0 and 0.0 hash equally, given what our hash function needs to do anyway.

Normalizing -0.0 to 0.0 on key insertion might be a good idea though.

I think that's easier to explain to people (it's the approach we took for -0.0, and didn't have any complaints).

@JeffBezanson, if you're going to call something crazy, it's helpful if you explain why you think it's crazy. Is it the ordering of NaNs that you object to? I have to assume so since you've said that making isequal(-0.0, 0.0) seems acceptable to you. The NaNs part is not that necessary, but it would make sorting floats much simpler since you wouldn't need to use a stable partition algorithm when moving the NaNs to the end.

Having isequal(-0.0, 0.0) and isless(-0.0, 0.0) shouldn't be a problem for sorting since we never use isequal when sorting, we only use isless. In general, it's fine for isless to be more fine-grained than isequal, but you want isequal and hash to agree in the sense that isequal(x,y) implies hash(x) == hash(y), and the inverse implication holds except for collisions, which should be hard to construct.
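A quick sanity check of that contract as it currently stands (all NaN payloads are isequal and hash alike, while -0.0 and 0.0 are neither isequal nor hash-equal):

for (x, y) in ((NaN, -NaN), (1, 1.0), (-0.0, 0.0))
    @assert isequal(x, y) <= (hash(x) == hash(y))  # Bool <= Bool reads as implication
end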

I really dislike the idea of converting -0.0 to 0.0 upon using it as a key value. It would then be the _only_ value we convert to a different value of the same type when using it as a key. I dislike the idea of disallowing NaN as a key value even more. That smacks of Matlab-style pointless exceptionalism.

I did say why:

The idea of isequal and isless was to have well-behaved equal and less predicates, obeying as many of the usual mathematical properties as possible.

I think it's fair to say that ideally, we would only have == and <. It's frustrating to add an extra set of comparison functions, and _still_ not have them obey expected properties.

The idea is not to disallow NaN specifically, but to sanity check that x == x holds. Of course NaN is the only value we know of that violates that, but it seems more justifiable than checking isnan.

If you have a SQL style null value, it needs the same treatment, as x != x.

I disagree that the point of isequal and isless is to "obey as many of the usual mathematical properties as possible". The point of isequal is to be used for hashing while the point of isless is to be used for ordering things – if they fail at this then you'd need yet another pair of relations for those purposes. The whole notion of "obey the expected properties" is ill defined – different people expect different things. I think that we should make sure that isequal and isless are good for hashing and ordering and live with == and < as the day-to-day mathematical operators, despite their IEEE oddities.

In this democracy that is Julia, I'll vote against disallowing NaN as a key. I'm worried that it will lead to all kinds of special-casing in user code; eg, I do a group-by on a dataframe column, some of whose values are NaN. I'm expecting a dictionary back whose keys are the unique values in that column, but that would no longer be possible.

I agree – disallowing NaN as a hash key would be a disaster.

@malmaud Don't you then need to make sure you preserve the bit value of the NaN? (As there are different NaNs, and if you really want to be saving them, you should be preserving them as-is.)

Maybe the problem is trying to use isequal/isless, which have other uses, where you don't want to change the meaning. Instead, for sets/dicts/etc. you could have a sortsbefore and sortsequal (I don't know if those names are already used; if so, pick others), where the fallback would be isless and isequal. For nullable floats, you could have null sortsbefore NaNs; then for any floats, you'd have NaNs sortsequal with the same bit pattern and sortsbefore based on their bit values, then -Inf, all negative floats, -0.0, 0.0, all positive floats, +Inf.

Ok, I can live with NaNs as dict keys.

I think isequal being consistent with isless directly relates to usefulness for hashing and ordering. For example, we have the following definition to order Pairs lexicographically:

isless(p::Pair, q::Pair) = ifelse(!isequal(p.first,q.first), isless(p.first,q.first),
                                                             isless(p.second,q.second))

Under the proposed change, this definition is wrong. All it does is implement an obvious definition of lexicographic ordering: if the first elements are equal, look at the second elements. As it is now, isequal and isless provide a welcome respite from floating point corner cases.

Ok, now we're getting somewhere – that's a good, practical reason why this change may be problematic.

You could write it as

isless(a::Pair, b::Pair) = isless(a[1],b[1]) | !isless(b[1],a[1]) & isless(a[2],b[2])  # & binds tighter than |

But the larger question is if it's worth making working with isless and isequal generally harder.

Right.

I agree that sorting NaNs to the end of an array is useful, but how important is it to sort -0.0 before 0.0? After all, they're equal.

We'd have to do stable sorting for the portion of the array that's ±0.0 then, which is equally annoying. They're == but not === or isequal and there's a visible difference between -0.0 and 0.0.

They are not identical. If you want to, say, have a set with all the values you see, you'd want to know that -0.0 was in the set, not 0.0. Otherwise, you might as well convert the -0.0 to 0.0 on insertion. Same thing for NaN: you are just using the printed representation of NaN, but what about all the other bit patterns that are NaN? Sometimes those are displayed, such as -NaN(s1234).

Sure they're not identical, but that's what === is for. What -0.0 means is not generally agreed on; it's really just an alternate representation of zero.

Clearly one can construct cases both where you want to distinguish them, and where you don't. However I would _heavily_ weight things towards _not_ distinguishing them. I think it's much worse to end up with two zeros when you think you have unique values.

There seems to be a bit of a problem, if you have two different NaNs, say:

julia> a = reinterpret(Float64,0xfff0000000000001)
NaN
julia> b = reinterpret(Float64,0xfff0000000000002)
NaN
julia> c = reinterpret(Float64,0xfff0000000000001)
NaN
julia> a == b
false
julia> a === b
false
julia> a === c
true
julia> hash(a)
0x15d7d083d04ecb90
julia> hash(b)
0x15d7d083d04ecb90

The way hash works is not consistent with ===, nor with ==.

Also, there's this issue with hash of 0.0 and -0.0:

julia> hash(0.0)
0x77cfa1eef01bca90
julia> hash(-0.0)
0x3be7d0f7780de548
julia> -0.0 === 0.0
false

As documented, hash is consistent with isequal. As the original issue describes, dicts use isequal.

The hash function only needs to be consistent with the isequal function, which it is.

???
Normally, a == b implies hash(a) == hash(b), right? That does not hold true at all for NaNs. Also, as I showed, hash(0.0) != hash(-0.0), even though 0.0 == -0.0.

Yes, hash is consistent with _some_ notion of equality, but which one? Answer: isequal.

Then _why_ does hash(0.0) return a _different_ result from hash(-0.0)?

No, since == implements IEEE equality which is very much not what you want for hashing. Consider what the fact that NaN != NaN would imply for doing d[NaN] = 1 repeatedly.
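To make that concrete, a small example of why key lookup needs a reflexive relation (this is current behavior, since Dict compares keys with isequal):

d = Dict{Float64,Int}()
d[NaN] = 1
d[NaN] = 2
length(d)  # 1: the second assignment finds the existing slot via isequal;
           # with ==, NaN != NaN would make every insertion add a new entry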

So, hash is NOT consistent with isequal, not for 0.0 and -0.0.

julia> isequal(-0.0, 0.0)
false

Never mind... so there are ==, isequal, and ===? Sorry, I thought == was the same as isequal, like === is the same as ≡.
Some very strange things happen currently with using NaNs as keys in Dicts, for example.
This seems like a bug to me:

julia> g = Dict(1=>"one", 1.5=>"one and a half", 2=>"two", NaN=>"not a number")
Dict{Any,ASCIIString} with 4 entries:
  NaN => "not a number"
  2   => "two"
  1.5 => "one and a half"
  1   => "one"

julia> u = keys(g) ; a,p = next(u,1) ; reinterpret(UInt64,a[1])
0x7ff8000000000000

julia> setindex!(g,"silly",-0.0/0.0)
Dict{Any,ASCIIString} with 5 entries:
  NaN                => "silly"
  2                  => "two"
  1.5                => "one and a half"
  1                  => "one"

julia> u = keys(g) ; a,p = next(u,1) ; reinterpret(UInt64,a[1])
0xfff8000000000000

Also, if you store 0.0 in a Dict and then store 0, you overwrite the 0.0 entry. If you store 0x00000, it also replaces it. It seems very strange to have the types of the keys changing if they are being treated as the same value.

@ScottPJones, see https://github.com/JuliaLang/julia/pull/6624 for context.

Design criterion (in retrospect): a Dict{K,V} tries to convert keys and values to types K and V upon insertion into the dict. You would also like code using Dict{K1,V1} and Dict{K2,V2} objects to behave the same (value-wise) when inserting and looking up the same values, assuming that neither one throws a convert error because some key or value couldn't be converted to K1 or K2, etc. I'm pretty sure this implies that you have to hash things by value the way we currently do.
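A small illustration of that criterion (hashing is by value, so the key type doesn't change lookup results):

d1 = Dict{Float64,String}(1.0 => "one")
d2 = Dict{Any,String}(1.0 => "one")
haskey(d1, 1), haskey(d2, 1)  # (true, true): hash(1) == hash(1.0) and isequal(1, 1.0)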

I'm starting to think it doesn't really matter whether or not isless(-0.0, 0.0). Let's fix the inconsistency between arrays and dicts reported in the issue description one way or the other, _that_ is much more important. Then the detail of -0.0 can be reconsidered if we find more data points (use cases?) -- I would be surprised if it breaks existing code.

I let my subconscious process this overnight, and I realized what bothered me about this. This is a leak of the implementation into the abstraction, IMO. You are implicitly assuming that hashing has anything to do with Set or Dict.

I don't really get the objection. You can use isequal for things that are not implemented with hash. The main property it needs is being an equivalence relation, unlike ==, and not being as granular as ===.

Implementing a Set or Dict on top of an associative array (whether a B+tree for persistent storage, or a hash table (with ordering) or ptrie structure in memory), you'd convert all of the keys into a vector of bytes that gives you the ordering you wish. But if you want all NaNs (or -0.0, 0.0, 0, 0x0) to map to the same key, then you can't get back the type of the key, which is what's happening here (unless you expend a lot of extra space to store the original key as well).

There is some notion of equality under which 0.0 and 0x0 are equal, and some notion under which they are not. Both kinds of relations are useful. As far as I can tell, this is fundamental. It will always be possible to have multiple representations of the same (in some meaningful sense) value on a computer.

If you're advocating for === as the "one true equality operator", then I don't entirely disagree. It's the only thing that reliably distinguishes values if and only if they _can_ be distinguished.

However, if we assume that somebody wants ==, and a dictionary that uses it, there's no reason for them to complain when the key 0.0 is replaced by 0x0000, since those values are equal according to the relation they're using.

Yes, both are useful, and hopefully it would be possible to make a Dict where you could define the equality and isless tests. In my experience, though, I'd never seen a dictionary where the relation was == and it always updated the key when a field was modified. The key was always stored in a canonical form (i.e. Int(0) instead of Float64(0), UInt8(0), UIntxx(0), "0", etc.). That is the part that threw me with the way Dicts are implemented in Julia. It also makes it harder to compare two Dicts (== can be expensive, compared to using === on the transformed keys).

I could imagine parameterizing a Dict by the type of equality you wanted (defaulting to whatever is most reasonable).

Yes, both are useful, and hopefully it would be possible to make a Dict where you could define the equality and isless tests.

I've argued before and I still believe that this is better accomplished by explicit normalization of values before using them as keys. We have typed dicts, so you can already enforce values being canonicalized to some expected type like Int. If you want object identity hashing, that's what ObjectIdDict is for.

I'm going to throw this proposal out there:

  1. Get rid of isequal and always use x == y || x != x && y != y or its equivalent as the canonical "intuitive" notion of equivalence (sketched in code after this proposal). The isequal function could remain as a hook to allow overriding the implementation of that relation for various types, but it should always _implement_ that relation.

    1. This implies that there is a single equivalence class for things that are not self-equal.

    2. It also implies that -0.0 and 0.0 are equivalent and should hash the same.

  2. Keep isless to allow custom ordering, defaulting to isless(x,y) = x < y but allowing types to override this ordering with a finer relation, but not with a coarser one. That is:

    1. One can have isequal(x, y) and isless(x, y), e.g. for -0.0, 0.0 or for different NaNs.

    2. If isless(x, y) one must have !isequal(x, y).

In particular, this would mean that floating point values would sort in this order:

  1. -Inf
  2. negative finite values
  3. -0.0
  4. 0.0
  5. positive finite values
  6. +Inf
  7. NaNs sorted by their bits

Classes 3 and 4 would be a single equivalence class under isequal and class 7 would be a single equivalence class even though its contents would be ordered. When using isless, you'd have to be careful not to assume that !isequal(x, y) ⟺ isless(x, y) || isless(y, x) since only the forward implication would hold, not necessarily the backward one. In general, this would simply entail defining generic isless methods in terms of more specific isless methods, rather than using isequal as above. Most of the time you would not bother defining isless or isequal since the defaults would be correct.
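As a sketch, the canonical relation in item 1 of the proposal could be written as a plain function (the name equiv is hypothetical, not a Base definition):

equiv(x, y) = x == y || (x != x && y != y)

equiv(NaN, -NaN)  # true: both fail x == x, so they fall in the one non-self-equal class
equiv(-0.0, 0.0)  # true: IEEE == already holds
equiv(NaN, 0.0)   # false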

I would argue from the point of view of a user.

The operators == and < are somewhat precious, and people should be able to use them (for their own types) with whatever generalization of equality or ordering they see fit. Such a generalization may not be useful for searching or sorting, and thus Julia uses isequal and isless for these. That is, while isequal defines equivalence classes and isless defines a total ordering of these, the operators == and < don't necessarily need to. (This is different from @StefanKarpinski's proposal above.) If isequal and isless behave in a strange manner that is basically unheard of except for IEEE floating-point semantics, too many people would be surprised.

By default, isequal and isless are defined in terms of == and <. This gives people freedom, and puts the burden of using the slightly verbose isequal and isless on those that implement sorting and searching for generic types, which is arguably a smaller number of people.

Also, since floating point numbers already have a strange notion of equality (no equivalence classes) and ordering (not ordered in the mathematical sense), this basically forces Julia to not use == and <.

How to implement isequal and isless for floating-point numbers is then up for debate. Personally, I would do this by reinterpreting the bits as integers, but obviously treating +0.0 and -0.0 as isequal, or treating all nans as isequal, both make sense. (Since floating point operations are inexact, one needs to be careful when comparing anyway, so using bitwise semantics makes sense as it sidesteps the issue.) If this doesn't work for a particular case, then one can always work around this by introducing a (cheap) wrapper class, or by using custom comparison operators, or a similar mechanism. An example in the documentation for this case would be good.

I'm starting to think that the current state of affairs where x in keys(d) and x in collect(keys(d)) don't always agree isn't so bad. For example, changing d from a Dict to an ObjectIdDict also changes the answer. Different containers can have different notions of containment.

I think the only _real_ way to synchronize x in keys(d) and x in collect(keys(d)) is to fully remove the different notions of equality they use. However the consensus seems to be that equating NaN and NaN, and distinguishing -0.0 and 0.0, are too important for that to be possible. Therefore we have to keep things as they are.

If we have isless(-0.0, 0.0) and also isequal(-0.0, 0.0), then we actually have an unnamed fourth notion of equality:

  1. ===
  2. ==
  3. == except NaNs are equal (isequal)
  4. isequal except -0.0 and 0.0 are different (isless)

One place this causes headaches is if you want to implement a lookup table that works by ordering instead of hashing. You can't just use isequal and isless the way Dicts use isequal and hash; you have to be very careful. So while this change improves some cases, I don't think it's a strict improvement overall. We should keep things as-is or fully commit to going to just 2 standard notions of equality.

@StefanKarpinski do you find that convincing?

Bump.

The Dict to ObjectIdDict comparison seems like a complete non-sequitur to me – of course those types behave differently, they're supposed to have entirely different semantics. The whole point of the keys iterator, on the other hand, is to be an efficient proxy for the collection of keys in a Dict. If keys(d) and collect(keys(d)) are going to behave differently, then I think the experiment of making keys return an iterator has failed.

What is wrong with in just using isequal instead of ==? It seems to me that this is literally the operation that isequal is designed for. The only objection made so far is that under that notion of inclusion -0.0 in [0, 0.5, 1] would be false, but I think that's an argument that we should have isequal(-0.0, 0.0) rather than that we shouldn't use isequal for testing containment.

then I think the experiment of making keys return an iterator has failed

I draw the opposite conclusion. Let's say the only way to get a collection of the keys in a dict gave them as an array. Then checking k in keys(d::ObjectIdDict) would look for k in an array, losing track of the fact that we should be using === in this case. Returning a KeyIterator instead preserves the correct comparison semantics. Given that different kinds of dicts can use different comparisons, how can keys(d) and collect(keys(d)) always behave identically?

I'd be fine with isequal(-0.0, 0.0), provided isless is consistent with that. Or, in could use x==y || isequal(x,y).

Ok, that's a good point about keys allowing === semantics to be passed through to an ObjectIdDict. It seems like the only sticking point is this annoying -0.0 thing. IEEE 754 ordering and equality is really thoroughly crappy – by far the worst part of the standard. If we made isequal(-0.0, 0.0) && !isless(-0.0, 0.0) then we would either have -0.0 and 0.0 values end up sorted in the order they originally appeared in the input array, or we could have sorting for typed Float64 arrays sort things differently than untyped arrays (which actually use isless for comparison). I'm actually not even sure how to stably sort ±0.0 values efficiently – we do this for NaN values, which is a pain, but they end up at the end of the array, which is relatively easy to handle. Zeros end up in the middle, which is harder.

Maybe python has an algorithm for that? It uses stable sorts by default. Numpy's quicksort, however, is documented as being unstable, but is used as the default for numpy arrays.

It strikes me as whimsical to care about where negative zeros end up in a sorted array. I don't have a good feel for why that would matter.

Run-length encoding floating-point data? I have no idea. But the point is that having them just end up in some random order seems janky and broken. I bet NumPy just sorts by bit pattern. They probably don't have to agree with a stable sort for a general comparison predicate. Which, technically, we could also do. Then sorting a typed Float64 array would just end up in a different order than sorting the same values in an abstractly typed array.

I have a more mundane question - is this something we will do for 0.4?

If we come to a decision soon, then yes, but right now it's unclear what we should do, so I don't think so.

I think this is settled. As long as there is more than one kind of equality, different containers might use different kinds, so (a in b) == (a in collect(b)) is not an identity.

So the thing that still _really_ bothers me about this is the original example of NaN in [1,1.5,2,NaN] being false. I realize that follows from the unfortunate definition of == for NaN due to IEEE 754, but it's just such a jarring thing for any collection. The proposal of using a disjunction of two equality predicates also isn't awesome, since it implicitly introduces yet another equality predicate, but maybe we should check x === y || x == y, since at least then you would _always_ get true when checking whether the _exact_ same object (or one programmatically indistinguishable from it) is in the collection.

Fair point; checking x===y || x==y seems pretty reasonable. === implies isequal, so collections that use isequal are unchanged. I guess we could go along with python here. The only concern might be performance.

I suspect that for any objects where == is cheap enough that it matters, this would get optimized. If the test is (x === y) | (x == y) this seems to get optimized away for Int and Float64 at least.
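A sketch of the containment test under discussion, as a standalone helper (the name inpy is hypothetical; Python's PySequence_Contains does the equivalent):

inpy(x, itr) = any(y -> x === y || x == y, itr)

inpy(NaN, [1, 1.5, 2, NaN])   # true: found by identity before == is consulted
inpy(-0.0, [0.0])             # true: found by ==, since -0.0 == 0.0
inpy(NaN, [1, 1.5, 2, -NaN])  # false: different bits, and NaN == -NaN is false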

At least I wouldn't feel like a complete loser explaining that NaN in [1,1.5,2,-NaN] is false since NaN and -NaN are not the same object and they're not equal to each other.

Well, that's where I start to hesitate. Reasonable though it may seem, this is still a fourth standard kind of equality (standard in the sense that Base uses it in some important way). The results of NaN in [+-NaN] surprise _both_ those who expect == and those who expect isequal. Maybe isequal should only equate bit-identical NaNs?

In that case, I think we should start printing sign and payload for NaN, which seems annoying.

If the payload, or whether it is quiet or signaling, is important (and I assume that it is for certain applications; note that some libraries do optionally print out the payload), then I'd want only bit-identical NaNs to be seen as equal, at least by some measures of equality (and using one as a key is one place where I would).

Use

isequal(x, y, flag0, flagNaN)
in(x, S, flag0, flagNaN)

for internal development to prevent breaking between versions?

FWIW, the issue of -0.0 in Dict has been raised today on julia-users: https://groups.google.com/d/msg/julia-users/mVWF_v8UNZg/xJXYgP5WBQAJ

Regarding NaN, maybe we could choose a default payload which would be printed in that short form, and only print NaN(payload) when it differs from the default?

Regarding NaN, maybe we could choose a default payload which would be printed in that short form, and only print NaN(payload) when it differs from the default?

I think we generally do have a default payload, but some operations can change the sign:

julia> reinterpret(UInt64, NaN)
0x7ff8000000000000

julia> reinterpret(UInt64, -NaN)
0xfff8000000000000

It's seeming like we should hold off on changing comparison / equality / containment right now, but there's a lot of related ideas being thought about that came out of it. We'll iterate and try some changes for 0.6.

edit: we should probably break this down into smaller more focused pieces since there's a lot going on in this discussion

FWIW, I think we should:

  • make in use isequal everywhere.

    • this would allow == to throw errors (#15983).

  • make isequal(0.0, -0.0)

    • it seems weird that we have this be false, but isequal(0.0, 0) is true

  • make isless(-0.0, 0.0) == isless(0.0, -0.0)

    • My general feeling is that we should only respect the ordering of signed zeros when it's "free" (like min/max, c.f. #14621). I actually think this might be better behaviour anyway, especially when using stable sorts.

    • Since dispatch on functions is now fast, users can pass their own predicates to sort

  • create an ieeetotalorder predicate for anyone who cares about things like arbitrary standards and ordering of signed zeros

    • can also create ieeemin and ieeemax to resolve #7866

isequal(0.0, -0.0) and isless(-0.0, 0.0) == isless(0.0, -0.0) would seem to imply that zeros should be sorted stably, which I'm actually not even sure how to do efficiently :-. Unlike the NaN situation where you know they're all going to go at the end of the array, you don't know where the zeros are going to end up, so you can't just place them there as a first pass.

I don't think that is a problem, we don't guarantee that Any[0, 0.0] would be sorted stably either.

Actually, we do. We only use unstable sorts for types where instability makes no difference.

Ah, I see: not sure what we could do then.

Crazy idea: we could do a pass to check whether an array contains both -0.0 and 0.0, and if so use a stable algorithm. Or just drop the stability promise for -0.0 and 0.0.

Or just drop the stability promise for -0.0 and 0.0.

That means that for v::Vector{Float64}, sort(v) and sort(Any[v...]) will not do the same thing. That may be ok, but I'm just stating it explicitly.

I might be wrong here, but if you chose the first 0 as the first pivot, wouldn't quicksort then be stable?

Maybe? Can you make an argument for that, @simonbyrne? I'm not seeing it.

Ah, no, I realised that is wrong. There might be some clever way to do this though.

I guess one approach is to scan through the array first, counting the number of negative, zero and positive values, which determines exactly where they need to go. Then you can scan through again and move the zeros to the right region.
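A sketch of that two-pass fix-up for a sorted Vector{Float64} (fixzeros! is a hypothetical helper): because all zeros compare equal and carry no other state, rewriting the contiguous zero block by sign reproduces exactly what a stable sort would have produced.

function fixzeros!(v::Vector{Float64})
    nzero = count(iszero, v)
    nneg  = count(x -> iszero(x) && signbit(x), v)  # how many -0.0s
    i = findfirst(iszero, v)                        # zero block is contiguous once sorted
    if i !== nothing && 0 < nneg < nzero
        for k in 0:nzero-1
            v[i+k] = k < nneg ? -0.0 : 0.0          # all -0.0s first, then 0.0s
        end
    end
    return v
end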

I don't see why -0.0 and 0.0 are special. Surely there are many user-defined datatypes where == and === differ, i.e. which compare equal although they have different identities. Shouldn't the discussion above regarding sorting stability apply there as well? Wouldn't this be covered by having two routines, sort and stable_sort, where the latter may have to use a slower algorithm? Or is stable sorting for IEEE floating-point numbers with the IEEE notion of equality in the absence of nans something that comes up often?

=== is not really in the mix here, the issue is == versus isequal. At the moment, == and isequal differ only for NaN and -0.0 – for all non-floating-point types, they should be exactly the same and you should only have to define ==. Stable sorting of IEEE numbers is not an issue since IEEE defines a total ordering, so it doesn't matter if your sort is stable or not. But IEEE tells you to sort NaNs by bit pattern, which is fast but inconvenient since NaNs with the sign bit set (signaling) are sorted before all other values and NaNs with the sign bit off (quiet) are sorted after all other values – having NaNs split like that is annoying and not what Matlab does, it sorts NaNs to the end.

The real issue is this. You want isequal to be compatible with isless in the sense that when isless(x,y) is defined, then exactly one of isless(x,y), isless(y,x) or isequal(x,y) should be true, and when isless(x,y) is not defined, then isequal(x,y) should be false and isless(y,x) also undefined. At the same time, you want sorting an array, by default, to behave exactly as if you had stably sorted it using the isless comparison function. You can already opt for an unstable sort by requesting an unstable algorithm, like QuickSort. For floating-point numbers, we optimize by stably moving NaNs to the end and then quicksorting the rest, which has the same effect as sorting the numbers with isless but is much faster. If we make isequal(-0.0, 0.0) then because of the first constraint, we must have isless(-0.0, 0.0) be false. We either need to figure out a way to sort them as if they were stably sorted by comparison with isless, which means that we need to keep all the zeros in their original order by sign, or we need to give up on the notion that sort by default does stable sorting – which means that the signs of the zeros and the NaNs end up in some arbitrary order based on the details of the unstable sorting algorithm.

signaling vs quiet is the first mantissa bit, not the sign bit

signaling vs quiet is the first mantissa bit, not the sign bit

Oh! Thanks for the correction. Somehow I'd gotten it into my head that it was the sign bit.

that commit message was not accurate

changes here aren't in scope for 0.6

Triage decision: we should change this and in should use isequal 💥

I should have posted this here (or maybe somewhere else?) instead of @JackDevine's PR (sorry).

I've been thinking about this as well as our current plans for null / missing, and I note that we have some interesting behavior for == which IMO has the same problems as in as described in the OP:

julia> [NaN] == [NaN]
false

julia> Set(NaN) == Set(NaN)
true

julia> using Nulls

julia> [null] == [null]
null

julia> Set(null) == Set(null)
ERROR: MethodError: no method matching length(::Nulls.Null)
Closest candidates are:
  length(::SimpleVector) at essentials.jl:528
  length(::Base.MethodList) at reflection.jl:670
  length(::MethodTable) at reflection.jl:745
  ...
Stacktrace:
 [1] union!(::Set{Any}, ::Nulls.Null) at ./set.jl:126
 [2] Set(::Nulls.Null) at ./set.jl:19

I would make the argument that "two containers are equal when they contain (tested via element in container) the same elements". I feel it would be more consistent if == on containers checked isequal on the elements (which would make the four examples above true). Anyone know if there's been another issue discussing this?

(For context, I became interested in this because I am slightly worried about the way null has "escaped" its container here in [null] == [null] whereas previously ==(::Array, ::Array) --> Bool seemed guaranteed).

So it seems like this is potentially a good coherent position to take:

  1. Make isequal(missing, missing) true in addition to isequal(NaN, NaN) whereas (missing == missing) === missing and (NaN == NaN) === false.
  2. Use isequal and isless for all generic container code like this issue proposes.
  3. Define equality of containers in terms of isequal and isless – i.e. have [NaN] == [NaN] and [missing] == [missing].

This is a containment strategy for "unusual" scalar behaviors like those of NaN and the proposed three-value logic of missing. There are some dangers here, since if you take three-valued logic seriously, missing might be in any non-empty collection and any value might be in a collection containing missing; but I think @andyferris is probably correct: that way lies madness. So the general policy would be:

  1. Use == for scalars, 3VL applies to scalars.
  2. Use isequal for containers, 3VL does not apply to containers.

It gets a little weird when you consider containers as scalars, i.e. when doing missing == [1,2,3]. Is that true, false or missing?

The Set(null) issue is a red herring, the outlier is actually NaN due to eltype(::Float64) being defined (as for all numbers). Set(Date()) fails with the same error.

@StefanKarpinski's plan is reasonable and consistent, but I'm not sure that would really be a good idea. Let me stress that in practice I don't really care how == behaves on arrays, but I don't find the arguments in favor of changing it very convincing.

First, having == call isequal recursively sounds really strange and would make the language more complex: if you want isequal, why not use it instead of ==? We should just provide an operator for isequal (Unicode and ideally also ASCII). Note that by changing this you would fix [NaN] != [NaN], but you would get [0] != [-0.0] in exchange, which sounds possibly worse than the former.

Second, we have chosen the policy to propagate missing values everywhere to be on the safe side. That can be quite cumbersome in practice. Personally for my own work I'd find it more convenient for sum to skip missing values automatically (as major proprietary statistical software all do), but that could hide bugs for use cases where you would not have expected missing values to be present. That's just the same issue for array equality: if there's a missing value, you basically don't know whether the arrays are equal or not.

Third, I don't really see why == not returning Bool would be OK for scalars, but would suddenly become a problem for containers. What is so special about them? This sounds like a half-way reaction against 3VL which will only lead to inconsistency. I'd rather bite the 3VL bullet and apply it everywhere. I could be convinced if somebody exposed why containers should be treated in a specific way.

I agree with @nalimilan here:

First, having == call isequal recursively sounds really strange and would make the language more complex: if you want isequal, why not use it instead of ==?

Quoting @andyferris from #24563

I would say that containers would be equal when they contain (i.e. element in container) the same elements.

Boiling so much information down to one sentence helps clarify the situation a lot (thanks Andy). What I don't understand about the above sentence though, is what is meant by "equal". If #24563 goes through, then we would have:

Containers are isequal when they contain (i.e. element in container) the same elements.

But we would not have:

Containers are == when they contain (i.e. element in container) the same elements.

Like Milan says, if there's a missing value, you basically don't know whether the arrays are equal or not. However, if both arrays have a missing value in the same place then they are isequal but not ==.

Slightly related question: should the type of the container matter?

E.g., if I have a Dict with consecutive integer keys, and an array with the same key-value pairs, would these be isequal (or ==)? (I assume not isequal).

More interesting for me, would Set([1,2,3]) == OrderedSet([1,2,3]) (using OrderedSet from DataStructures.jl)? Here, should the ordering be considered when comparing containers of different types?

The other way to go with this is to put missing in Base and double down on 3VL everywhere.

First, having == call isequal recursively sounds really strange

The idea here is that == and isequal should really be the same function, and therefore should be the same in as many cases as possible. However, adding missing increases the need for a distinction between == and isequal. If ([missing] == [missing]) === missing is desired, then I agree we should put missing in Base.

I do think the type of container matters. For example, we have this ugly check:

function ==(l::Associative, r::Associative)
    ...
    if isa(l,ObjectIdDict) != isa(r,ObjectIdDict)
        return false

It's pretty clear to me that we should have a keycompare(dict) function that returns the key comparison function the dict uses (isequal for Dict and === for ObjectIdDict), and that this code should check keycompare(l) != keycompare(r).
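A sketch of how that could look (keycompare is hypothetical, not in Base; ObjectIdDict has since been renamed IdDict):

keycompare(::Dict) = isequal
keycompare(::ObjectIdDict) = (===)

# the isa check above then becomes:
# keycompare(l) !== keycompare(r) && return false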

All of that is to say that the comparison function used determines the meaning of the object, and so == should take it into account. On that basis I think Set([NaN]) == Set([NaN]) is defensible, and the same would probably apply to missing.

Makes sense. Maybe it's not so bad to say that in uses == for arrays and tuples, but isequal for dicts and sets? The idea being that you need to use isequal to be able to access a dict entry with its key, but that it's not required for other collections. Using == with arrays would ensure -0.0 in [0] -> true and missing in [1] -> missing, which appear as two desirable properties to me. OTOH we would have NaN in [NaN] -> false, but I find it less problematic than the -0.0 case (since NaN are not valid numbers, so they need special treatment anyways).

Bump.

Will it be possible to make a decision on this before the feature freeze? I am not completely sure what it is that we want to do here. However, if we come up with a solid plan, then I am happy to help out.

I see the minimal effort version of a coherent story for container == going like:

  • Elements of AbstractSets and the keys of Associatives are compared by isequal.
  • Elements of Tuples, AbstractArrays and the values of Associatives are compared by ==.
  • Other types like ObjectIdDict can specialize these rules as necessary (I kind of see ObjectIdDict as quite special... I can't think of many cases where it would be beneficial to deviate from the above).

(It's also possible for us to state that the keys of AbstractArrays (and Tuples) are compared via isequal, because there isn't really a difference for integers).
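A rough sketch of how those rules might read as generic definitions (illustrative only, not what Base actually defines; Associative has since been renamed AbstractDict):

in(x, s::AbstractSet)   = any(y -> isequal(x, y), s)
in(x, a::AbstractArray) = any(y -> x == y, a)
in(k, ks::Base.KeyIterator) = haskey(ks.dict, k)  # the dict picks the key test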

That sounds good, are you arguing that we make in follow the above rules?

Sorry, I forgot to mention in.

Yes, in the precise sense of my opening statement above - the minimal effort version is to do very little, and cover it over with a convenient and somewhat coherent story afterwards :)

Going into more detail, the data people appear to consider missing as "any one of a range of values" so that the predicate missing in [missing] could actually be seen as missing (i.e. is some unknown value contained in a vector which contains some unknown value? the answer is "maybe", it depends on the values that you don't have information about). Actually, if you take that reading of the meaning of missing to its logical conclusion, then missing in [1,2,3] is missing, and 0 in [1,2,missing] is missing, which seems... interesting...

Sorry to wax philosophically here, but to me the choice between in using == or isequal is becoming more of a choice of whether this is a "software engineering" concern of containers, or a "data question" we are asking about values. With missing, we may end up in the situation where if a == b, if a < b, and potentially if a in B are not "safe" around missing data (things we thought were safe boolean predicates will no longer be, and Base and user packages will need to adjust; the benefits might be worth the effort since missing data seems difficult to deal with in many other languages and frameworks - we'll find out).

I think the answer is that sometimes you will be doing some software engineering, and sometimes you will be doing some data science, and in both cases you might want to test for containment. For == and < we have isequal and isless for the safe, conservative, software engineering question. For in, perhaps we also need a partner function like isin or contains, one using isequal and one using == as the test? (There's a case for saying that all the "predicates" that will be "infected" (sorry for negative choice of word) by missing will need a "safe" version that always returns Bool - if only so we can continue to use if statements :) )

The reasoning behind what I wrote in my last post is that testing whether a key exists seems more like software concern where in getindex you must find a match (or not) and return some data (or not) and it can't be ambiguous which one, but the values of dictionaries and the data inside arrays is "user data". This seemed like a "sensible default semantic", but I'd also argue for a way to control this as necessary (hence I mention introducing isin or contains for this purpose).

However, even that default semantic is tricky because sometimes keys will come from data (say a groupby operation) and sometimes (frequently) sets are "data".

Considering this more, I wonder if having in always use == and a new contains function always use isequal might be the simplest way forward?

It would let us use NaN and missing as a key when necessary (e.g. haskey(A, k) = contains(k, keys(A)) will work and always returns a Bool) and still let users interrogate "data" with in.

I don't think we want both in and contains. And x in keys(dict) should always use whatever key comparison function the dict uses.

Ok, so I think that the current plan is to have in use isequal for Sets and the keys of Associative, have in use == for Arrays and Tuples, and specialize as necessary for ObjectIdDict.

If people think that the above is reasonable, then I could make the changes and add a little bit to the docstring that describes the intention of in as well as the rules for people who want to create new methods for in.

Revisiting this, I tend to think we should leave it as-is. Currently in has two kinds of behaviors:

  1. For containers with explicit ideas about what they contain (like dicts and sets), the container determines the equality test used.
  2. Everything else uses ==.

This is nice and simple. If we don't want to change rule (2) to use isequal instead of ==, but we want to change something, then we have to move to 3 kinds of behaviors instead of 2. I think that would be bad; for example x in (f(y) for y in itr) should give the same answer as x in [f(y) for y in itr].

Sorry - dumb question - what's not a container but supports in? (Edit: maybe I misread that slightly)

