Julia: custom hashing is too easy to accidentally break

Created on 17 Jul 2015 · 49 comments · Source: JuliaLang/julia

If you define a custom == for a type and don't define a matching hash method for it, it silently breaks hashing of that type:

julia> type Foo
           x::Int
       end

julia> ==(a::Foo, b::Foo) = a.x == b.x
== (generic function with 109 methods)

julia> [unique([Foo(4), Foo(4), Foo(4)]) for _ = 1:25]
25-element Array{Array{Foo,1},1}:
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4)]
 [Foo(4),Foo(4),Foo(4)]

Whether you get one, two, or three values depends on whether the low bits of their memory addresses happen to collide. This is because the fallback hash is based on object identity, which for mutable objects is based on the memory address. See discussion here:
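For context, the fallback in question behaves roughly like the following simplified sketch (not the actual Base definition, which is more involved; hash_fallback is a made-up name, and objectid was spelled object_id at the time):

# Simplified sketch, not the exact Base code: absent a custom method,
# hashing reduces to object identity, which for mutable objects is
# derived from the memory address.
hash_fallback(x, h::UInt) = hash(objectid(x), h)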

Having it this easy to _silently_ break the hashing of a type seems like a bad design. Although the problem is worse for mutable types since people inevitably want to overload == for them, it's not just a problem for mutable objects:

julia> immutable Bar
           x::Int
           y::Int
       end

julia> ==(a::Bar, b::Bar) = a.x == b.x
== (generic function with 110 methods)

julia> [unique([Bar(i,i+j), Bar(i,2i+j), Bar(i,i+2j)]) for i=1:25, j=1:3]
25x3 Array{Array{Bar,1},2}:
 [Bar(1,2),Bar(1,3)]                 [Bar(1,3),Bar(1,4)]                 [Bar(1,4),Bar(1,5),Bar(1,7)]
 [Bar(2,3),Bar(2,5),Bar(2,4)]        [Bar(2,4),Bar(2,6)]                 [Bar(2,5),Bar(2,7),Bar(2,8)]
 [Bar(3,4),Bar(3,7),Bar(3,5)]        [Bar(3,5),Bar(3,8),Bar(3,7)]        [Bar(3,6),Bar(3,9)]
 [Bar(4,5)]                          [Bar(4,6),Bar(4,10),Bar(4,8)]       [Bar(4,7),Bar(4,11),Bar(4,10)]
 [Bar(5,6),Bar(5,11),Bar(5,7)]       [Bar(5,7),Bar(5,12),Bar(5,9)]       [Bar(5,8),Bar(5,13),Bar(5,11)]
 [Bar(6,7),Bar(6,13)]                [Bar(6,8),Bar(6,14),Bar(6,10)]      [Bar(6,9),Bar(6,15)]
 [Bar(7,8),Bar(7,15),Bar(7,9)]       [Bar(7,9),Bar(7,16),Bar(7,11)]      [Bar(7,10),Bar(7,17),Bar(7,13)]
 [Bar(8,9),Bar(8,17),Bar(8,10)]      [Bar(8,10),Bar(8,18),Bar(8,12)]     [Bar(8,11),Bar(8,19),Bar(8,14)]
 [Bar(9,10),Bar(9,19),Bar(9,11)]     [Bar(9,11),Bar(9,20),Bar(9,13)]     [Bar(9,12),Bar(9,21),Bar(9,15)]
 [Bar(10,11),Bar(10,21),Bar(10,12)]  [Bar(10,12),Bar(10,22),Bar(10,14)]  [Bar(10,13),Bar(10,23),Bar(10,16)]
 [Bar(11,12),Bar(11,23),Bar(11,13)]  [Bar(11,13),Bar(11,24),Bar(11,15)]  [Bar(11,14),Bar(11,25),Bar(11,17)]
 [Bar(12,13),Bar(12,25),Bar(12,14)]  [Bar(12,14),Bar(12,16)]             [Bar(12,15),Bar(12,18)]
 [Bar(13,14),Bar(13,27),Bar(13,15)]  [Bar(13,15),Bar(13,28),Bar(13,17)]  [Bar(13,16),Bar(13,29),Bar(13,19)]
 [Bar(14,15),Bar(14,29),Bar(14,16)]  [Bar(14,16),Bar(14,30)]             [Bar(14,17)]
 [Bar(15,16),Bar(15,31),Bar(15,17)]  [Bar(15,17),Bar(15,32),Bar(15,19)]  [Bar(15,18),Bar(15,33),Bar(15,21)]
 [Bar(16,17),Bar(16,33),Bar(16,18)]  [Bar(16,18),Bar(16,34)]             [Bar(16,19),Bar(16,35)]
 [Bar(17,18),Bar(17,35),Bar(17,19)]  [Bar(17,19),Bar(17,36),Bar(17,21)]  [Bar(17,20),Bar(17,37),Bar(17,23)]
 [Bar(18,19),Bar(18,37),Bar(18,20)]  [Bar(18,20),Bar(18,38),Bar(18,22)]  [Bar(18,21),Bar(18,39),Bar(18,24)]
 [Bar(19,20),Bar(19,39),Bar(19,21)]  [Bar(19,21),Bar(19,23)]             [Bar(19,22),Bar(19,41),Bar(19,25)]
 [Bar(20,21),Bar(20,22)]             [Bar(20,22),Bar(20,42),Bar(20,24)]  [Bar(20,23),Bar(20,43),Bar(20,26)]
 [Bar(21,22),Bar(21,43)]             [Bar(21,23),Bar(21,44),Bar(21,25)]  [Bar(21,24),Bar(21,45),Bar(21,27)]
 [Bar(22,23),Bar(22,45),Bar(22,24)]  [Bar(22,24),Bar(22,46),Bar(22,26)]  [Bar(22,25),Bar(22,47),Bar(22,28)]
 [Bar(23,24),Bar(23,47),Bar(23,25)]  [Bar(23,25),Bar(23,48),Bar(23,27)]  [Bar(23,26),Bar(23,49),Bar(23,29)]
 [Bar(24,25),Bar(24,49),Bar(24,26)]  [Bar(24,26),Bar(24,50),Bar(24,28)]  [Bar(24,27),Bar(24,51),Bar(24,30)]
 [Bar(25,26),Bar(25,51)]             [Bar(25,27),Bar(25,52),Bar(25,29)]  [Bar(25,28),Bar(25,53),Bar(25,31)]

All 49 comments

Yup, I linked to that in the middle of the post. Probably should have put it at the end.

Oops, sorry, I should have read more carefully.

No worries – I should have posted more clearly :-)

@StefanKarpinski Is there something more we need to worry about if we just cache the hash or otherwise add an additional comparison of the hash as you proposed somewhere in that thread? We could just throw an error in such a case.

And by such case I mean when they are == but the hash is not.

That would fix it, but you would only get the error in the cases where two hashes accidentally "collide". I'm kind of into the idea of defining == and hash generically in terms of some identity_parts function or something like that.

That would fix it, but you would only get the error in the cases where two hashes accidentally "collide".

Getting a predictable result (unique returning 3 items in the case you have above) or a clear error is arguably the best response I can imagine for undefined behavior.
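For illustration, such a check could look roughly like this (hypothetical; checked_isequal is a made-up name and none of this is actual Base code):

# Hypothetical consistency check: fail loudly at the point where a broken
# == / hash contract would otherwise cause silent misbehavior.
function checked_isequal(k1, k2)
    isequal(k1, k2) || return false
    hash(k1) == hash(k2) ||
        error("isequal holds but hashes differ: inconsistent == and hash for $(typeof(k1))")
    return true
end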

in terms of some identity_parts function

I just think it might be kind of a headache for type stability for types that allow "missing values".

Getting a predictable result (unique returning 3 items in the case you have above) or a clear error is arguably the best response I can imagine for undefined behavior.

I agree that it might be hard to hit the error and notice the undefined behavior in certain scenarios, so I'm not saying this is the best we can do either (e.g. if identity_parts works).

I just think it might be kind of a headache for type stability for types that allow "missing values".

Actually, we do have such a type. The question is basically: how do you define identity_parts for Nullable?

Feels like it should return a Tuple{Bool, Union{}} for type stability if it's a missing value. Would be funny if it's not type stable for Nullable, given that's the whole point of the type...

I guess we should just remove the fallback for hash that uses object_id.

I guess we should just remove the fallback for hash that uses object_id.

Can I frame this? Many hours wasted in debugging (lack of recompilation + wrong bootstrap order = hashing by object id unexpectedly).

Hooray!

Huh? What have you guys been ~~smoking~~ hashing?

hashing my life away

Coke

Question: What is an "identity tuple"? Is the idea that if identity_parts(a) and identity_parts(b) return the same identity tuple then they get the same hash and == returns true?

For Nullables, == currently throws NullException regardless of whether the compared values are missing or present. How does that fit into the proposed framework?

If y'all decide to try the identity_parts route, I've got a potential name for the method: haecceity (explanation here). Though I wouldn't want to offend any nominalists/radical empiricists out there. =p

That is an awesome vocab word!

Personally I would be against this sort of interface obscurity: "oh, if you want to define ==, you should _actually_ define this other non-obvious function instead."

One wouldn't be _required_ to define haecceity though - overloading == _and_ hash would still work just fine. The new function would just make it easier to create the overloads for most straightforward cases.

I think one of the reasons many users don't create a hash method for their types is that they don't really know how to. (Should I just hash the components I want to matter? How should I combine them? What if my hashing function is incorrect?) The fear of breaking something actually makes people break that exact thing. Being able to say to those users "hey, just use this function to tell us what parts of your object are important, and we'll figure out the rest for you" is awesome.

However, if == and hash are broken if they don't correspond, would ==(x, y) = ==(hash(x), hash(y)), with some specialized method for ints, be a terrible idea?

@JeffBezanson, the weird thing is that both equality and hashing are clearly based on extrapolations of the same question – "which pieces of this thing are essential?" Equality takes two objects and does a pairwise recursive equality check on their essential parts; hashing recursively hashes the essential parts of an object together with some futzing around with types to make sure that hashes don't collide. All of that stuff could be completely automated – telling the language what parts of a thing are important and getting == and isequal and hash for free seems like a good deal to me. As @tlycken points out, you don't _need_ to do it this way, it would just be available to you and quite convenient.
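Concretely, such a derivation might look like this (a hypothetical sketch in current syntax; identity_parts is not an actual Base function, and Point with its fields is made up):

# Hypothetical sketch: declare the essential parts once and derive the rest.
struct Point
    x::Int
    y::Int
    label::String   # incidental data, not part of the identity
end

identity_parts(p::Point) = (p.x, p.y)

Base.:(==)(a::Point, b::Point) = identity_parts(a) == identity_parts(b)
Base.isequal(a::Point, b::Point) = isequal(identity_parts(a), identity_parts(b))
Base.hash(p::Point, h::UInt) = hash(identity_parts(p), hash(typeof(p), h))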

I think that's pretty easy to do using

==(a::MyType, b::MyType) = (a.x, a.z) == (b.x, b.z)
hash(a::MyType, h::UInt) = hash((a.x, a.z), h)

So your type has hash collisions with tuples of its field values?

I can't find anything in the manual either about exactly _what_ a best-practice implementation of == and hash should look like for a custom composite type. The closest I get is the paragraph above this headline (sorry for the weird linking, but it was very far to scroll from the next headline above... click the link and scroll up a little), but the only thing mentioned there is that

For other [in context: non-numerical types] types, isequal() defaults to calling ==(), so if you want to define equality for your own types then you only need to add a ==() method. If you define your own equality function, you should probably define a corresponding hash() method to ensure that isequal(x,y) implies hash(x) == hash(y).

Regardless of whether or not it's "really simple" to implement both ==() and hash(), at the very least the correct way to do it needs to be better documented.

And I still feel that providing a shortcut for users that want one is a really nice feature (one that I've missed in many other settings, including Java and C#).

My point is that the proposed mechanism is not significantly simpler, and just adds more stuff you have to know about. We could recommend using hash((a.x, a.z), hash(typeof(a), h)).
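Spelled out in full, that recommendation might look like this (a sketch in current syntax, with made-up fields x and z; mixing in typeof(a) avoids the tuple-collision problem raised above):

struct MyType
    x::Int
    z::String
end

Base.:(==)(a::MyType, b::MyType) = (a.x, a.z) == (b.x, b.z)
# Mixing the type into the seed keeps MyType from hashing identically
# to the bare tuple (a.x, a.z).
Base.hash(a::MyType, h::UInt) = hash((a.x, a.z), hash(typeof(a), h))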

The get_parts interface has a performance pitfall, in that it's too easy to end up copying a whole collection: parts(x::MyType) = (x...,).

I can't find anything in the manual either about exactly how a best-practice implementation of == and hash should look like for a custom composite type.

I agree: https://github.com/JuliaLang/julia/pull/11794#issuecomment-113861001

@mbauman Very nice indeed! I had not seen that PR before, but it makes me very happy :)

Ok, so back to the issue at hand. It seems like we should at least remove the default object-identity hashing for mutables. Should we just leave immutables alone?

I started experimenting with this, and surprisingly found quite a few places in Base that depend (correctly) on object_id hashing for mutables: Type, Method, LambdaStaticData, Function, Module.
So I'm afraid this is a very breaking change. The current approach "just works" for the fairly common case where you're associating data with a particular object. This trades off against the problem in this issue. Our policy has often been that if you're defining a new type, making it work is on you, and all we can really offer here is a no-method error, which might not be worth the breakage this causes.

If we still want to keep the default object id hash method, raising an error when there's a custom-defined == should be fine then?

The current approach "just works" for the fairly common case where you're associating data with a particular object.

Shouldn't we be using ObjectIdDicts for that?

Not necessarily:

  1. Sometimes people like extra type annotations, and use something like Dict{Function,Any} (this occurred in the doc system). Sometimes people only know about Dict.
  2. In one case we need a WeakKeyObjectIdDict. Could be added of course.
  3. Type inference hashes tuples (Method, Type, env). This could maybe get by with an ObjectIdDict, but in general you might have combinations of identity-hashed data and other data.

Perhaps only WeakKeyObjectIdDict should exist? It seems like any place one is currently using a WeakKeyDict it's quite likely that the hashing should be based on object identity.

Yes, that does make some amount of sense.

Working on this has exposed some interesting issues and missing hash methods, which is good. I might need some help on this one:

exception on 2: ERROR: LoadError: test error in expression: sanity_tst(deps_data)
MethodError: `hash` has no method matching hash(::Base.Pkg.Types.Available, ::UInt64)
Closest candidates are:
  hash(::ANY, ::UInt64)
  hash(!Matched::Tuple{}, ::UInt64)
  hash(!Matched::Tuple{Any}, ::UInt64)
  ...
 in hash at ./hashing.jl:5
 in ht_keyindex at ./dict.jl:540
 in unique at ./set.jl:110
 in prune_versions at ./pkg/query.jl:174
 in sanity_check at pkg/resolve.jl:56
 in __sanity_tst#21__ at /home/jeff/src/julia/test/resolve.jl:93
 in sanity_tst at /home/jeff/src/julia/test/resolve.jl:100
 in anonymous at test.jl:92
 in do_test at test.jl:50

It looks like a few Pkg types (e.g. Available) define == but not hash. Does id hashing work for them?

Yikes, not sure, will check it out...

WeakKeyObjectIdDict was implemented in #7348, in case someone needs inspiration...

@JeffBezanson, any other things in Base that didn't have consistent == and hash methods?

This is a disturbing example: https://groups.google.com/forum/#!topic/julia-users/2e-vjq-5mhI from the perspective of hashing behavior.

Again: https://groups.google.com/forum/#!topic/julia-users/3NynLE00wIY. I really think we should delete the fallback hashing behavior for 0.5 and take the breakage now. There can be a method for "standard hashing" that people can define if they want to opt into this behavior, but it would be _much_ better if hashing things just broke in these cases. Or maybe the default hashing function can check if the type has custom equality defined and raise an error if it does.

Ok, let's delete it. Scrolling up it looks like we were working on it and just got sidetracked.

Resolution: remove the default hash method and make sure the API for defining a new hash function is nice and simple to use.

Update: started working on this in #24354. It revealed that there are many uses of Dict that should really be ObjectIdDict, but dropping the type parameters loses a lot of type information (often of value as documentation). Hence #25210 adds IdDict{K,V}, and is near ready to merge.

The primary issue is that you can no longer naively ask whether a random object is in a set or dictionary, which can be pretty annoying. E.g. rewriting x in [1, 2] to x in Set([1, 2]) is no longer safe and might cause an error. We have some functions like unique that internally rely on this.
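To make the hazard concrete, a hypothetical sketch (this assumes the objectid fallback has been removed; Baz is made up):

# Membership in an array only needs ==, but a Set needs hash, so the
# seemingly equivalent rewrite would start throwing MethodErrors.
struct Baz
    x::Int
end
Base.:(==)(a::Baz, b::Baz) = a.x == b.x

Baz(1) in [Baz(1), Baz(2)]         # fine: linear scan using ==
# Baz(1) in Set([Baz(1), Baz(2)])  # would fail: no hash method for Baz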

So I have another (slightly crazy) idea: instead of removing the hash fallback, make it return 0 instead. That way it's correct by default, just inefficient. Then defining hash methods becomes an optimization you do as needed.

I think that crazy idea may actually be quite brilliant.

This has the potential to turn the "unique [etc] does not work" questions into "unique [etc] is super slow" questions. However, then there is an obvious place for the afflicted to check: "Performance tips".

All in all, this seems a very Julian solution, +1.

Here's a different perspective on this (based on discussion with Jeff)... Using a hash table is only one way to implement dictionary and set data structures. Another perfectly valid and often just fine implementation is a linear array of values. We can consider that to be the default implementation, and providing a hash function for a type is a way to tell the system, "hey, this type is hashable, so use a fancy hash table implementation of dict and set instead of the standard linear array implementation" – and the system will magically do that for you. This works because in the case where hash is the constant zero function, a hash table degenerates to a linear array implementation.
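A quick demonstration of that degenerate case (a sketch in current syntax; Slow and its methods are made up):

# A constant hash keeps Dict, Set, and unique correct, just O(n) per
# lookup: every key lands in one bucket and ties are broken with isequal.
struct Slow
    x::Int
end
Base.:(==)(a::Slow, b::Slow) = a.x == b.x
Base.hash(::Slow, h::UInt) = h   # constant: ignore the value entirely

d = Dict(Slow(1) => "one", Slow(2) => "two")
@assert d[Slow(1)] == "one"                               # found via ==
@assert length(unique([Slow(4), Slow(4), Slow(4)])) == 1  # now correct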

Best-of-both-worlds idea from Stefan: check whether == for an object calls the default definition, and if so use object_id. Otherwise return a constant. That way we get the benefits of this idea without hurting performance of objects that work now. That depends on our ability to optimize such a check, but that can be added later and will only break code that somehow depends on the current hash function being wrong for that type (which is already broken).

Nice thing about the best-of-both-worlds solution is that it's strictly non-breaking: code that was calling hash for whatever reason continues to work – it gets a zero where it would have gotten an objectid value before, but that's fine; code that was actually putting objects into dicts or sets without defining a custom hash function will magically become unbroken, albeit potentially slower.
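A rough sketch of how that fallback might look (hypothetical; fallback_hash is a made-up name, and making the method check cheap is exactly the optimization question raised below; here which is used naively):

# Hypothetical sketch of the best-of-both-worlds fallback.
function fallback_hash(x, h::UInt)
    m = which(==, Tuple{typeof(x), typeof(x)})
    if m.sig === Tuple{typeof(==), Any, Any}
        # The type still uses the default == (egal), so identity hashing is safe.
        return hash(objectid(x), h)
    else
        # Custom == but no custom hash: a constant hash is slow but correct.
        return h
    end
end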

As an added bonus, getting this feature:

That depends on our ability to optimize such a [method-exists] check

sounds super useful to define other fast-dispatching traits in terms of defined methods, yay!
