Basically, GetHashCode() but returning long instead. Enabled for 64-bit platforms only (?).
This would be useful to avoid hash code collisions with big collections and with collections that use long/ulong as the key.
It could co-exist alongside GetHashCode, or GetHashCode's return type could change to long when the targeted platform is 64-bit.
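A rough sketch of the shape being asked for, shown on a user type rather than on System.Object itself (GetLongHashCode is only the proposed/illustrative name, not an existing member):

public readonly struct BigKey {
    public readonly long Id;
    public BigKey(long id) => Id = id;

    public override int GetHashCode() => Id.GetHashCode(); // existing 32-bit contract

    // What the proposal would add at the System.Object level:
    public long GetLongHashCode() => Id;                    // full 64-bit hash, no narrowing
}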
I know this is a cross post but I think it's applicable here too.
Apart from being a pretty big change to the CLR (adding a new virtual member to System.Object; changing the definition of GetHashCode is definitely not going to happen) and the BCL, none of the existing collections could possibly work with this. I doubt (m)any of them support more than 2^32 items anyway, and I doubt that there's much call in the BCL to provide such large collection types, so they would likely push for such a thing to be provided via NuGet.
It'd be completely possible for you to implement such collections along with a version of IEqualityComparer that produced and used 64-bit hash codes.
public interface IBigEqualityComparer<T> : IEqualityComparer<T> {
    long GetLongHashCode(T obj);
}
public class BigDictionary<TKey, TValue> : IDictionary<TKey, TValue> {
public BigDictionary(IBigEqualityComparer<TKey> comparer) { ... }
}
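As a concrete illustration of that suggestion, here is a minimal sketch of a 64-bit comparer for strings; it builds on the hypothetical IBigEqualityComparer<T> above and uses 64-bit FNV-1a purely as an example of a wider hash:

using System;
using System.Collections.Generic;

public sealed class LongStringComparer : IBigEqualityComparer<string> {
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);
    public int GetHashCode(string obj) => obj.GetHashCode();

    // 64-bit FNV-1a, chosen here only as an example of a 64-bit hash function.
    public long GetLongHashCode(string obj) {
        ulong hash = 14695981039346656037UL;
        foreach (char c in obj) {
            hash ^= c;
            hash *= 1099511628211UL;
        }
        return unchecked((long)hash);
    }
}

A BigDictionary<TKey, TValue> constructed with this comparer could then use the 64-bit value for its buckets and fall back to Equals for exact matches, just as the existing Dictionary<TKey, TValue> does with 32-bit hashes.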
As far as I know, most collection types use int for the indexers, so even if you could have more than 2^32 items in them, you wouldn't be able to index into them properly. (This is the case for Array, ICollection, IList and many more.)
Whilst I can see the hash collision benefits of having a 64-bit hash instead of a 32-bit hash, there are custom implementations of GetHashCode out there that should get you a "good enough" hash code within the 32-bit restrictions.
With that said, most of the time (especially in the BCL) the hash code is used for allocating objects into things like buckets in hash tables, and rarely for proving uniqueness. A collision shouldn't be fatal in these cases.
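For example, on targets where System.HashCode is available, combining the fields is usually enough to get a well-distributed 32-bit hash (the Point type here is just an illustration):

using System;

public readonly struct Point : IEquatable<Point> {
    public readonly int X, Y;
    public Point(int x, int y) { X = x; Y = y; }

    public bool Equals(Point other) => X == other.X && Y == other.Y;
    public override bool Equals(object obj) => obj is Point p && Equals(p);

    // Mixes the fields into a reasonably well-distributed 32-bit hash.
    public override int GetHashCode() => HashCode.Combine(X, Y);
}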
"that should get you a "good enough" hash code within the 32-bit restrictions."
. With a "perfect" 32 bit hash functions you will typically (50% chance) run into collisions with less than 80k objects in a dictionary. even 64 bit hash functions have a 1% chance of collision at 600 million objects. (exact maths on collision probability can be found from https://preshing.com/20110504/hash-collision-probabilities/)
Like I mentioned in my comment, GetHashCode is not designed to be used as a test of uniqueness, but rather to be a quick way to allocate items into hash-based collections. In these instances, a hash collision doesn't necessarily impact the performance of the collection negatively.
With a "perfect" 32 bit hash functions you will typically (50% chance) run into collisions with less than 80k objects in a dictionary.
But a hash collision only decreases performance for the colliding values. If 2 out of 80k values are slightly slower, then that's almost certainly still fine. On the other hand, using 64-bit hash codes is probably going to negatively affect performance for all values.
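To illustrate that, here is a small sketch where a deliberately bad comparer collides every key; Dictionary<K,V> still returns correct results because Equals disambiguates, it just probes a longer chain for the colliding keys:

using System;
using System.Collections.Generic;

class CollidingComparer : IEqualityComparer<string> {
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);
    public int GetHashCode(string obj) => 0; // worst case: every key lands in the same bucket
}

class Demo {
    static void Main() {
        var dict = new Dictionary<string, int>(new CollidingComparer());
        dict["apple"] = 1;
        dict["banana"] = 2;
        Console.WriteLine(dict["banana"]); // 2 -- correctness is unaffected, only lookup cost grows
    }
}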
This means there are two questions that I would like to see answered:
Even if the object were producing a 64-bit hash code, most things that use a hash code then reduce it mod the bucket count; so it feels like all this would really do is make a 64-bit division happen instead of a 32-bit division to get back to the same smallish number of buckets, which is already an int. That's the same as taking the 64-bit hash code and dropping the top 32 bits in the first place (assuming, of course, that all bits are independent).
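A quick sketch of that reduction, assuming a power-of-two bucket count (tables with prime-sized bucket arrays mix the high bits in slightly differently, but the bucket index is still a small int either way):

using System;

class BucketReduction {
    static void Main() {
        // With a power-of-two bucket count, only the low bits of the hash survive the
        // reduction, so widening the hash to 64 bits changes nothing after this step.
        ulong hash64 = 0x1234_5678_9ABC_DEF0UL;  // example 64-bit hash value
        uint hash32 = unchecked((uint)hash64);   // same hash with the top 32 bits dropped
        uint bucketCount = 1u << 16;             // 65,536 buckets (example size)

        Console.WriteLine(hash64 % bucketCount); // 57072
        Console.WriteLine(hash32 % bucketCount); // 57072 -- same bucket either way
    }
}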
Seems like the "hash-based collections won't benefit from this" discussion has already taken place. Is there another scenario that might benefit from this API?
Another issue not mentioned above is that collections that store the hashcode (like Dictionary<K,V>) would grow significantly bigger if we stored a 64-bit hashcode. They could skip storing it if there were vanishingly few collisions, but as others pointed out above, assuming the existing hashes are well distributed, the collision rate is a property of the load factor, not of the size of the hashcode, until the collection approaches the size of the hashcode space.
A 64-bit hashcode would probably only be useful for a collection that stored far more entries than the existing ones we have -- such that it needs more than ~32 bits worth of buckets. Those specialized collections could have their own custom ways of getting hashcodes. Maybe you have to provide a hashcode factory.
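A rough sketch of the size impact (an illustrative entry layout, not the actual Dictionary<K,V> internals, which vary by runtime version):

// Per-entry bookkeeping a hash table typically keeps alongside the key/value.
struct Entry32<TKey, TValue> {
    public int HashCode;   // 4 bytes cached per entry today
    public int Next;       // index of the next entry in the collision chain
    public TKey Key;
    public TValue Value;
}

struct Entry64<TKey, TValue> {
    public long HashCode;  // 8 bytes: +4 bytes per entry, plus possible alignment padding
    public int Next;
    public TKey Key;
    public TValue Value;
}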
I think we can close this, as @HaloFour's suggestion is good.
It'd be completely possible for you to implement such collections along with a version of IEqualityComparer that produced and used 64-bit hash codes.
If you do create a set of reusable collections that can contain huge numbers of entries, perhaps you'd consider putting them on NuGet and sharing a link here? 😄