Basically, GetHashCode() but returning long instead. Enabled for 64-bit platforms only (?).
This would be useful to avoid hash code collisions with big collections and with collections that use long/ulong as the key.
It could co-exist alongside GetHashCode, or GetHashCode's return type could change to long when the targeted platform is 64-bit.
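A rough sketch of the shape being asked for, shown on a user type rather than on System.Object itself (GetLongHashCode is only the proposed/illustrative name, not an existing member):

public readonly struct BigKey {
    public readonly long Id;
    public BigKey(long id) => Id = id;

    public override int GetHashCode() => Id.GetHashCode(); // existing 32-bit contract

    // What the proposal would add at the System.Object level:
    public long GetLongHashCode() => Id;                    // full 64-bit hash, no narrowing
}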
I know this is a cross post but I think it's applicable here too.
Apart from being a pretty big change to the CLR (adding a new virtual member to System.Object; changing the definition of GetHashCode is definitely not going to happen) and the BCL, none of the existing collections could possibly work with this. I doubt (m)any of them support more than 2^32 items anyway, and I doubt that there's much call in the BCL to provide such large collection types, so they would likely push for such a thing to be provided via NuGet.
It'd be completely possible for you to implement such collections along with a version of IEqualityComparer that produced and used 64-bit hash codes.
public interface IBigEqualityComparer<T> : IEqualityComparer<T> {
    long GetLongHashCode(T obj);
}
public class BigDictionary<TKey, TValue> : IDictionary<TKey, TValue> {
public BigDictionary(IBigEqualityComparer<TKey> comparer) { ... }
}
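As a concrete illustration of that suggestion, here is a minimal sketch of a 64-bit comparer for strings; it builds on the hypothetical IBigEqualityComparer<T> above and uses 64-bit FNV-1a purely as an example of a wider hash:

using System;
using System.Collections.Generic;

public sealed class LongStringComparer : IBigEqualityComparer<string> {
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);
    public int GetHashCode(string obj) => obj.GetHashCode();

    // 64-bit FNV-1a, chosen here only as an example of a 64-bit hash function.
    public long GetLongHashCode(string obj) {
        ulong hash = 14695981039346656037UL;
        foreach (char c in obj) {
            hash ^= c;
            hash *= 1099511628211UL;
        }
        return unchecked((long)hash);
    }
}

A BigDictionary<TKey, TValue> constructed with this comparer could then use the 64-bit value for its buckets and fall back to Equals for exact matches, just as the existing Dictionary<TKey, TValue> does with 32-bit hashes.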
As far as I know, most collection types use int for the indexers, so even if you could have more than 2^32 items in them, you wouldn't be able to index into them properly. (This is the case for Array, ICollection, IList and many more.)
Whilst I can see the hash collision benefits of having a 64-bit hash instead of a 32-bit hash, there are custom implementations of GetHashCode out there that should get you a "good enough" hash code within the 32-bit restrictions.
With that said, most of the time (especially in the BCL) the hash code is used for allocating objects into things like buckets in hash tables, and rarely for proving uniqueness. A collision shouldn't be fatal in these cases.
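For example, on targets where System.HashCode is available, combining the fields is usually enough to get a well-distributed 32-bit hash (the Point type here is just an illustration):

using System;

public readonly struct Point : IEquatable<Point> {
    public readonly int X, Y;
    public Point(int x, int y) { X = x; Y = y; }

    public bool Equals(Point other) => X == other.X && Y == other.Y;
    public override bool Equals(object obj) => obj is Point p && Equals(p);

    // Mixes the fields into a reasonably well-distributed 32-bit hash.
    public override int GetHashCode() => HashCode.Combine(X, Y);
}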
"that should get you a "good enough" hash code within the 32-bit restrictions."
. With a "perfect" 32 bit hash functions you will typically (50% chance) run into collisions with less than 80k objects in a dictionary. even 64 bit hash functions have a 1% chance of collision at 600 million objects. (exact maths on collision probability can be found from https://preshing.com/20110504/hash-collision-probabilities/)
Like I mentioned in my comment, GetHashCode is not designed to be used as a test of uniqueness, but rather to be a quick way to allocate items into hash-based collections. In these instances, a hash collision doesn't necessarily impact the performance of the collection negatively.
With a "perfect" 32 bit hash functions you will typically (50% chance) run into collisions with less than 80k objects in a dictionary.
But a hash collision only decreases performance for the colliding values. If 2 out of 80k values are slightly slower, then that's almost certainly still fine. On the other hand, using 64-bit hash codes is probably going to negatively affect performance for all values.
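To illustrate that, here is a small sketch where a deliberately bad comparer collides every key; Dictionary<K,V> still returns correct results because Equals disambiguates, it just probes a longer chain for the colliding keys:

using System;
using System.Collections.Generic;

class CollidingComparer : IEqualityComparer<string> {
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);
    public int GetHashCode(string obj) => 0; // worst case: every key lands in the same bucket
}

class Demo {
    static void Main() {
        var dict = new Dictionary<string, int>(new CollidingComparer());
        dict["apple"] = 1;
        dict["banana"] = 2;
        Console.WriteLine(dict["banana"]); // 2 -- correctness is unaffected, only lookup cost grows
    }
}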
This means there are two questions that I would like to see answered:
Even if the object were producing a 64-bit hash code, most things that use a hash code then reduce it mod the bucket count; so it feels like all this would really do is make a 64-bit division happen instead of a 32-bit division to get back to the same smallish number of buckets, which is already an int. That's the same as taking the 64-bit hash code and dropping the top 32 bits in the first place (assuming, of course, that all bits are independent).
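A quick sketch of that reduction, assuming a power-of-two bucket count (tables with prime-sized bucket arrays mix the high bits in slightly differently, but the bucket index is still a small int either way):

using System;

class BucketReduction {
    static void Main() {
        // With a power-of-two bucket count, only the low bits of the hash survive the
        // reduction, so widening the hash to 64 bits changes nothing after this step.
        ulong hash64 = 0x1234_5678_9ABC_DEF0UL;  // example 64-bit hash value
        uint hash32 = unchecked((uint)hash64);   // same hash with the top 32 bits dropped
        uint bucketCount = 1u << 16;             // 65,536 buckets (example size)

        Console.WriteLine(hash64 % bucketCount); // 57072
        Console.WriteLine(hash32 % bucketCount); // 57072 -- same bucket either way
    }
}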
Seems like the "hash-based collections won't benefit from this" discussion has already taken place. Is there another scenario that might benefit from this API?
Another issue not mentioned above is that collections that store the hashcode (like Dictionary<K,V>) would grow significantly bigger if we stored a 64-bit hashcode. They could skip storing it if there were vanishingly few collisions, but as others pointed out above, assuming the existing hashes are well distributed, the collision rate is a property of the load factor, not of the size of the hashcode, until the collection approaches the size of the hashcode space.
A 64-bit hashcode would probably only be useful for a collection that stored far more entries than the existing ones we have -- such that it needs more than ~32 bits worth of buckets. Those specialized collections could have their own custom ways of getting hashcodes. Maybe you have to provide a hashcode factory.
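A rough sketch of the size impact (an illustrative entry layout, not the actual Dictionary<K,V> internals, which vary by runtime version):

// Per-entry bookkeeping a hash table typically keeps alongside the key/value.
struct Entry32<TKey, TValue> {
    public int HashCode;   // 4 bytes cached per entry today
    public int Next;       // index of the next entry in the collision chain
    public TKey Key;
    public TValue Value;
}

struct Entry64<TKey, TValue> {
    public long HashCode;  // 8 bytes: +4 bytes per entry, plus possible alignment padding
    public int Next;
    public TKey Key;
    public TValue Value;
}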
I think we can close this, as @HaloFour's suggestion is good.
It'd be completely possible for you to implement such collections along with a version of IEqualityComparer that produced and used 64-bit hash codes.
If you do create a set of reusable collections that can contain huge numbers of entries, perhaps you'd consider putting them on NuGet and sharing a link here? 😄