Runtime: Use more robust hash code combiner in ValueTuple

Created on 28 Nov 2016 · 29Comments · Source: dotnet/runtime

Use random seed to randomize the hashcodes
Use 64-bit state to combine the hashcodes (e.g. use the marvin32 hashcode combiner that is used for string hashing in coreclr)

Note: string.GetHashCode has the right implementation.

area-System.Runtime enhancement up-for-grabs

Source

jkotas

Most helpful comment

It's general dotnet org policy (all MS orgs / repos have the same policy). I think overall 2FA is useful -- people who switch computers often are IMO more an exception these days. (the only scenario I can imagine is students ...)

Each computer switch requires you to clone your repo locally - that takes time. Is the 2FA really that much time-consuming compared to cloning? I would guess it is just minor addition to the already-time-consuming process you have to go through. Or did I misunderstand your scenario/usage pattern?

karelz on 12 Dec 2016

👍2

All 29 comments

(forked from https://github.com/dotnet/corefx/issues/8034)

cc @VSadov @jcouv

jkotas on 28 Nov 2016

@jkotas How would we get the random seed if we were to do this? Just use Environment.TickCount or something similar?

jamesqo on 28 Nov 2016

For reference to everyone, I found a readable implementation of Marvin32 here:

https://github.com/floodyberry/Marvin32/blob/master/Marvin32.c

jamesqo on 28 Nov 2016

Environment.TickCount is not sufficient source of randomness. Guid.NewGuid() should be good enough.

jkotas on 28 Nov 2016

👍1

We will need this in 1.2 to avoid similar mistakes we did with string in the past.

karelz on 28 Nov 2016

@gafter pointed out a layering concern with using Guid.NewGuid(), as tuple is at a lower layer than guid. Presumably, we could refactor Guid to store its state in a tuple.
Would new object().GetHashCode() be a workable alternative?

jcouv on 28 Nov 2016

as tuple is at a lower layer than guid

Nope. ValueTuple is slightly higher implementation layer than Guid. Guid is in System.Runtime contract. ValueTuple implementation depends on System.Runtime, as well as System.Collections, System.Diagnostics.Debug and System.Resources.ResourceManager.

jkotas on 28 Nov 2016

I have concerns that randomizing the hash function too often would be breaking.

If we have, for example, different hash functions in different appdomains, a trivial serializing/deserializing of Dictionary<K,V> between appdomains will make the instance completely broken.

Basically having different hash functions within the same process seems bad.

I am not even sure about "same machine" case. There could be some distributed systems (caches or whatever), that might rely on stable hashing.
Considering that System.Object will not keep its hash in multiprocess environment it might be ok for ValueTuple too. Different hashing within same process would be strange though.

VSadov on 28 Nov 2016

a trivial serializing/deserializing of Dictionary between appdomains will make the instance completely broken

This is not true. The hashcodes between appdomains are often different today already. (And .NET Core does not have appdomains but that's a secondary point.)

There could be some distributed systems (caches or whatever), that might rely on stable hashing.

Such systems need to use their own stable hash functions. As mentioned in dotnet/runtime#17100, string hashcodes are randomized by default in .NET Core. It is done on purpose to mitigate attacks that exploit predictable hash collisions. This issue is about extending this mitigation to ValueTuple since it is likely going to be a popular type for hashtable keys.

jkotas on 28 Nov 2016

👍1

@jkotas, @VSadov If we decide to make these changes, would it be OK if I submitted a PR for this?

jamesqo on 28 Nov 2016

It's yours - I will reassign it to you once you accept the contributor invitation (limitation of GitHub when you are not the one who file d the issue :()

karelz on 29 Nov 2016

@karelz Sure, I accepted the invitation.

jamesqo on 29 Nov 2016

All yours now.

karelz on 29 Nov 2016

@VSadov suggested that marvin32 is overkill for combining just 2 or 3 hashcodes. I tend to agree.

jkotas on 29 Nov 2016

@jkotas Ok. If we do not use Marvin32 for tuples of arity 2 and 3 however, I do not think we should use it for the other tuple types. We should keep the hashing algorithm consistent. Are there any other good hashing algorithms that accept 4 bytes at a time?

Also since this is related to security, I should mention a couple of posts I read the other day which describe how it's possible to cause collisions in many popular non-crypto hashes: here and here. Pretty interesting/scary to read; they essentially 'invert' hashes so, when it is fed back into the algorithm, that same hash is produced.

jamesqo on 29 Nov 2016

Are there any other good hashing algorithms that accept 4 bytes at a time?

@jamesqo Hmm, I am not sure actually. All algorithms I can think of would do bit mixing just like Marvin32. You can argue whether it is ok to omit some of the barel-shifts, but there is really no good answer for it.

jkotas on 29 Nov 2016

@jkotas - the arguments for per-appdomain sound convincing.
We can always start randomizing at bigger granularity if there is too much apparent breakage.

VSadov on 30 Nov 2016

As for beefing up hash combiner as a security measure - I am not sure this is the most effective use of cycles.

If elements have bad hash functions in a first place (say they use just two values), no matter how much you rotate and mix them you will not increase the result range.
If elements have good hash functions - then combining via XOR is fast and efficient. The only problem with XOR is that it has obvious funnels (tuple with 2 equal elements would collapse into 0 hash and such)

I think the goal for the combiner is to keep as much entropy from the sources, not have trivial funnels, and be fast.

VSadov on 30 Nov 2016

@jamesqo Thanks for the help.
From last discussion with @jkotas and @VSadov, we're ok to move forward with randomization/seeding, but would rather hold back on adopting a more expensive hash function.
Doing the randomization keeps the door to changing the hash function in the future if the need arises.

jcouv on 30 Nov 2016

👍1

@jcouv OK, sounds like a good plan.

@jkotas If ValueTuple will continue to use the same hash, what is the plan for HashCode (#8034)? Do you still want to use Marvin32 there? Or can we take the same approach and stick with the current algorithm, but insert a random seed at the beginning?

jamesqo on 30 Nov 2016

@jcouv seed randomization per domain sounds good enough.
@jamesqo @jkotas the current algorithm has good mixing properties, it is very fast and easily inlineable by the JIT, if there is no obvious shortcoming we shouldnt change it.

EDIT: Wrong thread, I was refering to the one in dotnet/runtime#17100 ... ValueTuple could use that extra speed.

redknightlois on 30 Nov 2016

@karelz, sorry. Is there any reason accepting the contributor invitation to dotnet requires you to have 2-factor authentication enabled for your GitHub account? I have to take out my phone and enter a code every time I log in from a new computer, which is time-consuming because I hop between a lot of different computers.

jamesqo on 12 Dec 2016

@jamesqo without being contributor, GitHub won't let us assign work items to you unless you created them :(

karelz on 12 Dec 2016

@karelz I understand that. However, is there any reason that to be a contributor you have to enable 2-factor authentication for GitHub?

capture

jamesqo on 12 Dec 2016

karelz on 12 Dec 2016

👍2

The TFA seems to last for a while. At least on my work computer I can't remember the last time I had to authenticate again.

danmosemsft on 16 Dec 2016

@danmosemsft JamesKo is hopping between computers: "because I hop between a lot of different computers."

karelz on 17 Dec 2016

I think more people need to be introduced to 2FA and see how easy and indispensable it really is.

jnm2 on 28 Dec 2016

@karelz I have logged into GitHub on many of the computers at my library by now, so while 2FA is still an inconvenience it has been mitigated for me. Regardless, however, this issue can be closed since it was fixed in dotnet/corefx#14141 (@jkotas you're the owner).

jamesqo on 28 Dec 2016

Was this page helpful?

0 / 5 - 0 ratings