Runtime: Use more robust hash code combiner in ValueTuple

Created on 28 Nov 2016  路  29Comments  路  Source: dotnet/runtime

  • Use random seed to randomize the hashcodes
  • Use 64-bit state to combine the hashcodes (e.g. use the marvin32 hashcode combiner that is used for string hashing in coreclr)

Note: string.GetHashCode has the right implementation.

area-System.Runtime enhancement up-for-grabs

Most helpful comment

It's general dotnet org policy (all MS orgs / repos have the same policy). I think overall 2FA is useful -- people who switch computers often are IMO more an exception these days. (the only scenario I can imagine is students ...)

Each computer switch requires you to clone your repo locally - that takes time. Is the 2FA really that much time-consuming compared to cloning? I would guess it is just minor addition to the already-time-consuming process you have to go through. Or did I misunderstand your scenario/usage pattern?

All 29 comments

(forked from https://github.com/dotnet/corefx/issues/8034)

cc @VSadov @jcouv

@jkotas How would we get the random seed if we were to do this? Just use Environment.TickCount or something similar?

For reference to everyone, I found a readable implementation of Marvin32 here:

https://github.com/floodyberry/Marvin32/blob/master/Marvin32.c

Environment.TickCount is not sufficient source of randomness. Guid.NewGuid() should be good enough.

We will need this in 1.2 to avoid similar mistakes we did with string in the past.

@gafter pointed out a layering concern with using Guid.NewGuid(), as tuple is at a lower layer than guid. Presumably, we could refactor Guid to store its state in a tuple.
Would new object().GetHashCode() be a workable alternative?

as tuple is at a lower layer than guid

Nope. ValueTuple is slightly higher implementation layer than Guid. Guid is in System.Runtime contract. ValueTuple implementation depends on System.Runtime, as well as System.Collections, System.Diagnostics.Debug and System.Resources.ResourceManager.

I have concerns that randomizing the hash function too often would be breaking.

If we have, for example, different hash functions in different appdomains, a trivial serializing/deserializing of Dictionary<K,V> between appdomains will make the instance completely broken.

Basically having different hash functions within the same process seems bad.

I am not even sure about "same machine" case. There could be some distributed systems (caches or whatever), that might rely on stable hashing.
Considering that System.Object will not keep its hash in multiprocess environment it might be ok for ValueTuple too. Different hashing within same process would be strange though.

a trivial serializing/deserializing of Dictionary between appdomains will make the instance completely broken

This is not true. The hashcodes between appdomains are often different today already. (And .NET Core does not have appdomains but that's a secondary point.)

There could be some distributed systems (caches or whatever), that might rely on stable hashing.

Such systems need to use their own stable hash functions. As mentioned in dotnet/runtime#17100, string hashcodes are randomized by default in .NET Core. It is done on purpose to mitigate attacks that exploit predictable hash collisions. This issue is about extending this mitigation to ValueTuple since it is likely going to be a popular type for hashtable keys.

@jkotas, @VSadov If we decide to make these changes, would it be OK if I submitted a PR for this?

It's yours - I will reassign it to you once you accept the contributor invitation (limitation of GitHub when you are not the one who file d the issue :()

@karelz Sure, I accepted the invitation.

All yours now.

@VSadov suggested that marvin32 is overkill for combining just 2 or 3 hashcodes. I tend to agree.

@jkotas Ok. If we do not use Marvin32 for tuples of arity 2 and 3 however, I do not think we should use it for the other tuple types. We should keep the hashing algorithm consistent. Are there any other good hashing algorithms that accept 4 bytes at a time?

Also since this is related to security, I should mention a couple of posts I read the other day which describe how it's possible to cause collisions in many popular non-crypto hashes: here and here. Pretty interesting/scary to read; they essentially 'invert' hashes so, when it is fed back into the algorithm, that same hash is produced.

Are there any other good hashing algorithms that accept 4 bytes at a time?

@jamesqo Hmm, I am not sure actually. All algorithms I can think of would do bit mixing just like Marvin32. You can argue whether it is ok to omit some of the barel-shifts, but there is really no good answer for it.

@jkotas - the arguments for per-appdomain sound convincing.
We can always start randomizing at bigger granularity if there is too much apparent breakage.

As for beefing up hash combiner as a security measure - I am not sure this is the most effective use of cycles.

If elements have bad hash functions in a first place (say they use just two values), no matter how much you rotate and mix them you will not increase the result range.
If elements have good hash functions - then combining via XOR is fast and efficient. The only problem with XOR is that it has obvious funnels (tuple with 2 equal elements would collapse into 0 hash and such)

I think the goal for the combiner is to keep as much entropy from the sources, not have trivial funnels, and be fast.

@jamesqo Thanks for the help.
From last discussion with @jkotas and @VSadov, we're ok to move forward with randomization/seeding, but would rather hold back on adopting a more expensive hash function.
Doing the randomization keeps the door to changing the hash function in the future if the need arises.

@jcouv OK, sounds like a good plan.

@jkotas If ValueTuple will continue to use the same hash, what is the plan for HashCode (#8034)? Do you still want to use Marvin32 there? Or can we take the same approach and stick with the current algorithm, but insert a random seed at the beginning?

@jcouv seed randomization per domain sounds good enough.
@jamesqo @jkotas the current algorithm has good mixing properties, it is very fast and easily inlineable by the JIT, if there is no obvious shortcoming we shouldnt change it.

EDIT: Wrong thread, I was refering to the one in dotnet/runtime#17100 ... ValueTuple could use that extra speed.

@karelz, sorry. Is there any reason accepting the contributor invitation to dotnet requires you to have 2-factor authentication enabled for your GitHub account? I have to take out my phone and enter a code every time I log in from a new computer, which is time-consuming because I hop between a lot of different computers.

@jamesqo without being contributor, GitHub won't let us assign work items to you unless you created them :(

@karelz I understand that. However, is there any reason that to be a contributor you have to enable 2-factor authentication for GitHub?

capture

It's general dotnet org policy (all MS orgs / repos have the same policy). I think overall 2FA is useful -- people who switch computers often are IMO more an exception these days. (the only scenario I can imagine is students ...)

Each computer switch requires you to clone your repo locally - that takes time. Is the 2FA really that much time-consuming compared to cloning? I would guess it is just minor addition to the already-time-consuming process you have to go through. Or did I misunderstand your scenario/usage pattern?

The TFA seems to last for a while. At least on my work computer I can't remember the last time I had to authenticate again.

@danmosemsft JamesKo is hopping between computers: "because I hop between a lot of different computers."

I think more people need to be introduced to 2FA and see how easy and indispensable it really is.

@karelz I have logged into GitHub on many of the computers at my library by now, so while 2FA is still an inconvenience it has been mitigated for me. Regardless, however, this issue can be closed since it was fixed in dotnet/corefx#14141 (@jkotas you're the owner).

Was this page helpful?
0 / 5 - 0 ratings

Related issues

omajid picture omajid  路  3Comments

aggieben picture aggieben  路  3Comments

GitAntoinee picture GitAntoinee  路  3Comments

jzabroski picture jzabroski  路  3Comments

chunseoklee picture chunseoklee  路  3Comments