Runtime: Proposal: Add System.HashCode to make it easier to generate good hash codes.

Created on 9 Dec 2016 · 182 comments · Source: dotnet/runtime

Update 6/16/17: Looking for volunteers

The API shape has been finalized. However, we're still deciding on the best hash algorithm out of a list of candidates to use for the implementation, and we need someone to help us measure the throughput/distribution of each algorithm. If you'd like to take that role up, please leave a comment below and @karelz will assign this issue to you.

Update 6/13/17: Proposal accepted!

Here's the API that was approved by @terrajobst at https://github.com/dotnet/corefx/issues/14354#issuecomment-308190321:

// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        [EditorBrowsable(Never)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}

The original text of this proposal follows.

Rationale

Generating a good hash code should not require the use of ugly magic constants and bit twiddling in our code. It should be less tempting to write a bad-but-concise GetHashCode implementation such as

class Person
{
    public override int GetHashCode() => FirstName.GetHashCode() + LastName.GetHashCode();
}
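
For comparison, here is roughly what the same type could look like with the proposed API (a sketch only; the property declarations are added here so the snippet is self-contained):

```c#
class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }

    // One call, no magic constants, order-sensitive and well distributed.
    public override int GetHashCode() => HashCode.Combine(FirstName, LastName);
}
```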

Proposal

We should add a HashCode type to encapsulate hash code creation and avoid forcing devs to get mixed up in the messy details. Here is my proposal, which is based on https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329, with a few minor revisions.

// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);
        public void AddRange<T>(T[] values);
        public void AddRange<T>(T[] values, int index, int count);
        public void AddRange<T>(T[] values, int index, int count, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}

Remarks

See @terrajobst's comment at https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329 for the goals of this API; all of his remarks are valid. I would like to point out these ones in particular, however:

  • The API does not need to produce a strong cryptographic hash
  • The API will provide "a" hash code, but not guarantee a particular hash code algorithm. This allows us to use a different algorithm later or use different algorithms on different architectures.
  • The API will guarantee that within a given process the same values will yield the same hash code. Different instances of the same app will likely produce different hash codes due to randomization. This allows us to ensure that consumers cannot persist hash values and accidentally rely on them being stable across runs (or worse, versions of the platform).
Labels: api-approved, area-System.Numerics, up-for-grabs

Most helpful comment

Decisions

  • We should remove all AddRange methods because the scenario is unclear. Arrays are somewhat unlikely to show up very often. And once larger arrays are involved, the question is whether the computation should be cached. Seeing the for loop on the calling side makes it clear that you need to think about that.
  • We also don't want to add IEnumerable overloads to AddRange because they would allocate.
  • We don't think we need the overload to Add that takes string and StringComparison. Yes, those are likely more efficient than calling through the IEqualityComparer, but we can fix this later.
  • We think marking GetHashCode as obsolete with error is a good idea, but we'd go a step further and also hide it from IntelliSense.

This leaves us with:

```C#
// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        [EditorBrowsable(Never)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}
```

All 182 comments

Proposal: add hash randomization support

public static HashCode Randomized<T> { get; } // or CreateRandomized<T>
or 
public static HashCode Randomized(Type type); // or CreateRandomized(Type type)

T or Type type is needed to get the same randomized hash for the same type.

Proposal: add support for collections

public HashCode Combine<T>(T[] values);
public HashCode Combine<T>(T[] values, IEqualityComparer<T> comparer);
public HashCode Combine<T>(Span<T> values);
public HashCode Combine<T>(Span<T> values, IEqualityComparer<T> comparer);
public HashCode Combine<T>(IEnumerable<T> values);
public HashCode Combine<T>(IEnumerable<T> values, IEqualityComparer<T> comparer);

I think there is no need for overloads like Combine(_field1, _field2, _field3, _field4, _field5), because code such as HashCode.Empty.Combine(_field1).Combine(_field2).Combine(_field3).Combine(_field4).Combine(_field5); should be inlined and optimized so that no actual Combine calls remain.
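
For context, a minimal sketch of the immutable, chainable shape that comment assumes (HashCode.Empty and the instance Combine method are the commenter's hypothetical members, not the approved API, and the mixing step below is just a placeholder rotate-add-xor):

```c#
public readonly struct HashCode
{
    private readonly int _value;
    private HashCode(int value) => _value = value;

    public static HashCode Empty => default(HashCode);

    // Returns a new value; the struct itself never mutates.
    public HashCode Combine<T>(T item)
    {
        int h = item == null ? 0 : item.GetHashCode();
        uint rol5 = ((uint)_value << 5) | ((uint)_value >> 27);
        return new HashCode(((int)rol5 + _value) ^ h);
    }

    public int ToHashCode() => _value;
}
```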

@AlexRadch

Proposal: add support for collections

Yes, that was part of my eventual plan for this proposal. I think it's important to focus on how we want the API to look like before we go about adding those methods, though.

He wanted to use a different algorithm, like the Marvin32 hash which is used for strings in coreclr. This would require expanding the size of HashCode to 8 bytes.

What about having Hash32 and Hash64 types that would internally store 4 or 8 bytes worth of data? Document the pros/cons of each. Hash64 being good for X, but being potentially slower. Hash32 being faster, but potentially not as distributed (or whatever the tradeoff actually is).

He wanted to randomize the hash seed, so hashes would not be deterministic.

This seems like useful behavior. But i could see people wanting to control this. So perhaps there should be two ways to create the Hash, one that takes no seed (and uses a random seed) and one that allows the seed to be provided.

Note: Roslyn would love if this could be provided in the Fx. We're adding a feature to spit out a GetHashCode for the user. Currently, it generates code like:

```c#
public override int GetHashCode()
{
    var hashCode = -1923861349;
    hashCode = hashCode * -1521134295 + this.b.GetHashCode();
    hashCode = hashCode * -1521134295 + this.i.GetHashCode();
    hashCode = hashCode * -1521134295 + EqualityComparer<string>.Default.GetHashCode(this.s);
    return hashCode;
}
```

This is not a great experience, and it exposes many ugly concepts. We would be thrilled to have a Hash.Whatever API that we could call through instead.

Thanks!

What about MurmurHash? It is reasonably fast and has very good hashing properties. There is also two different implementations, one that spits out 32-bit hashes and another that spits out 128-bit hashes.

There are also vectorized implementations for both the 32-bit and 128-bit formats.

@tannergooding MurmurHash is fast, but not secure, from the sounds of this blog post.

@jkotas, has there been any work in the JIT around generating better code for >4-byte structs on 32-bit since our discussions last year? Also, what do you think of @CyrusNajmabadi's proposal:

What about having Hash32 and Hash64 types that would internally store 4 or 8 bytes worth of data? Document the pros/cons of each. Hash64 being good for X, but being potentially slower. Hash32 being faster, but potentially not as distributed (or whatever the tradeoff actually is).

I still think this type would be very valuable to offer to developers and it would be great to have it in 2.0.

@jamesqo, I don't think this implementation needs to be cryptographically secure (that is the purpose of the explicit cryptographically hashing functions).

Also, that article applies to Murmur2. The issue has been resolved in the Murmur3 algorithm.

the JIT around generating better code for >4-byte structs on 32-bit since our discussions last year

I am not aware of any.

what do you think of @CyrusNajmabadi's proposal

The framework types should be simple choices that work well for 95%+ of cases. They may not be the fastest ones, but that's fine. Having to choose between Hash32 and Hash64 is not a simple choice.

That's fine with me. But can we at least have a good-enough solution for those 95% cases? Right now there's nothing... :-/

hashCode = hashCode * -1521134295 + EqualityComparer<string>.Default.GetHashCode(this.s);

@CyrusNajmabadi Why are you calling EqualityComparer here, and not just this.s.GetHashCode()?

For non-structs: so that we don't need to check for null.

This is close to what we generate for anonymous types behind the scenes as well. I optimize the case of known non-null values to generate code that would be more pleasing to users. But it would be nice to just have a built in API for this.

The call to EqualityComparer.Default.GetHashCode is like 10x+ more expensive than check for null... .

The call to EqualityComparer.Default.GetHashCode is like 10x+ more expensive than check for null..

Sounds like a problem. if only there were good hash code API we could call in the Fx that i could defer to :)

(also, we have that problem then in our anonymous types as that's what we generate there as well).

Not sure what we do for tuples, but i'm guessing it's similar.

Not sure what we do for tuples, but i'm guessing it's similar.

System.Tuple goes through EqualityComparer<Object>.Default for historic reasons. System.ValueTuple calls Object.GetHashCode with null check - https://github.com/dotnet/coreclr/blob/master/src/mscorlib/shared/System/ValueTuple.cs#L809.
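
In other words, the difference being pointed out is roughly this (an illustration, not the exact framework source):

```c#
using System.Collections.Generic;

object item = "example";

// System.Tuple-style: always dispatches through the default comparer.
int viaComparer = EqualityComparer<object>.Default.GetHashCode(item);

// System.ValueTuple-style: a plain null check, no comparer involved.
int viaNullCheck = item?.GetHashCode() ?? 0;
```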

Oh no. Looks like tuple can just use "HashHelpers". Could that be exposed so that users can get the same benefit?

Great. I'm happy to do something similar. I started from our anonymous types because i figured they were reasonable best practices. If not, that's fine. :)

But that's not why i'm here. I'm here to get some system that actually combines the hashes effectively. If/when that can be provided we'll gladly move to calling into that instead of hardcoding in random numbers and combining hash values ourselves.

What would be the API shape that you think would work best for the compiler generated code?

Literally any of the 32bit solutions that were presented earlier would be fine with me. Heck, 64bit solutions are fine with me. Just some sort of API that you can get that says "i can combine hashes in some sort of reasonable fashion and produce a reasonably distributed result".

I can't reconcile these statements:

We had an immutable HashCode struct that was 4 bytes in size. It had a Combine(int) method, which mixed in the provided hash code with its own hash code via a DJBX33X-like algorithm, and returned a new HashCode.

@jkotas did not think the DJBX33X-like algorithm was robust enough.

And

The framework types should be simple choices that work well for 95%+ of cases.

Can we not come up with a simple 32bit accumulating hash that works well enough for 95% of cases? What are the cases that aren't handled well here, and why do we think they're in the 95% case?

@jkotas, is performance really that critical for this type? I think on average things like hashtable lookups and this would take up way more time than a few struct copies. If it does turn out to be a bottleneck, would it be reasonable to ask the JIT team to optimize 32-bit struct copies after the API is released so they have some incentive, rather than blocking this API on that when nobody is working on optimizing copies?

Can we not come up with a simple 32bit accumulating hash that works well enough for 95% of cases?

We have been burnt really badly by the default 32-bit accumulating hash for strings, and that's why we use the Marvin hash for strings in .NET Core - https://github.com/dotnet/corert/blob/87e58839d6629b5f90777f886a2f52d7a99c076f/src/System.Private.CoreLib/src/System/Marvin.cs#L25. I do not think we want to repeat the same mistake here.

@jkotas, is performance really that critical for this type?

I do not think the performance is critical. Since it looks like this API is going to be used by auto-generated compiler code, I think we should prefer smaller generated code over how it looks. The non-fluent pattern is smaller code.

We have been burnt really badly by default 32bit accumulating hash for string

That doesn't seem like the 95% case. We're talking about normal developers just wanting a "good enough" hash for all those types where they manually do things today.

Since it looks like that this API is going to be used by auto-generated compiler code, I think we should be preferring smaller generated code over how it looks. The non-fluent pattern is smaller code.

This is not for use by the Roslyn compiler. This is for use by the Roslyn IDE when we help users generate GetHashCodes for their types. This is code that the user will see and have to maintain, and having something sensible like:

```c#
return Hash.Combine(this.A?.GetHashCode() ?? 0,
                    this.B?.GetHashCode() ?? 0,
                    this.C?.GetHashCode() ?? 0);
```

is a lot nicer than a user seeing and having to maintain:

```c#
var hashCode = -1923861349;
hashCode = hashCode * -1521134295 + this.b.GetHashCode();
hashCode = hashCode * -1521134295 + this.i.GetHashCode();
hashCode = hashCode * -1521134295 + EqualityComparer<string>.Default.GetHashCode(this.s);
return hashCode;
```

I mean, we already have this code in the Fx:

https://github.com/dotnet/roslyn/blob/master/src/Compilers/Test/Resources/Core/NetFX/ValueTuple/ValueTuple.cs#L5

We think it's good enough for tuples. It's unclear to me why it would be such a problem to make it available for users who want it for their own types.

Note: we've even considered doing this in roslyn:

```c#
return (this.A, this.B, this.C).GetHashCode();
```

But now you're forcing people to generate a (potentially large) struct just to get some sort of reasonable default hashing behavior.

We're talking about normal developers just wanting a "good enough" hash for all those types where they manually do things today.

The original string hash was a "good enough" hash that worked well for normal developers. But then it was discovered that ASP.NET webservers were vulnerable to DoS attacks because they tend to store received stuff in hashtables. So the "good enough" hash basically turned into a bad security issue.

We think it's good enough for tuples

Not necessarily. We made a backstop measure for tuples by making the hash code randomized, which gives us the option to modify the algorithm later.

     return Hash.Combine(this.A?.GetHashCode() ?? 0,
                         this.B?.GetHashCode() ?? 0,
                         this.C?.GetHashCode() ?? 0);

This looks reasonable to me.

I don't get your position. You seem to be saying two things:

The original string hash was a "good enough" hash that worked well for normal developers. But then it was discovered that ASP.NET webservers were vulnerable to DoS attacks because they tend to store received stuff in hashtables. So the "good enough" hash basically turned into a bad security issue.

Ok, if that's the case, then let's provide a hash code that's good for people who have security/DoS concerns.

The framework types should be simple choices that work well for 95%+ of cases.

Ok, if that's the case, then let's provide a hash code that's good enough for the 95% of cases. People who have security/DoS concerns can use the specialized forms that are documented for that purpose.

Not necessarily. We made a backstop measure for tuples by making the hash code randomized, which gives us the option to modify the algorithm later.

Ok. Can we expose that so that users can use that same mechanism.

--
I'm really struggling here because it sounds like we're saying "because we can't make a universal solution, everyone has to roll their own". That seems like one of the worst places to be in. Because certainly most of our customers aren't thinking about rolling their own 'marvin hash' for DoS concerns. They're just adding, xoring, or otherwise poorly combining field hashes into one final hash.

If we care about the 95% case, then we should just make a generally good enough hash. If we care about the 5% case, we can supply a specialized solution for that.

This looks reasonable to me.

Great :) Can we then expose:

```c#
namespace System.Numerics.Hashing
{
    internal static class HashHelpers
    {
        public static readonly int RandomSeed = new Random().Next(Int32.MinValue, Int32.MaxValue);

        public static int Combine(int h1, int h2)
        {
            // RyuJIT optimizes this to use the ROL instruction
            // Related GitHub pull request: dotnet/coreclr#1830
            uint rol5 = ((uint)h1 << 5) | ((uint)h1 >> 27);
            return ((int)rol5 + h1) ^ h2;
        }
    }
}
```

Roslyn could then generate:

```c#
return Hash.Combine(Hash.RandomSeed,
                    this.A?.GetHashCode() ?? 0,
                    this.B?.GetHashCode() ?? 0,
                    this.C?.GetHashCode() ?? 0);
```

This would have the benefit of really being "good enough" for the vast majority of cases, while also leading people down the good path of initializing with random values so they don't take dependencies on non-random hashes.

People who have security/DoS concerns can use the specialized forms that are documented for that purpose.

Every ASP.NET app has a security/DoS concern.

Great :) Can we then expose:

This is different from what I have said is reasonable.

What do you think about https://github.com/aspnet/Common/blob/dev/shared/Microsoft.Extensions.HashCodeCombiner.Sources/HashCodeCombiner.cs ? It is what is used in ASP.NET internally in a number of places today, and it is what I would be pretty happy with (except that the combining function needs to be stronger - an implementation detail that we can keep tweaking).

@jkotas I heard that :p

So the problem here is developers don't know when they're susceptible to DoS attacks, because it's not something they think about, which is why we switched strings to use Marvin32.

We should not head down the route of saying "95% of the cases don't matter", because we have no way to prove that, and we must err on the side of caution even when it has a performance cost. If you're going to move away from that then the hash code implementation needs Crypto Board review, not just us deciding "This looks good enough".

Every ASP.NET app has a security/DoS concern.

Ok. So how are you dealing with the issue today that no one has any help with hash codes, and thus is likely doing things poorly? Clearly it's been acceptable to have that state of the world. So what is harmed by providing a reasonable hashing system that likely performs better than what people are hand rolling today?

because we have no way to prove that, and we must err on the side of caution even when it has a performance cost

If you don't provide something, people will continue to just do things badly. The rejection of the "good enough" because there's nothing perfect just means the poor status quo we have today.

Every ASP.NET app has a security/DoS concern.

Can you explain this? As i understand it, you have a DoS concern if you're accepting arbitrary input and then storing it in some data structure that performs poorly if the inputs can be specially crafted. Ok, i get how that's a concern with the strings one gets in web scenarios that have come from the user.

So how does that apply to the remainder of types out there that are not being used in this scenario?

We have these sets of types:

  1. User types that need to be DoS safe. Right now we don't supply anything to help out, so we're already in a bad place as people are likely not doing the right thing.
  2. User types that don't need to be DoS safe. Right now we don't supply anything to help out, so we're already in a bad place as people are likely not doing the right thing.
  3. Framework types that need to be DoS safe. Right now we've made them DoS safe, but through APIs we don't expose.
  4. Framework types that don't need to be DoS safe. Right now we've given them hashes, but through APIs we don't expose.

Basically, we think these cases are important, but not important enough to actually provide a solution to users to handle '1' or '2'. Because we're worried a solution for '2' won't be good for '1' we won't even provide it in the first place. And if we're not willing to even provide a solution for '1' it feels like we're in an incredibly strange position. We're worried about DoSing and ASP, but not worried enough to actually help people. And because we won't help people with that, we're not even willing to help them with the non-DoS cases.

--

If these two cases are important (which i'm willing to accept) then why not just give two APIs? Document them. Make them clear what they're for. If people use them properly, great. If people don't use them properly that's still fine. After all, they're likely not doing things properly today anyways, so how are things any worse?

What do you think about

I have no opinion one way or the other. If it's an API that customers can use which performs acceptably and which provides a simple API with clear code on their end, then i think that's fine.

I think it would be nice to have a simple static form that handles the 99% case of wanting to combine a set of fields/properties in an ordered fashion. It seems like such a thing could be added to this type fairly simply.

I think it would be nice to have a simple static form

Agree.

I think it would be nice to have a simple static form that handles the 99% case of wanting to combine a set of fields/properties in an ordered fashion. It seems like such a thing could be added to this type fairly simply.

Agree.

I am willing to meet you both halfway on this one because I really want to see some sort of API come through. @jkotas I still do not understand why you're opposed to adding an immutable instance-based API; first you said it was because 32-bit copies would be slow, then because the mutable API would be more terse (which is not true; h.Combine(a).Combine(b) (immutable version) is shorter than h.Combine(a); h.Combine(b); (mutable version)).

That said, I'm willing to go back to:

public static class HashCode
{
    public static int Combine<T>(T value1, T value2);
    public static int Combine<T>(T value1, T value2, IEqualityComparer<T> comparer);
    public static int Combine<T>(T value1, T value2, T value3);
    public static int Combine<T>(T value1, T value2, T value3, IEqualityComparer<T> comparer);
    public static int Combine<T>(T value1, T value2, T value3, T value4);
    public static int Combine<T>(T value1, T value2, T value3, T value4, IEqualityComparer<T> comparer);
    // ... All the way until value8
}

Does this seem reasonable?

I can't edit my post right now, but I just realized not all methods can accept T. In that case, we can just have 8 overloads accepting all ints and force the user to call GetHashCode.

If these two cases are important (which i'm willing to accept) then why not just give two APIs? Document them. Make them clear what they're for. If people use them properly, great. If people don't use them properly that's still fine. After all, they're likely not doing things properly today anyways, so how are things any worse?

Because people don't use things properly when they're there. Let's take a simple example, XSS. From the beginning even web forms had the ability to HTML encode output. However developers didn't know the risk, didn't know how to do it properly, and only found out when it was too late, their app was published, and oops, now their auth cookie has been lifted.

Giving people a security choice assumes they

  1. Know about the problem.
  2. Understand what the risks are.
  3. Can evaluate those risks.
  4. Can easily discover the right thing to do.

Those assumptions don't generally hold for the majority of developers, they only find out about the problem when it's too late. Developers don't go to security conferences, they don't read white papers and they don't understand the solutions. So in the ASP.NET HashDoS scenario we made the choice for them, we protected them by default, because that was the right thing to do, and had the greatest impact. However we only applied it to strings, and that left people who were constructing custom classes from user input in a bad place. We should do the right thing, and help protect those customers now, and make it the default, having a pit of success, not failure. API design for security is sometimes not about choice, but about helping the user whether they know it or not.

A user can always create a non-security focused hash; so given the two options

  1. Default hash utility is non-security aware; user can create a security aware hash function
  2. Default hash utility is security aware; user can create a custom non-security aware hash function

Then the second is probably better; and what's suggested wouldn't have the perf impact of a full on crypto hash; so it makes a good compromise?

One of the running questions in these threads has been which algorithm is perfect for everybody. I think it's safe to say there isn't a single perfect algorithm. However, I don't think that should stop us from providing something better than code like what @CyrusNajmabadi has shown, which tends to have poor entropy for common .NET inputs as well as other common hasher bugs (like losing input data or being easily resettable).

I'd like to propose a couple of options to get around the "best algorithm" problem:

  1. Explicit Choices: I'm planning to send out an API proposal soonish for a suite of non-cryptographic hashes (perhaps xxHash, Marvin32, and SpookyHash for example). Such an API has slightly different usage than a HashCode or HashCodeHelper type, but for the sake of discussion, assume we can work out those differences. If we use that API for GetHashCode:

    • The generated code is explicit about what it's doing -- if Roslyn generates Marvin32.Create();, it lets power users know what it decided to do and they can easily change it to another algorithm in the suite if they like.

    • It means we don't have to worry about breaking changes. If we start with a non-randomizing/poor entropy/slow algorithm, we can simply update Roslyn to start generating something else in new code. Old code will keep using the old hash and new code will use the new hash. Developers (or a Roslyn code fix) can change the old code if they want to.

    • The biggest downside I can think of is that some of the optimizations we might want for GetHashCode could be detrimental for other algorithms. For example, while a 32-bit internal state works nicely with immutable structs, a 256-bit internal state in (say) CityHash might waste a bunch of time copying.

  2. Randomization: Start with a properly randomized algorithm (the code @CyrusNajmabadi showed with a random initial value doesn't count since it's likely possible to wash out the randomness). This ensures that we can change the implementation with no compatibility issues. We would still need to be very sensitive about performance changes if we change the algorithm. However that would also be a potential upside as we could make per-architecture (or even per-device) choices. For example, this site shows that xxHash is fastest on an x64 Mac while SpookyHash is fastest on Xbox and iPhone. If we do go down this route with an intent to change algorithms at some point, we may need to think about designing an API that still has reasonable performance if there is 64+ bit internal state.

CC @bartonjs, @terrajobst

@morganbr There isn't a single perfect algorithm, but I think that having some algorithm, which works fairly well most of the time, exposed using a simple, easy to understand API is the most useful thing that can be done. Having a suite of algorithms in addition to that, for advanced uses is fine. But it shouldn't be the only option, I shouldn't have to learn who Marvin is just so that I can put my objects into a Dictionary.

I shouldn't have to learn who Marvin is just so that I can put my objects into a Dictionary.

I like the way you put that. I also like that you mentioned Dictionary itself. IDictionary is something that can have tons of different impls with all sorts of differing qualities (see the collections APIs in many platforms). However, we still just provide a base 'Dictionary' that does a decent job overall, even though it may not excel in every category.

I think that's what a ton of people are looking for in a hashing library. Something that gets the job done, even if it is not perfect for every purpose.

@morganbr I think people simply want a way to write GetHashCode that is better than what they're doing today (usually some grab-bag combination of math operations they copied from something on the web). If you can just provide a basic impl of that that runs well, then people will be happy. You can then have a behind-the-scenes API for advanced users if they have a strong need for specific hashing functions.

In other words, people writing hashcodes today aren't going to know or care why they would want Spooky vs Marvin vs Murmur. Only someone who has a particular need for one of those specific hash codes would go looking. But lots of people have a need to say "here's the state of my object, provide me a way to produce a well distributed hash that is fast that i can then use with dictionaries, and which i guess prevents me from being DOSed if i happen to take untrusted input and hash it and store it".

@CyrusNajmabadi The problem is that if we extend our current notions of compatibility into the future we find that once this type ships it can't ever change (unless we find that the algorithm is horribly broken in an "it makes all applications attackable" manner).

One can argue that if it starts off in a stable-randomized manner, it becomes easy to change the implementation, since you couldn't depend on the value from run to run anyways. But if a couple of years later we find that there's an algorithm that provides as-good-if-not-better balancing of hash buckets with better-in-the-general case performance, but makes a structure involving a List\

Under Morgan's suggestion, the code that you write today will have effectively the same performance characteristics forever. For the applications which could have gotten better, this is unfortunate. For the applications which would have gotten worse, this is fantastic. But when we find the new algorithm we get it checked in, and we change Roslyn (and suggest a change to ReSharper/etc) to start generating things with NewAwesomeThing2019 instead of SomeThingThatWasConsideredAwesomeIn2018.

Anything super black box like this only ever gets to be done once. And then we're stuck with it forever. Then someone writes the next one, which has better average performance, so there are two black box implementations that you don't know why you'd choose between them. And then... and then....

So, sure, you may not know why Roslyn/ReSharper/etc auto-wrote GetHashCode for you using Marvin32, or Murmur, or FastHash, or a combination/conditional based on IntPtr.Size. But you have the power to look into it. And you have the power to change it on your types later, as new information is revealed... but we've also given you the power to keep it the same. (It'd be sad if we write this, and in 3 years Roslyn/ReSharper/etc are explicitly avoiding calling it, because the new algorithm is So Much Better... Usually).

@bartonjs What makes hashing different from all the places where .Net provides you with black box algorithm or data structure? For example, sorting (introsort), Dictionary (array-based separate chaining), StringBuilder (linked list of 8k chunks), most of LINQ.

We've taken a deeper look at this today. Apologies for the delay and the back and forth on this issue.

Requirements

  • Who is the API for?

    • The API does not need to produce a strong cryptographic hash

    • But: the API needs to be good enough so that we can use it in the framework itself (e.g. in the BCL and ASP.NET)

    • However, this doesn't mean that we have to use the API everywhere. It's OK if there are parts of the FX where we want to use a custom one either for security/DOS risks or because of performance. Exceptions will always exist.

  • What's the desired properties of this hash?

    • All bits in the input are used

    • The result is well distributed

    • The API will provide "a" hash code, but not guarantee a particular hash code algorithm. This allows us to use a different algorithm later or use different algorithms on different architectures.

    • The API will guarantee that within a given process the same values will yield the same hash code. Different instances of the same app will likely produce different hash codes due to randomization. This allows us to ensure that consumers cannot persist hash values and accidentally rely on them being stable across runs (or worse, versions of the platform).

API Shape

```C#
// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);
        public void Add<T>(T[] value);
        public void Add<T>(T[] value, int index, int length);
        public void Add(byte[] value);
        public void Add(byte[] value, int index, int length);
        public void Add(string value);
        public void Add(string value, StringComparison comparisonType);

        public int ToHashCode();
    }
}
```

Notes:

* We decided to not override `GetHashCode()` to produce the hash code as this would be weird, both naming-wise as well as from a behavioral standpoint (`GetHashCode()` should return the object's hash code, not the one being computed).
* We decided to use `Add` for the builder pattern and `Combine` for the static construction
* We decided not to provide a static initialization method. Instead, `Add` will do this on first use.
* The struct is mutable, which is unfortunate but we feel the best compromise between making `GetHashCode()` very cheap & not cause any allocations while allowing the structure to be bigger than 32-bit so that the hash code algorithm can use more bits during accumulation.
* `Combine` will just call `<value>.GetHashCode()`, so it has the behavior of the value's type `GetHashCode()` implementation
    - For strings that means different casing will produce different hash codes
    - For arrays, that means the hash code doesn't look at the contents but uses reference semantics for the hash code
    - If that behavior is undesired, the developer needs to use the builder-style approach
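
To illustrate the array note above, a sketch against the proposed surface:

```c#
using System;

int[] values = { 1, 2, 3 };

// Combine hashes the array reference (int[].GetHashCode()), so two arrays
// with identical contents produce different results.
int byReference = HashCode.Combine(values);

// The builder style hashes the contents, so equal contents give equal results.
var hash = new HashCode();
foreach (int value in values)
    hash.Add(value);
int byContents = hash.ToHashCode();
```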

Usage

The simple case is when someone just wants to produce a good hash code for a given type, like so:

```C#
public class Customer
{
    public int Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }

    public override int GetHashCode() => HashCode.Combine(Id, FirstName, LastName);
}
```

The more complicated case is when the developer needs to tweak how the hash is being computed. The idea is that the call site passes the desired hash rather than the object/value, like so:

```C#
public partial class Customer
{
    public override int GetHashCode() =>
        HashCode.Combine(
            Id,
            StringComparer.OrdinalIgnoreCase.GetHashCode(FirstName),
            StringComparer.OrdinalIgnoreCase.GetHashCode(LastName));
}
```

And lastly, if the developer needs more flexibility, such as producing a hash code for more than eight values, we also provide a builder-style approach:

```C#
public partial class Customer
{
    public override int GetHashCode()
    {
        var hashCode = new HashCode();
        hashCode.Add(Id);
        hashCode.Add(FirstName, StringComparison.OrdinalIgnoreCase);
        hashCode.Add(LastName, StringComparison.OrdinalIgnoreCase);
        return hashCode.ToHashCode();
    }
}
```

Next Steps

This issue will remain up for grabs. In order to implement the API we need to decide which algorithm to use.

@morganbr will make a proposal for good candidates. Generally speaking, we don't want to write a hashing algorithm from scratch -- we want to use a well-known one whose properties are well-understood.

However, we should measure the implementation for typical .NET workloads and see which algorithm produces good results (throughput and distribution). It's likely that the answers will differ by CPU architecture, so we should consider this when measuring.

@jamesqo, are you still interested in working on this area? In that case, please update the proposal accordingly.

@terrajobst , we might also want public static int Combine<T1>(T1 value);. I know it looks a little funny, but it would provide a way of diffusing bits from something with a limited input hash space. For example, many enums only have a few possible hashes, only using the bottom few bits of the code. Some collections are built on the assumption that hashes are spread over a larger space, so diffusing the bits may help the collection work more efficiently.
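
For illustration (not part of the proposal text), the diffusion point looks like this, assuming the single-argument overload runs the value's hash through the full mix/finalize steps:

```c#
using System;

internal enum Fruit { Apple, Banana, Cherry }

internal static class EnumDiffusionExample
{
    private static void Main()
    {
        // Fruit.GetHashCode() just returns the underlying value (0, 1, 2, ...),
        // so only the bottom few bits ever vary.
        int raw = Fruit.Cherry.GetHashCode();        // 2

        // Combine spreads that entropy across all 32 bits of the result.
        int diffused = HashCode.Combine(Fruit.Cherry);

        Console.WriteLine($"{raw} -> 0x{diffused:X8}");
    }
}
```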

public void Add(string value, StringComparison comparison);

Nit: The StringComparison parameter should be named comparisonType to match the naming used everywhere else StringComparison is used as a parameter.

The criteria that would help us choose algorithms would be:

  1. Does the algorithm have a good avalanche effect? That is, does every bit of input have a 50% chance of flipping every bit of output? This site has a study of several popular algorithms. (A rough C# sanity check for this is sketched just after this list.)
  2. Is the algorithm fast for small inputs? Since HashCode.Combine will generally handle 8 or fewer ints, startup time may be more important than throughput. This site has an interesting set of data to start with. This is also where we may need different answers for different architectures or other pivots (OS, AoT vs JIT, etc).
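
Here is the rough sanity check mentioned in criterion 1 (a sketch only; the SMHasher suite linked above is the real measure). The `hash` delegate stands in for whatever candidate is being evaluated:

```c#
using System;

internal static class AvalancheCheck
{
    // Counts, for each input bit flipped, how often each output bit flips.
    // A good avalanche effect means every cell ends up near 0.5.
    public static double[,] Measure(Func<int, int> hash, int samples = 100_000)
    {
        var rng = new Random(42);
        var flips = new double[32, 32];

        for (int s = 0; s < samples; s++)
        {
            int input = rng.Next(int.MinValue, int.MaxValue);
            int baseline = hash(input);

            for (int bit = 0; bit < 32; bit++)
            {
                int changed = hash(input ^ (1 << bit)) ^ baseline;
                for (int outBit = 0; outBit < 32; outBit++)
                    if ((changed & (1 << outBit)) != 0)
                        flips[bit, outBit]++;
            }
        }

        for (int i = 0; i < 32; i++)
            for (int j = 0; j < 32; j++)
                flips[i, j] /= samples; // probability that flipping input bit i flips output bit j
        return flips;
    }
}
```

For example, `AvalancheCheck.Measure(x => HashCode.Combine(x))` would show how well a single-argument Combine diffuses its input; cells far from 0.5 indicate weak mixing.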

What we'd really like to see is performance numbers for candidates written in C# so that we can be reasonably confident their characteristics will hold up for .NET. If you write a candidate and we don't pick it for this, that will still be useful work whenever I actually get the API proposal together for the non-cryptographic hash API.

Here are some candidates I think are worth evaluating (but feel free to propose others):

  • Marvin32 (we already have a C# implementation here). We know it's fast enough for String.GetHashCode and we believe it's HashDoS resistant
  • xxHash32 (Fastest algorithm on x86 here that has top quality according to SMHasher)
  • FarmHash (Fastest on x64 here. I haven't found a good indicator of quality for it. This one might be hard to write in C# though)
  • xxHash64 (truncated to 32 bits) (This isn't a clear speed winner, but might be easy to do if we already have xxHash32)
  • SpookyHash (Tends to do well on larger data sets)

Shame the Add methods can't have a return type of ref HashCode and return ref this so they can be used in a fluent manner.

Would readonly ref returns allow this? /cc @jaredpar @VSadov

WARNING: If anyone picks a hash implementation from an existing code base somewhere on the internet, please keep the link to the source and check the license (we will have to do it as well).

If the license is not compatible, we may need to write the algorithm from scratch.

IMO, using the Add methods should be exceedingly uncommon. It will be for very advanced scenarios, and the need to be able to be 'fluent' will not really be there.

For the common use cases for 99% of all user code cases, one should be able to just use => HashCode.Combine(...) and be fine.

@morganbr

we might also want public static int Combine<T1>(T1 value);. I know it looks a little funny, but it would provide a way of diffusing bits from something with a limited input hash space

Makes sense. I've added it.

@justinvp

Nit: The StringComparison parameter should be named comparisonType to match the naming used everywhere else StringComparison is used as a parameter.

Fixed.

@CyrusNajmabadi

IMO, using the Add methods should be exceedingly uncommon. It will be for very advanced scenarios, and the need to be able to be 'fluent' will not really be there.

Agreed.

@benaadams - re: ref returning this from Add - no, this cannot be returned by ref in struct methods since it can be an rValue or a temp.

```C#
ref var r = (new T()).ReturnsRefThis();

// r refers to some variable here. Which one? What is the scope/lifetime?
r = SomethingElse();
```

In case it's useful for comparison purposes, some years ago I ported the Jenkins lookup3 hash function (C source) to C# here.

I'm wondering about collections:

@terrajobst

```c#
public void Add<T>(T[] value);
```

Why is there an overload for arrays, but not one for general collections (i.e. IEnumerable<T>)?

Also, isn't it going to be confusing that HashCode.Combine(array) and hashCode.Add((object)array) behave one way (use reference equality) and hashCode.Add(array) behaves another way (combines hash codes of the values in the array)?

@CyrusNajmabadi

For the common use cases for 99% of all user code cases, one should be able to just use => HashCode.Combine(...) and be fine.

If the aim is really to be able to use Combine in 99 % of use cases (and not, say, 80 %), then shouldn't Combine somehow support hashing collections based on the values in the collection? Maybe there should be a separate method that does that (either an extension method or static method on HashCode)?

If Add is a power scenario, should we assume the user should choose between Object.GetHashCode and combining individual elements of collections? If it would help, we could consider renaming the array (and potential IEnumerable) versions. Something like:
```c#
public void AddEnumerableHashes<T>(IEnumerable<T> enumerable);
public void AddEnumerableHashes<T>(T[] array);
public void AddEnumerableHashes<T>(T[] array, int index, int length);
```
I wonder whether we'd also need overloads with IEqualityComparers.

Proposal: Make the builder struct implement IEnumerable to support collection initializer syntax:

```c#
return new HashCode
{
    SomeField,
    OtherField,
    { SomeString, StringComparer.UTF8 },
    { SomeHashSet, HashSet<int>.CreateSetComparer() }
}.GetHashCode();
```

This is much more elegant than calling Add() by hand (in particular, you don't need a temporary variable), and still has no allocations.


@SLaks Maybe that nicer syntax could wait for https://github.com/dotnet/csharplang/issues/455 (assuming that proposal had support), so that HashCode wouldn't have to implement bogus IEnumerable?

We decided to not override GetHashCode() to produce the hash code as this would be weird, both naming-wise as well as from a behavioral standpoint (GetHashCode() should return the object's hash code, not the one being computed).

I find it strange that GetHashCode isn't going to return the computed hashcode. I think this is going to confuse developers. For example, @SLaks already used it in his proposal instead of using ToHashCode.

@justinvp If GetHashCode() is not going to return the computed hash code, it should probably be marked [Obsolete] and [EditorBrowsable(Never)].

On the other hand, I don't see the harm in it returning the computed hash code.

@terrajobst

We decided to not override GetHashCode() to produce the hash code as this would be weird, both naming-wise as well as from a behavioral standpoint (GetHashCode() should return the object's hash code, not the one being computed).

Yes, GetHashCode() should return the object's hash code, but is there any reason why the two hash codes should be different? It's still correct, since two instances of HashCode with the same internal state will return the same value from GetHashCode().

@terrajobst I just saw your comment. Forgive me for the delayed reply, I was slow to look into the notification because I thought it would only be more back-and-forth that was going nowhere. Glad to see that's not the case! :tada:

I'd be delighted to pick this up and do the throughput/distribution measuring (I assume that's what you meant by "interested in working on this area"). Give me a second to finish reading through all of the comments here, though.

@terrajobst

Can we change

public void Add<T>(T[] value);
public void Add<T>(T[] value, int index, int length);
public void Add(byte[] value);
public void Add(byte[] value, int index, int length);

to

public void AddRange<T>(T[] values);
public void AddRange<T>(T[] values, int index, int count);
public void AddRange<T>(T[] values, int index, int count, IEqualityComparer<T> comparer);

? I renamed Add -> AddRange to avoid the behavior @svick mentioned. I removed the byte overloads since we can specialize using typeof(T) == typeof(byte) inside the method if we need to do anything byte-specific. Also, I changed value -> values and length -> count. It also makes sense to have a comparer overload.

@terrajobst Can you remind me why

        public void Add(string value);
        public void Add(string value, StringComparison comparisonType);

is necessary when we have

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);

?

@svick

@justinvp If GetHashCode() is not going to return the computed hash code, it should probably be marked [Obsolete] and [EditorBrowsable(Never)].

:+1:

@terrajobst Can we go back to having an implicit conversion from HashCode -> int, so no ToHashCode method? edit: ToHashCode is fine. See @CyrusNajmabadi's response below.

@jamesqo StringComparison is an enum.
However, people could use the equivalent StringComparer instead.

Can we go back to having an implicit conversion from HashCode -> int, so no ToHashCode method?

We discussed this and decided against it in the meeting. The issue is that when the user gets the final 'int', extra work is often done. i.e. the internals of the hashcode will often do a finalization step, and may reset itself to a fresh state. Having that happen with an implicit conversion would be strange. If you did this:

HashCode hc = ...

int i1 = hc;
int i2 = hc;

Then you could get different results.

For that reason, we also don't like the explicit conversion (as people don't think of conversions as changing internal state).

With a method we can document explicitly that this is happening. We can even potentially name it to convey as much, i.e. "ToHashCodeAndReset" (though we decided against that). But at least the method can have clear documentation on it that the user can see in things like IntelliSense. That's not really the case with conversions.

I removed the byte overloads since we can specialize using typeof(T) == typeof(byte)

IIRC there was some concern about this not being ok from the JIT perspective. But that may have only been for the non-value-type "typeof()" cases. As long as the jit will effectively do the right thing for the value-type typeof() cases, then that should be good.

@CyrusNajmabadi I was unaware that conversion to an int might involve mutating state. ToHashCode it is then.

For those thinking about the crypto perspective - http://tuprints.ulb.tu-darmstadt.de/2094/1/thesis.lehmann.pdf

@terrajobst, have you had time to read my comments (starting from here) and decide whether you approve of the tweaked API shape? If so, then I think this can be marked api-approved/up for grabs and we can start deciding on a hash algorithm.

@blowdart , any particular part of that you'd like to highlight?

I might not have been too explicit about it above, but the only non-cryptographic hashes I don't know of HashDoS breaks in are Marvin and SipHash. That is, even seeding (say) Murmur with a random value can still be broken and used for a DoS.

None, I just found it interesting, and I'm thinking the docs for this ought to say "Not for use on hash codes which are generated via cryptographic algorithms."

Decisions

  • We should remove all AddRange methods because the scenario is unclear. Arrays are somewhat unlikely to show up very often. And once larger arrays are involved, the question is whether the computation should be cached. Seeing the for loop on the calling side makes it clear that you need to think about that.
  • We also don't want to add IEnumerable overloads to AddRange because they would allocate.
  • We don't think we need the overload to Add that takes string and StringComparison. Yes, those are likely more efficient than calling through the IEqualityComparer, but we can fix this later.
  • We think marking GetHashCode as obsolete with error is a good idea, but we'd go a step further and also hide it from IntelliSense.

This leaves us with:

```C#
// Will live in the core assembly
// .NET Framework : mscorlib
// .NET Core      : System.Runtime / System.Private.CoreLib
namespace System
{
    public struct HashCode
    {
        public static int Combine<T1>(T1 value1);
        public static int Combine<T1, T2>(T1 value1, T2 value2);
        public static int Combine<T1, T2, T3>(T1 value1, T2 value2, T3 value3);
        public static int Combine<T1, T2, T3, T4>(T1 value1, T2 value2, T3 value3, T4 value4);
        public static int Combine<T1, T2, T3, T4, T5>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5);
        public static int Combine<T1, T2, T3, T4, T5, T6>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7);
        public static int Combine<T1, T2, T3, T4, T5, T6, T7, T8>(T1 value1, T2 value2, T3 value3, T4 value4, T5 value5, T6 value6, T7 value7, T8 value8);

        public void Add<T>(T value);
        public void Add<T>(T value, IEqualityComparer<T> comparer);

        [Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
        [EditorBrowsable(Never)]
        public override int GetHashCode();

        public int ToHashCode();
    }
}
```

Next steps: The issue is up for grabs -- to implement the API, we need to experiment with several candidate algorithms (see https://github.com/dotnet/corefx/issues/14354#issuecomment-305028686 for the list) so that we can decide which algorithm to take, based on throughput and distribution measurements (likely a different answer per CPU architecture).

Complexity: Large

If anyone is interested in picking it up, please ping us. There might be even room for several people working on it together. (@jamesqo you have priority choice as you invested most & longest into the issue)

@karelz Despite my comment above, I have changed my mind because I do not think I have the qualifications to choose the best hash algorithm. I looked into some of the libraries @morganbr listed and realized the implementation is quite complex, so I can't easily translate it to C# to test for myself. I have little background in C++, so I would also have a hard time just installing the library and writing a test app.

I do not want this to remain on the up-for-grabs list forever, though. If no one takes it up a week from today, I will consider posting a question on Programmers SE or Reddit.

I've not benched it (or otherwise optimized it), but here is a basic implementation of the Murmur3 hash algorithm that I use in several of my personal projects: https://gist.github.com/tannergooding/0a12559d1a912068b9aeb4b9586aad7f

I feel like the optimal solution here is going to be to dynamically change the hashing algorithm based on the size of the input data.

Ex: Mumur3 (and others) are very fast for large sets of data and provide great distribution, but they can perform 'poorly' (speed wise, not distribution wise) for smaller sets of data.

I imagine we should do something like: If overall byte count is less than X, do algorithm A; otherwise, do algorithm B. This will still be deterministic (per run), but will allow us to provide speed and distribution based on the actual size of the input data.
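
A minimal sketch of that dispatch idea (the threshold, the method names, and the FNV-1a placeholder bodies are all hypothetical; the point is only the size-based selection):

```c#
using System;

internal static class AdaptiveHash
{
    // Hypothetical cut-over point; would be chosen from benchmarks.
    private const int SmallInputThreshold = 16;

    public static int ComputeHash(ReadOnlySpan<byte> data) =>
        data.Length < SmallInputThreshold
            ? SmallInputHash(data)   // cheap startup, e.g. an xxHash32-style routine
            : LargeInputHash(data);  // high throughput, e.g. Murmur3/SpookyHash

    // Placeholder bodies; a real implementation would plug in the chosen algorithms.
    private static int SmallInputHash(ReadOnlySpan<byte> data) => Fnv1a(data);
    private static int LargeInputHash(ReadOnlySpan<byte> data) => Fnv1a(data);

    // FNV-1a is used here only so the sketch compiles and runs; it is not a proposed choice.
    private static int Fnv1a(ReadOnlySpan<byte> data)
    {
        uint hash = 2166136261;
        foreach (byte b in data)
            hash = (hash ^ b) * 16777619;
        return (int)hash;
    }
}
```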

It's probably also worth noting that several of the algorithms mentioned have implementations specifically designed for SIMD instructions, so the most performant solution would likely involve an FCALL at some level (like is done with some of the BufferCopy implementations) or may involve taking a dependency on System.Numerics.Vector.

@jamesqo, we're happy to help out with making choices; what we need the most help with is performance data for candidate implementations (ideally C#, though as @tannergooding points out, some algorithms need special compiler support). As I mentioned above, if you build a candidate that isn't chosen, we'll probably use it later, so don't worry about work being wasted.

I know there are benchmarks out there for various implementations, but I think it's important to have a comparison using this API and a likely range of inputs (e.g. structs with 1-10 fields).

@tannergooding, that kind of adaptability might be most performant, but I don't see how it would work with the Add method since it doesn't know how many times it'll get called. While we could do it with Combine, that would mean a series of Add calls could produce a different result than the corresponding Combine call.

Also, given that the most likely range of inputs is 4-32 bytes (Combine`1-Combine`8), hopefully there aren't big performance changes over that range.

that kind of adaptability might be most performant, but I don't see how it would work with the Add method since it doesn't know how many times it'll get called.

I'm not personally convinced the API shape is quite right for general purpose hashing (it is close however)...

Currently we are exposing Combine methods for static construction. If these are meant to combine all inputs and produce a finalized hash code, then the name is 'poor' and something like Compute might be more appropriate.

If we are exposing Combine methods, they should just mix all inputs and users should be required to call a Finalize method which takes the output from the last combine as well as total number of bytes that were combined to produce a finalized hash code (finalizing a hash code is important as it is what causes the bits to avalanche).

For the builder pattern, we are exposing an Add and ToHashCode method. It is not clear if the Add method is meant to store the bytes and only combine/finalize on the call to ToHashCode (in which case we can pick the correct algorithm dynamically), or if the inputs are meant to be combined on the fly; if the latter, it should be clear that this is the case (and the implementation should be internally tracking the total size of the bytes combined).

For anyone looking for a less complicated starting point, try xxHash32. That's likely to translate pretty easily to C# (people have done it).
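For a sense of what that looks like, here is a rough xxHash32-style combiner for two already-computed hash codes (a sketch using the standard xxHash32 constants and its short-input path; not a vetted or proposed implementation):

```c#
// Sketch only: combines two 32-bit hash codes the way xxHash32 would hash 8 bytes.
static int CombineTwo(int h1, int h2, uint seed = 0)
{
    const uint Prime2 = 2246822519U, Prime3 = 3266489917U,
               Prime4 = 668265263U, Prime5 = 374761393U;

    uint hash = seed + Prime5;
    hash += 8; // total input length in bytes (two 4-byte values)

    // one xxHash32 round per 4-byte lane
    hash += (uint)h1 * Prime3;
    hash = ((hash << 17) | (hash >> 15)) * Prime4; // rotl 17
    hash += (uint)h2 * Prime3;
    hash = ((hash << 17) | (hash >> 15)) * Prime4;

    // final avalanche
    hash ^= hash >> 15;
    hash *= Prime2;
    hash ^= hash >> 13;
    hash *= Prime3;
    hash ^= hash >> 16;
    return (int)hash;
}
```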

Still testing locally, but I am seeing the following throughput rates for my C# implementation of Murmur3.

These are for the static Combine methods for 1-8 inputs:

| Inputs | Throughput |
|-------:|-----------:|
| 1 | 1070.18 MB/s |
| 2 | 1511.49 MB/s |
| 3 | 1674.89 MB/s |
| 4 | 1957.65 MB/s |
| 5 | 2083.24 MB/s |
| 6 | 2140.94 MB/s |
| 7 | 2190.27 MB/s |
| 8 | 2245.53 MB/s |

My implementation assumes that GetHashCode should be called for each input and that the computed value should be finalized before being returned.

I combined int values, as they are the simplest to test.

To compute the throughput, I ran 10,001 iterations, throwing away the first iteration as the 'warmup' run.

In each iteration, I run 10,000 sub-iterations where I call HashCode.Combine, passing the result of the previous sub-iteration in as the first input value in the next iteration.

I then average all iterations to get the average elapsed time, further divide that by the number of sub-iterations run per loop to get the average time per call. I then compute the number of calls that can be made per-second and multiply that by the number of bytes combined to compute the actual throughput.

Will clean up the code and share it in a bit.
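In the meantime, here's an illustrative sketch of that measurement loop (the names, the use of Stopwatch, and the output format are mine, and HashCode.Combine stands in for the implementation under test; the real harness is in the gist linked below):

```c#
using System;
using System.Diagnostics;
using System.Linq;

class ThroughputSketch
{
    static void Main()
    {
        const int Iterations = 10_001, SubIterations = 10_000, BytesPerCall = 8; // e.g. combining two ints
        var timings = new double[Iterations];
        var sw = new Stopwatch();
        int carry = 0;

        for (int i = 0; i < Iterations; i++)
        {
            sw.Restart();
            for (int j = 0; j < SubIterations; j++)
                carry = HashCode.Combine(carry, j); // previous result feeds the next call
            sw.Stop();
            timings[i] = sw.Elapsed.TotalSeconds;
        }

        double perCall = timings.Skip(1).Average() / SubIterations; // drop the warmup iteration
        double mbPerSec = BytesPerCall / perCall / (1024 * 1024);
        Console.WriteLine($"{mbPerSec:F2} MB/s (sink: {carry})");
    }
}
```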

@tannergooding, that sounds like great progress. To make sure you're getting the right measurements, the intent of the API is that a call to HashCode.Combine(a, b) is equivalent to calling

```c#
HashCode hc = new HashCode();
hc.Add(a); // Initializes the hash state, calls a.GetHashCode() and feeds the result into the hash state
hc.Add(b); // Calls b.GetHashCode() and feeds the result into the hash state
return hc.ToHashCode(); // Finalizes the hash state, truncates it to an int, resets the internal state and returns the int
```

In both cases, the data should get fed into the same internal hash state and the hash should be finalized once at the end.

👍

That is effectively what the code I wrote is doing. The only difference is that I effectively inline all the code (there is no need to allocate new HashCode() and track the number of bytes combined since it is constant).

@morganbr. Implementation + Throughput test for Murmur3: https://gist.github.com/tannergooding/89bd72f05ab772bfe5ad3a03d6493650

MurmurHash3 is based off the algorithm described here: https://github.com/aappleby/smhasher/wiki/MurmurHash3, repo says it is MIT

Working on xxHash32 (BSD-2 Clause -- https://github.com/Cyan4973/xxHash/blob/dev/xxhash.c) and SpookyHash (Public Domain -- http://www.burtleburtle.net/bob/hash/spooky.html) variants

@tannergooding Again, not a hash expert, but I remembered reading an article that said Murmur was not DoS-resistant, so just pointing that out before we choose that.

@jamesqo, I might be wrong, but I'm fairly certain that vulnerability applied to Murmur2 and not Murmur3.

In either case, I am implementing several algorithms so that we can get throughput results for C#. The distribution and other properties of these algorithms are fairly well known so we can pick and choose which is the best later 😄

Whoops, forgot to link to the article: http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/.

@tannergooding OK. Sounds fair :+1:

@tannergooding, I took a look at your Murmur3 implementation and it generally looks right and probably pretty well optimized. To make sure I understand correctly, are you using the fact that combinedValue and the internal state of Murmur are both 32 bits? That's probably a pretty good optimization for this case and explains some of my earlier confusion.

If we were to adopt it, it might need a couple tweaks (they probably won't make a big difference to perf measurements though):

  • Combine should still call CombineValue on value1
  • The first CombineValue calls should take a random seed
  • ToHashCode should reset _bytesCombined and _combinedValue

In the meantime while I'm longing for this API, how bad is it for me to be implementing GetHashCode via (field1, field2, field3).GetHashCode()?

@jnm2 , the ValueTuple hash code combiner tends to put your inputs in order in the hash code (and discard the least recent ones). For a couple of fields and a hash table that divides by a prime number, you might not notice. For a lot of fields or a hash table that divides by a power of two, the entropy of the last field you insert will have the most influence over whether you have collisions (e.g. if your last field is a bool or a small int, you'll probably have a lot of collisions, if it's a guid, you probably won't).

ValueTuple also doesn't work well with fields that are all 0.

On a side note, I had to stop working on other implementations (have higher priority work). Not sure when I'll be able to pick it back up.

So if that's not good enough for a structured type, why is it good enough for a tuple?

@jnm2, that's one reason this feature is worth building -- so we can go replace substandard hashes across the framework.

Large table of hash functions with performance and quality characteristic:
https://github.com/leo-yuriev/t1ha

@arespr I think the team is looking for a C# implementation of the hash functions. Thank you for sharing, though.

@tannergooding Are you still unable to pick this issue back up? If so then I'll post on Reddit/Twitter that we're looking for a hash expert.

edit: Made a post on Reddit. https://www.reddit.com/r/csharp/comments/6qsysm/looking_for_hash_expert_to_help_net_core_team/?ref=share&ref_source=link

@jamesqo, I have a few higher priority things on my plate and won't be able to get to this within the next 3 weeks.

Also, the current measurements will be limited by what we can currently code in C#, however, if/when this becomes a thing (https://github.com/dotnet/designs/issues/13), the measurements will likely change somewhat ;)

Also, the current measurements will be limited by what we can currently code in C#, however, if/when this becomes a thing (dotnet/designs#13), the measurements will likely change somewhat ;)

That's OK-- we can always change the hash algorithm once intrinsics become available; encapsulating/randomizing the hash code enables us to do that. We're just looking for something that offers the best performance/distribution tradeoff for the runtime in its current state.

@jamesqo, thanks for looking for folks to help out. We'd be happy to have somebody who isn't a hash expert work on this too -- we really just need somebody who can port some algorithms to C# from other languages or designs and then do performance measurements. Once we've chosen candidates, our experts will do what we do on any change -- review the code for correctness, performance, security, etc.

Hi! I've just read through the discussion, and at least to me it seems the case is closed strongly in favor of the murmur3-32 PoC. Which BTW seems like a very fine choice to me, and I'd recommend not spending any more needless work (but maybe even drop the .Add() members...).

But in the unlikely case that somebody wants to continue with more performance work, I could supply some code for xx32, xx64, hsip13/24, seahash, murmur3-x86/32 (and I integrated the marvin32 impl from above), and (yet unoptimized) sip13/24, spookyv2. Some versions of City look easy enough to port, should the need arise. That half-abandoned project had a slightly different use-case in mind, so there is no HashCode class with the proposed API; but for benchmarking it should not matter much.

Definitely not production-ready: the code applies generous amounts of brute-force-like copy-pasta and a cancerous sprawl of aggressive-inline and unsafe; endianness doesn't exist, neither do unaligned reads. Even tests against ref-impl test-vectors are, euphemistically speaking, "incomplete".

If this is any help at all, I should find enough time during the next two weeks to fix up the most egregious issues, and make the code and some preliminary results available.

@gimpf

I've just read through the discussion, and at least to me it seems the case is closed strongly in favor of the murmur3-32 PoC. Which BTW seems like a very fine choice to me, and I'd recommend not spending any more needless work

No, people aren't favoring Murmur3 yet. We want to make sure we're picking the absolute best algorithm in terms of balance between performance/distribution, so we can't leave any stone unturned.

But in the unlikely case that somebody wants to continue with more performance work, I could supply some code for xx32, xx64, hsip13/24, seahash, murmur3-x86/32 (and I integrated the marvin32 impl from above), and (yet unoptimized) sip13/24, spookyv2. Some versions of City look easy enough to port, should the need arise.

Yes, please! We want to gather code for as many algorithms as possible to test against. Every new algorithm you can contribute is valuable. It would be highly appreciated if you could port the City algorithms, too.

Definitely not production-ready: the code applies generous amounts of brute-force-like copy-pasta and a cancerous sprawl of aggressive-inline and unsafe; endianness doesn't exist, neither do unaligned reads. Even tests against ref-impl test-vectors are, euphemistically speaking, "incomplete".

That's OK. Just bring the code in, and someone else can find it if the need arises.

If this is any help at all, I should find enough time during the next two weeks to fix up the most egregious issues, and make the code and some preliminary results available.

Yes, that would be great!

@jamesqo Ok, I'll drop a note once I've something to show.

@gimpf that sounds really great and we'd love to hear about your progress as you go (no need to wait until you get to work through every algorithm!). Not production-ready is fine as long as you believe the code produces correct results and that the performance is a good representation of what we'd see in a production-ready implementation. Once we pick candidates, we can work with you on getting to high quality implementations.

I haven't seen analysis of how seahash's entropy compares to other algorithms. Do you have any pointers on that? It has interesting sounding perf tradeoffs... vectorization sounds fast, but modular arithmetic sounds slow.

@morganbr I've got a teaser ready.

About SeaHash: No, I've don't know about the quality yet; in case the performance is interesting, I'd add it to SMHasher. At least the author claims it is good (using it for checksums in a filesystem), and also claims that no entropy is thrown away during the mixing.

About the hashes and benchmarks: Project Haschisch.Kastriert, wiki page with first benchmarking results comparing xx32, xx64, hsip13, hsip24, marvin32, sea and murmur3-32.

Some important caveats:

  • This was a very quick bench run with low accuracy settings.
  • The implementations are not really done yet, and some contenders are still missing. The Streaming implementations (such a thing would become necessary for a sensible .Add() support) are in need of actual optimization.
  • SeaHash is currently not using a seed.

First impressions:

  • for large messages, xx64 is the fastest of the listed implementations (around 3.25 bytes per cycle, as far as I understand, or 9.5 GiB/s on my notebook)
  • for short messages, nothing is great, but murmur3-32 and (surprisingly) seahash have an edge; the latter is likely explained by seahash not yet using a seed.
  • the "benchmark" for accessing a HashSet<> needs work, as everything is almost within measurement error (I've seen larger differences, but still not worth talking about)
  • when combining hash-codes, the murmur-3A PoC is around 5 to 20 times faster than what we have here
  • some abstractions in C# are very expensive; that makes comparing hash-algorithms more annoying than necessary.

I'll write you again once I've improved the situation a bit.

@gimpf, that's a fantastic start! I took a look at the code and results and I have a few questions.

  1. Your results show SimpleMultiplyAdd as about 5x slower than @tannergooding's Murmur3a. That seems odd since Murmur has more work to do than multiply+add (though I'll concede that rotate is a faster operation than add). Is it possible that your implementations have a common inefficiency that isn't in that Murmur implementation or should I read this as custom implementations having a big advantage over general-purpose ones?
  2. Having results for 1, 2, and 4 combinations is good, but this API goes up to 8. Would it be possible to get results for that as well or does that cause too much duplication?
  3. I saw that you ran on X64, so these results should help us along on choosing our X64 algorithm, but other benchmarks suggest that algorithms can differ pretty dramatically between X86 and X64. Is it easy for you to also get X86 results? (At some point, we'd also need to get ARM and ARM64, but those can definitely wait)

Your HashSet results are particularly interesting. If they hold up, that's a possible case for preferring better entropy over faster hash time.

@morganbr This weekend was more on-and-off, so progress is limited.

About your questions:

  1. Your results show SimpleMultiplyAdd as about 5x slower than @tannergooding's Murmur3a. That seems odd ...

I was wondering myself. That was a copy/paste error, SimpleMultiplyAdd was always combining four values... Also, by reordering some statements the multiply-add combiner got slightly faster (~60% higher throughput).

Is it possible that your implementations have a common inefficiency that isn't in that Murmur implementation or should I read this as custom implementations having a big advantage over general-purpose ones?

I'm likely missing some things, but it seems that for .NET, general-purpose implementations are not usable for this use-case. I've written Combine-style methods for all algorithms, and w.r.t. hash-code combining most perform _much_ better than the general-purpose ones.

However, even those implementations remain too slow; further work is needed. .NET performance in this area is absolutely opaque to me; adding or removing a copy of a local variable can easily change performance by a factor of two. I will likely not be able to provide implementations that are sufficiently well-optimized for the purpose of selecting the best option.

  1. Having results for 1, 2, and 4 combinations is good, but this API goes up to 8.

I've extended the combine-benchmarks. No surprises on that front.

  1. I saw that you ran on X64 (...), Is it easy for you to also get X86 results?

It once was, but then I ported to .NET Standard. Now I'm in dependency-hell, and only .NET Core 2 and CLR 64-bit benchmarks work. This can be resolved easily enough once I've resolved the current issues.

Do you think this will make it into the v2.1 release?

@gimpf You haven't posted in a while-- do you have a progress update on your implementations? :smiley:

@jamesqo I've fixed some benchmarks which caused weird results, and added City32, SpookyV2, Sip13 and Sip24 to the list of available algorithms. The Sips are as fast as expected (relative to the throughput of xx64); City and Spooky are not (the same is still true for SeaHash).

For combining hash-codes, Murmur3-32 still looks like a good bet, but I have yet to run a more exhaustive comparison.

On another note, the streaming API (.Add()) has the unfortunate side effect of removing some hash algorithms from the list of candidates. Given that the performance of such an API is also questionable, you might want to rethink whether to offer it from the beginning.

If the .Add() part were avoided, and given that the hash-combiner is using a seed, I don't think there would be any harm in cleaning up tg's combiner, creating a small test-suite, and calling it a day. Since I only have a few hours every weekend, and the performance optimization is somewhat tedious, making the gold-plated version could drag on a bit...

@gimpf , that sounds like great progress. Do you have a results table handy so we can see if there's enough to make a decision and move forward?

@morganbr I've updated my benchmarking results.

For now I only have 64-bit results on .NET Core 2. For that platform, City64 w/o seed is the fastest across all sizes. Incorporating a seed, XX-32 is tied with Murmur-3-32. Luckily, these are the same algorithms that have a reputation for being fast on 32-bit platforms, but obviously we need to verify that that holds true for my implementation as well. The results seem to be representative of real-world performance, except that Sea and SpookyV2 seem unusually slow.

You will need to consider how much you really need hash-DoS protection for hash-code combiners. If seeding is only needed to make the hash obviously unusable for persistence, City64, once XOR'd with a 32-bit seed, would be an improvement. As this utility is only there for combining hashes (and not to replace, for instance, the hash code for strings, or to be a drop-in hasher for integer arrays, etc.), that might be good enough.

If OTOH you think you need it, you'll be happy to see that Sip13 is usually less than 50% slower than XX-32 (on 64-bit platforms), but that result will likely be significantly different for 32-bit apps.

Don't know how much it's relevant to corefx, but I've added LegacyJit 32bit (w/FW 4.7) results.

I'd like to say that the results are ludicrously slow. However, as an example, at 56 MiB/s vs. 319 MiB/s I'm not laughing (that's Sip, it's missing the rotate-left optimization the most). I think I remember why I canceled my .NET hash-algorithm project in January...

So, RyuJit-32bit is still missing, and will (hopefully) give very different results, but for LegacyJit-x86, Murmur-3-32 wins handily, and only City-32 and xx-32 can come close. Murmur still has bad performance at only around 0.4 to 1.1 GB/s instead of 0.6 to 2 GB/s (on the same machine), but at least it's in the right ballpark.

I'm going to run the benchmarks on a few of my boxes tonight and post results (Ryzen, i7, Xeon, A10, i7 Mobile, and I think a couple others).

@tannergooding @morganbr Some nice and some important updates.

Important first:

  • I fixed some combine-implementations which were producing incorrect hash values.
  • The benchmark suite now works harder to avoid constant folding. City64 was susceptible (as was murmur-3-32 in the past). Doesn't mean that I understand every result now, but they are much more plausible.

Nice things:

  • Combiner implementations are now available for all 1 to 8 argument overloads, including the somewhat more cumbersome manually unrolled implementations for xx/city.
  • Tests and Benchmarks check those too. Since many hash algorithms have special-cased low-byte messages those measurements might be of interest.
  • Simplified running benchmarks for multiple targets (Core vs. FW).

To run a suite on all prime implementations for combining hash-codes, including "Empty" (pure overhead) and "multiply-add" (speed-optimized version of famous SO answer):

```
bin\Release\net47\Haschisch.Benchmarks.Net47.exe -j:clr_x86 -j:clr_x64_legacy -j:clr_x64 -j:core_x64 -- CombineHashCode --allcategories=prime
```

_(Running 32-bit Core benchmarks conveniently seems to require a prerelease BenchmarkDotNet, or maybe a 32-bit-only set-up plus using the Core-based bench-runner. It should then work using -j:core_x86, hopefully.)_

Results: After all bugfixing, xx32 seems to win for all overloads w/64 bit RyuJIT, on Windows 10 on a mobile Haswell i7, in a "quick" run. Between the Sips and marvin32, Sip-1-3 always wins. Sip-1-3 is about 4 times slower than xx32, which again is about 2 times slower than a primitive multiply-add combiner. 32bit Core results are still missing, but I am more or less waiting for a stable BenchmarkDotNet release that will solve that issue for me.

(Edit) I just added a quick run of a benchmark for accessing a hash-set. This is obviously much more dependent on details than the µ-benchmarks above, but you might want to give it a look.

Thanks once again @gimpf for the fantastic data! Let's see if we can turn that into a decision.

To start with, I'd divide the algorithms like this:
Fast+Good entropy (ordered by speed):

  1. xxHash32
  2. City64 (This will probably be slow on x86, so we'll probably have to pick something else for x86)
  3. Murmur3A

HashDoS resistant:

  • Marvin32
  • SipHash. If we lean toward this, we'll need to get it reviewed by Microsoft's crypto experts to confirm the research results are acceptable. We'll also have to figure out which parameters are secure enough. The paper suggests somewhere between Sip-2-4 and Sip-4-8.

Out of contention (slow):

  • SpookyV2
  • City32
  • xxHash64
  • SeaHash (and we don't have data on entropy)

Out of contention (bad entropy):

  • MultiplyAdd
  • HSip

Before we pick a winner, I'd like to make sure other folks agree with my bucketing above. If it holds, I'd think we just need to choose whether to pay 2x for HashDoS resistance and then go by speed.

@morganbr Your grouping seems fine. As a data-point in SipHash rounds, the Rust project asked Jean-Philippe Aumasson, who authored sip-hash w/DJB. After that discussion they decided to go for sip-1-3 for hash-tables.

(See PR rust:#33940 and the accompanying issue rust:#29754).

Based on the data and comments, I'd like to propose that we use xxHash32 on all architectures. The next step is to get it implemented. @gimpf, are you interested in putting together a PR for that?

For those concerned about HashDoS, I will follow up soon with a proposal for a general-purpose hashing API that should include Marvin32 and may include SipHash. That will also be an appropriate place for the other implementations @gimpf and @tannergooding have worked on.

@morganbr I can put together a PR as time permits. Also, I personally would prefer xx32 too, as long as it doesn't reduce acceptance.

@gimpf , how's your time looking? If you don't really have time, we can also see if anyone else would like to give it a shot.

@morganbr I'd planned to do it until 5th of November, and it still looks good that I'll find the time in the next two weeks.

@gimpf , sounds great. Thanks for the update!

@terrajobst - I'm a bit late to the party (sorry), but can't we change the return type of the Add method?

```c#
public HashCode Add<T>(T value);
public HashCode Add<T>(T value, IEqualityComparer<T> comparer);
```

The params code is clearly there for scenarios where you have multiple fields, e.g.

```c#
public override int GetHashCode() => new HashCode().Add(Name, Surname).ToHashCode();
```

However, exactly the same thing can be achieved like this, albeit with one less wasteful array allocation:

```c#
public override int GetHashCode() => new HashCode().Add(Name).Add(Surname).Add(Age).ToHashCode();
```

Note that types can also be mixed. This could obviously be done by not calling it fluently inside of a regular method. Given this argument that the fluent interface is not absolutely necessary, why does the wasteful params overload exist to begin with? If this suggestion is a bad suggestion, then the params overload falls to the very same axe. That, and forcing a regular method for a trivial yet optimal hashcode seems like a lot of ceremony.

Edit: An implicit operator int would also be nice for DRY, but not exactly crucial.

@jcdickinson

can't we change the return type of the Add method?

We already discussed that in the old proposal, and it was rejected.

why does the wasteful params overload exist to begin with?

We are not adding any params overloads? Do a Ctrl+F for "params" on this webpage, and you'll see that your comment is the only place where that word pops up.

An implicit operator int would also be nice for DRY, but not exactly crucial.

I believe that was also discussed somewhere above...

@jamesqo thanks for the explanation.

params overloads

I meant AddRange, but I guess I there won't be any traction on this.

@jcdickinson AddRange was in the original proposal, but it's not in the current version. It was rejected by API review (see https://github.com/dotnet/corefx/issues/14354#issuecomment-308190321 by @terrajobst):

We should remove all AddRange methods because the scenario is unclear. Arrays are somewhat unlikely to show up very often. And once larger arrays are involved, the question is whether the computation should be cached. Seeing the for loop on the calling side makes it clear that you need to think about that.

@gimpf I went ahead and polyfilled the proposal with xxHash32. Feel free to grab that implementation. It has tests against actual xxHash32 vectors.

Edit

Regarding the interface. I am fully aware that I am making a mountain out of a molehill - feel free to ignore. I am using the current proposal against real stuff and it is a lot of annoying repetition.

I've been playing around with the interface and now understand why the fluent interface was rejected; it's significantly slower.

```
BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-4800MQ CPU 2.70GHz (Haswell), ProcessorCount=8
Frequency=2630626 Hz, Resolution=380.1377 ns, Timer=TSC
.NET Core SDK=2.0.2
  [Host]     : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
```

Using a non-inlined method as a hash code source; 50 invocations of Add vs a fluent extension method:

| Method | Mean | Error | StdDev | Scaled |
|------- |---------:|---------:|---------:|-------:|
| Add | 401.6 ns | 1.262 ns | 1.180 ns | 1.00 |
| Tally | 747.8 ns | 2.329 ns | 2.178 ns | 1.86 |

However, the following pattern does work:

```c#
public struct HashCode : System.Collections.IEnumerable
{
    [EditorBrowsable(EditorBrowsableState.Never)]
    [Obsolete("This method is provided for collection initializer syntax.", error: true)]
    public IEnumerator GetEnumerator() => throw new NotImplementedException();
}

public override int GetHashCode() => new HashCode()
{
    Age, // int
    { Name, StringComparer.Ordinal }, // use Comparer
    Hat // some arbitrary object
}.ToHashCode();

```

It also has identical performance characteristics to the current proposal:

| Method | Mean | Error | StdDev | Scaled |
|------------ |---------:|---------:|---------:|-------:|
| Add | 405.0 ns | 2.130 ns | 1.889 ns | 1.00 |
| Initializer | 400.8 ns | 4.821 ns | 4.274 ns | 0.99 |

Sadly it is somewhat of a hack, as IEnumerable has to be implemented to keep the compiler happy. That being said, the Obsolete will error out even on foreach - you'd have to really want to break things in order to run into the exception. The MSIL across the two is essentially identical.

@jcdickinson thanks for grabbing the issue. I sent you Collaborator invite, let me know when you accept and I will be able to assign this issue to you (assigning to myself in the mean time).

Pro-tip: Once you accept, GitHub will automatically sign you up for all notifications from the repo (500+ per day), I would recommend to change it to just "Not Watching" which will send you all your mentions and notifications for issues you subscribed to.

@jcdickinson , I'm definitely interested in ways to avoid annoying repetition (though I have no idea how folks would feel about initializer syntax). I seem to recall that there were two problems with fluent :

  1. The perf issue you noted
  2. The return value from fluent methods is a copy of the struct. It's too easy to accidentally end up losing input doing things like:
```c#
var hc = new HashCode();
var newHc = hc.Add(foo);
hc.Add(bar);
return newHc.ToHashCode();
```

Since the proposal on this thread is already approved (and you're well on your way to getting it merged), I'd suggest starting a new API proposal for any changes.

@karelz I believe @gimpf already grabbed this issue beforehand. Since he has more familiarity with the implementation, please assign this issue to @gimpf instead. (edit: nvm)

@terrajobst One kind of last-minute API request for this. Since we marked GetHashCode obsolete, we're implicitly telling the user that HashCodes are not values meant to be compared, despite being structs which are typically immutable/comparable. In that case, should we mark Equals obsolete too?

[Obsolete("HashCode is a mutable struct and should not be compared with other HashCodes.", error: true)]
[EditorBrowsable(Never)]
// If this is too harsh, base.Equals() is fine as long as the [Obsolete] stays
public override bool Equals(object obj) => throw new NotSupportedException("HashCode is a mutable struct and should not be compared with other HashCodes.");

I think something similar was done with Span.

If that's accepted, then I think...

  1. I'd consider using "should not" or "may not" instead of "cannot" in the Obsolete message.
  2. Provided the Exception stays, I would put the same string in its message, just in case the method gets called through a cast or open generic.

@Joe4evr Fine with me; I've updated the comment. It may also be beneficial to include the same message in the GetHashCode exception too, then:

```c#
public override int GetHashCode() => throw new NotSupportedException("HashCode is a mutable struct and should not be compared with other HashCodes.");
```

@morganbr Why did you reopen this?

The PR to get it exposed in CoreFX has not gone through yet.

@gimpf do you have the code you benchmarked available and/or would you be able to quickly see how the SpookilySharp NuGet package fares? I'm looking to dust that project off after a couple of years' stagnation and I'm curious to see how it stands up.

@JonHanna He posted it here: https://github.com/gimpf/Haschisch.Kastriert

@JonHanna, I'd be interested to hear how your testing goes so we can start thinking about what would be useful in a general-purpose non-cryptographic hashing API.

@morganbr Where would be an appropriate forum to discuss such an API? I expect that such an API would consist of more than just the lowest common denominator, and maybe a good API will also need an improved JIT w.r.t. handling of larger structs. Discussing all that might be better done in a separate issue...

@gimpf Opened one for you. dotnet/corefx#25666

@morganbr - Can we get the package name & version no which will include this commit?

@karelz, might you be able to help @smitpatel with package/version info?

I would try a daily build of .NET Core - I'd wait until tomorrow.
I don't think there is a package you can simply take a dependency on.

Question for the participants here. The Roslyn IDE allows users to generate a GetHashCode impl based on a set of fields/properties in their class/struct. Ideally, people could use the new HashCode.Combine that was added in https://github.com/dotnet/corefx/pull/25013. However, some users will not have access to that code. So, we'd like to still be able to generate a GetHashCode that will work for them.

Recently, it came to our attention that the form we generate is problematic. Namely, because VB compiles with overflow checks on by default, our impl will cause overflows. Also, VB has no way to disable overflow checks for a region of code. It's either on or off entirely for the entire assembly.

Because of this, i'd love to be able to replace the impl we provide with a form that doesn't suffer from these problems. Ideally, the form generated would have the following properties:

  1. One/two lines in GetHashCode per field/property used.
  2. No overflowing.
  3. Reasonably good hashing. We're not expecting amazing results. But something that has hopefully already been vetted to be decent, and to not have the problems you usually get with a + b + c + d or a ^ b ^ c ^ d.
  4. No additional dependencies/requirements on the code.

For example, one option for VB would be to generate something like:

return (a, b, c, d).GetHashCode()

But this then depends on having a reference to System.ValueTuple. Ideally, we could have an impl that works even in the absence of that.

Does anyone know about a decent hashing algorithm that can work with these constraints? Thanks!

--

Note: our existing emitted code is:

        Dim hashCode = -252780983
        hashCode = hashCode * -1521134295 + i.GetHashCode()
        hashCode = hashCode * -1521134295 + j.GetHashCode()
        Return hashCode

This clearly can overflow.

This is also not a problem for C# as we can just add unchecked { } around that code. That fine-grained control is not possible in VB.
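For comparison, the C# shape with that wrapper would look something like the following (same constants as the emitted VB above; i and j are placeholder fields):

```c#
public override int GetHashCode()
{
    unchecked // wrap-around arithmetic is intended for hash mixing
    {
        var hashCode = -252780983;
        hashCode = hashCode * -1521134295 + i.GetHashCode();
        hashCode = hashCode * -1521134295 + j.GetHashCode();
        return hashCode;
    }
}
```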

Does anyone know about a decent hashing algorithm that can work with these constraints? Thanks!

Well, you could do Tuple.Create(...).GetHashCode(). Obviously that incurs allocations, but it seems better than throwing an exception.

Is there any reason you can't just tell the user to install System.ValueTuple? Since it's a builtin language feature, I'm sure the System.ValueTuple package is very compatible with basically all platforms right?

Obviously that incurs allocations, but it seems better than throwing an exception.

Yes. It would be nice to not have it cause allocations.

Is there any reason you can't just tell the user to install System.ValueTuple?

That would be the behavior if we generate the ValueTuple approach. However, again, it would be nice if we could just generate something good that fits with the way the user has currently structured their code, without making them change their structure in a heavyweight way.

It really does seem like VB users should have a way to address this problem in a reasonable manner :) But such an approach is eluding me :)

@CyrusNajmabadi, If you really need to do your own hash calculation in the user's code, CRC32 might work since it's a combination of table lookups and XORs (but not arithmetic that can overflow). There are some drawbacks though:

  1. CRC32 doesn't have great entropy (but it's likely still better than what Roslyn emits now).
  2. You'd need to put a 256 entry lookup table somewhere in the code or emit code to generate the lookup table.

If you're not doing it already, I'd hope you can detect the HashCode type and use that when possible since XXHash should be much better.

@morganbr See https://github.com/dotnet/roslyn/pull/24161

We do the following:

  1. Use System.HashCode if it is available. Done.
  2. Otherwise, if in C#:
    2a. If not in checked-mode: Generate unrolled hash.
    2b. If in checked-mode: Generate unrolled hash, wrapped in 'unchecked{}'.
  3. Otherwise, if in VB:
    3b. If not in checked-mode: Generate unrolled hash.
    3c. If in checked-mode, but has access to System.ValueTuple: Generate Return (a, b, c, ...).GetHashCode()
    3d. If in checked-mode without access to System.ValueTuple. Generate unrolled hash, but add a comment in VB that overflows are very likely.

It's '3d' that's really unfortunate. Basically, someone using VB but not using ValueTuple or a recent System, will not be able to use us to get a reasonable hash algorithm generated for them.

You'd need to put a 256 entry lookup table somewhere in the code

This would be completely unpalatable :)

Is table-generation code also unpalatable? At least going by Wikipedia's example, it's not much code (but it still has to go somewhere in the user's source).
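For scale, table generation along the lines of the Wikipedia example is roughly the following (a C# sketch using the standard reflected CRC-32 polynomial; only shifts and XORs, so nothing here trips overflow checks):

```c#
// Builds the 256-entry CRC-32 lookup table at startup instead of embedding it.
static uint[] BuildCrc32Table()
{
    var table = new uint[256];
    for (uint i = 0; i < 256; i++)
    {
        uint c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) != 0 ? 0xEDB88320 ^ (c >> 1) : c >> 1;
        table[i] = c;
    }
    return table;
}
```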

How awful would it be to add the HashCode source to the project like Roslyn does (with IL) with (the much simpler) compiler attribute class definitions when they aren't available through any referenced assembly?

How awful would it be to add the HashCode source to the project like Roslyn does with (the much simpler) compiler attribute class definitions when they aren't available through any referenced assembly?

  1. Does the HashCode source not need overflow behavior?
  2. I've skimmed the HashCode source. It's non trivial. Generating all that goop into the user's project would be pretty heavyweight.

I'm just surprised there are no good ways to get overflow math to work in VB at all :(

So, at a minimum, even if we were hashing two values together, it seems like we would have to create:

```c#
var hc1 = (uint)(value1?.GetHashCode() ?? 0); // can overflow
var hc2 = (uint)(value2?.GetHashCode() ?? 0); // can overflow

uint hash = MixEmptyState();
hash += 8; // can overflow

hash = QueueRound(hash, hc1);
hash = QueueRound(hash, hc2);

hash = MixFinal(hash);
return (int)hash; // can overflow
```

Note that this code already has 4 lines that can overflow. It also has two helper functions you need to call (I'm ignoring MixEmptyState as that seems more like a constant). MixFinal can *definitely* overflow:

```c#
private static uint MixFinal(uint hash)
{
    hash ^= hash >> 15;
    hash *= Prime2;
    hash ^= hash >> 13;
    hash *= Prime3;
    hash ^= hash >> 16;
    return hash;
}
```

as can QueueRound:

```c#
private static uint QueueRound(uint hash, uint queuedValue)
{
    hash += queuedValue * Prime3;
    return Rol(hash, 17) * Prime4;
}
```

So I don't honestly see how this would work :(

How awful would it be to add the HashCode source to the project like Roslyn does (with IL) with (the much simpler) compiler attribute class definitions when they aren't available through any referenced assembly?

How do you envision this working? What would customers write, and what would the compilers then do in response?

Also, something that would address all of this is if .NET already had public helpers exposed on the surface API that convert from uint to int32 (and vice versa) without overflow.

Do those exist? If so, I can easily write the VB versions, just using these for the situations where we need to go between the types without overflowing.

Is table-generation code also unpalatable?

I would think so. I mean, think about this from a customer perspective. They just want a decent GetHashCode method that is nicely self contained and gives reasonable results. Having that feature go and bloat up their code with auxiliary crap is going to be pretty unpleasant. It's also pretty bad given that the C# experience will be just fine.

You might be able to get roughly the right overflow behavior by casting to and from some combination of signed and unsigned 64-bit types. Something like this (untested and I don't know VB casting syntax):

```
Dim hashCode = -252780983
hashCode = (Int32)((Int32)((UInt64)hashCode * -1521134295) + (UInt64)i.GetHashCode())
```

How do you know the following doesn't overflow?

`(Int32)((UInt64)hashCode * -1521134295)`

Or the final `(Int32)` cast for that matter?

I didn't realize it would use overflow-checked conv operations. I guess you could mask it down to 32 bits before casting:

`(Int32)(((UInt64)hashCode * -1521134295) & 0xFFFFFFFF)`

presumably 31 bits, as a value of uint32.Max would also overflow on conversion to Int32 :)

That's def possible. Ugly... but possible :) There's gunna be a lot of casts in this code.

Ok. I think I have a workable solution. The core of the algorithm we generate today is:

```c#
hashCode = hashCode * -1521134295 + j.GetHashCode();
```

Let's say that we're doing 64-bit math, but "hashCode" has been capped to 32 bits. Then <largest_32_bit> * -1521134295 + <largest_32_bit> will not overflow 64 bits. So we can always do the math in 64 bits, then clamp down to 32 (or 31) bits to ensure that the next round won't overflow.
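A quick worked check of that bound (sketch only; the exact value doesn't matter, just that checked math never throws here):

```c#
// Worst case: a full-magnitude 32-bit hashCode times the 31-bit constant,
// plus another full 32-bit hash. That's roughly 3.3e18, well under
// Int64.MaxValue (~9.2e18), so the 64-bit intermediate cannot overflow.
checked
{
    long worst = (long)int.MinValue * -1521134295L + int.MaxValue;
    Console.WriteLine(worst < long.MaxValue); // True, and no OverflowException
}
```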

Thanks!

@MaStr11 @morganbr @sharwell and everyone here. I've updated my code to generate the following for VB:

        Dim hashCode As Long = 2118541809
        hashCode = (hashCode * -1521134295 + a.GetHashCode()) And Integer.MaxValue
        hashCode = (hashCode * -1521134295 + b.GetHashCode()) And Integer.MaxValue
        Return CType(hashCode And Integer.MaxValue, Integer)

Can someone sanity check me to make sure that this makes sense and should not overflow even with checked mode on?

@CyrusNajmabadi , that won't overflow (because Int32.Max * Int32.Max fits comfortably within Int64.Max and your constants are much smaller than that), but you're masking the high bit to zero, so it's only a 31-bit hash. Is leaving the high bit on considered an overflow?

@CyrusNajmabadi hashCode is a Long that can be anywhere from 0 to Integer.MaxValue. Why am I getting this?

image

But no, it can't actually overflow.

Btw- I'd rather have Roslyn add a NuGet package than add a suboptimal hash.

but you're masking the high bit to zero, so it's only a 31-bit hash. Is leaving the high bit on considered an overflow?

That's a good point. I think I was thinking about another algorithm that was using uints. So in order to safely convert from the long to a uint, I needed to not include the sign bit. However, as this is all signed math, I think it would be fine to just mask against 0xffffffff, ensuring we only keep the bottom 32 bits after adding each entry.

I'd rather have Roslyn add a NuGet package than add a suboptimal hash.

Users can already do that if they want. This is about what to do when users do not, or can not, add those dependencies. This is also about providing a reasonably 'good enough' hash for users, i.e. something better than the common "x + y + z" approach that people often take. It's not intended to be 'optimal' because there's no good definition of what 'optimal' is when it comes to hashing for all users. Note that the approach we're taking here is the one already emitted by the compiler for anonymous types. It exhibits reasonably good behavior while not adding a ton of complexity to the user's code. Over time, as more and more users are able to move forward, such code can slowly disappear and be replaced with HashCode.Combine for most people.

So I worked on it a bit and came up with the following, which I think addresses all concerns:

        Dim hashCode As Long = 2118541809
        hashCode = (hashCode * -1521134295 + a.GetHashCode()).GetHashCode()
        hashCode = (hashCode * -1521134295 + b.GetHashCode()).GetHashCode()
        Return CType(hashCode, Integer)

The piece that's interesting is specifically calling .GetHashCode() on the int64 value produced by (hashCode * -1521134295 + a.GetHashCode()). Calling .GetHashCode on this 64-bit value has two good properties for our needs. First, it ensures that hashCode only ever stores a legal int32 value in it (which makes the final returning cast always safe to perform). Second, it ensures that we don't lose any valuable information in the upper 32 bits of the int64 temp value we're working with.
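For context on why that preserves the upper bits (an assumption about the BCL internals, but easy to verify): Int64.GetHashCode on .NET effectively folds the two halves together rather than truncating, roughly:

```c#
// Approximately what Int64.GetHashCode() does (sketch, not the exact framework source):
// XOR the low 32 bits with the high 32 bits, so the upper half still influences
// the result and the output always fits in an Int32.
static int FoldInt64(long value) =>
    unchecked((int)value ^ (int)(value >> 32));
```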

@CyrusNajmabadi Actually offering to install the package is what I was asking about. Saves me from having to do it.

If you type HashCode, then if System.HashCode is provided in an MS NuGet package, Roslyn will offer it.

I want it to generate the nonexistent GetHashCode overload and install the package in the same operation.

I don't think that's an appropriate choice for most users. Adding dependencies is a very heavyweight operation that users should not be forced into. Users can decide the right time to make those choices, and the IDE will respect it. That's been the approach we've taken with all our features up to now, and it's been a healthy one that people seem to like.

Note: what NuGet package is this API even being included in, for us to add a reference to?

The implementation is in System.Private.CoreLib.dll, so it would come as part of the runtime package. The contract is System.Runtime.dll.

Ok. If that's the case, then it sounds like a user would get this if/when they move to a more recent Target Framework. That sort of thing is not at all a step I would have the "generate equals+hashcode" feature do to a user's project.
