Roslyn: Visual Studio Code-Fix for generating GetHashCode results in OverflowException in checked assembly

Created on 4 Jan 2018 · 36Comments · Source: dotnet/roslyn

Version Used:
Microsoft Visual Studio Enterprise 2017
Version 15.5.2

Steps to Reproduce:

Create a new project with "Check for arithmetic overflow/underflow" enabled in the project Build settings
Add the following class:

class Foo
{
    object bar;
    string baz;
}

Place the cursor immediately after Foo
Press Ctrl+. (period)
Select "Generate Equals and GetHashCode()..."
Create a new Foo object and call GetHashCode()

Full code:

using System.Collections.Generic;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            new Foo().GetHashCode();
        }
    }

    class Foo
    {
        object bar;
        string baz;

        public override bool Equals(object obj)
        {
            var foo = obj as Foo;
            return foo != null &&
                   EqualityComparer<object>.Default.Equals(bar, foo.bar) &&
                   baz == foo.baz;
        }

        public override int GetHashCode()
        {
            var hashCode = -1438245972;
            hashCode = hashCode * -1521134295 + EqualityComparer<object>.Default.GetHashCode(bar);
            hashCode = hashCode * -1521134295 + EqualityComparer<string>.Default.GetHashCode(baz);
            return hashCode;
        }
    }
}

Expected Behavior:

The compiled program exits with code 0, printing no output.

Actual Behavior:

The compiled program crashes with a System.OverflowException:

System.OverflowException
  HResult=0x80131516
  Message=Arithmetic operation resulted in an overflow.
  Source=ConsoleApp1
  StackTrace:
   at ConsoleApp1.Foo.GetHashCode() in C:\temp\ConsoleApp1\ConsoleApp1\Program.cs:line 29
   at ConsoleApp1.Program.Main(String[] args) in C:\temp\ConsoleApp1\ConsoleApp1\Program.cs:line 9

I expect that the IDE code-fix should wrap the method body in an unchecked block when the assembly is checked.

Area-IDE Bug Resolution-Fixed

Source

yaakov-h

All 36 comments

Duplicate of #24093

MaStr11 on 9 Jan 2018

@MaStr11 I went the other way around with the duplicates, but thanks for pointing it out 👍

sharwell on 9 Jan 2018

Comment copied from #24093

This is the method than needs to be changed:
ICodeDefinitionFactoryExtensions_CreateGetHashCodeMethod.cs
This method currently doesn't mentions overflows in checked environments at all. ~@CyrusNajmabadi did most of the changes in commit f6eaea66.~ The history of this goes back until before Roslyn was open sourced. ~He may can answer why that problem wasn't taken into account.~

Albeit there is a problem with VB to consider: It doesn't support checked/unchecked (see the stackoverflow thread Overriding GetHashCode in VB without checked/unchecked keyword support? for details). I think that's also why there aren't any Checked or Unchecked methods on SyntaxGenerator.

IMHO this is best solved by providing language specific implementations. And I fear that the VB version would be quite different from the C# version and needs to be well designed before it can be implemented here. There are some implementations given in the stackoverflow thread above but some of them can't be used (need an extra struct), are using a long as intermediate value or might be not well balanced.

MaStr11 on 9 Jan 2018

He may can answer why that problem wasn't taken into account.

I didn't think of it when i wrote it originally :)

CyrusNajmabadi on 9 Jan 2018

Related dotnet/corefx#25013. The fixer could generate calls to System.HashCode.Combine on supported (future) platforms.

MaStr11 on 10 Jan 2018

Yes. The corefx bug was originally pushed on (by us) for this precise reason :)

CyrusNajmabadi on 10 Jan 2018

See: https://github.com/dotnet/roslyn/issues/17646

CyrusNajmabadi on 10 Jan 2018

As far as I can tell, that only addresses the issue when targeting >= .NET Core 2.1 and/or some hypothetical future version of .NET Framework / Mono / Xamarin.

Is there a (reasonable) way to fix this for the runtimes that are in use today?

yaakov-h on 10 Jan 2018

Is there a (reasonable) way to fix this for the runtimes that are in use today?

I see two possible approaches:

Keep the existing algorithm for VB and C

This would require language specific additions to the currently generated MethodStatements:
C#
```C#
unchecked
{
// current MethodStatements
}

VB
```VB
' The following implementation relies on integer overflows. Make sure to 
' compile your assembly with /removeintchecks+ compiler option
' https://docs.microsoft.com/en-us/dotnet/visual-basic/reference/command-line-compiler/removeintchecks
' current MethodStatements

Use an algorithm that doesn't relies on overflows

This will likely be some combination of shift and xor. Given the discussions in the corefx repo (dotnet/corefx#8034 Adding a HashCode type to help with combining hash codes, and the follow up dotnet/corefx#14354) with more than 350 comments it seems difficult to find a consensuses what a good algorithm might look like.

Just to get started. This would be my list of requirements (derived from the currently generated code)

It is compact (one line per member, plus one line for init of a seed and the return statement)
It is easy to exclude members from the hash by deleting the line
It is language independent (does not rely on C# only or VB only language features)
It has reasonable hashing quality (whatever that means)
Is readily understandable and self-explanatory without being too mystical
Avoids NREs on member access

I'm by no means an hash algorithm expert but I looked around at stackoverflow to find an algorithm that fits to the requirements above and it seems xor and shift might be a fit. That is based on a comment by @JonHanna that is somewhat buried in a longer thread (emphases by me):

A simple shift on one of the values before xoring is only marginal extra cost for a good reduction in collision. Prime-number multiplication increasing both time and quality again. Which is better between shift or mult is hence debatable. Plain xor though very often has a lot of collisions on real data and is best avoided

But there is a catch (Source MSDN Object.GetHashCode):

A second alternative solution involves weighting the individual hash codes by left-shifting the hash codes of successive fields by two or more bits. Optimally, instead of being discarded, bits shifted beyond bit 31 should wrap around rather than be discarded. Since bits are discarded by the left-shift operators in both C# and Visual Basic, this requires creating a left shift-and-wrap method like the following:

MaStr11 on 10 Jan 2018

A rotation through something like hashCode = (hashCode << 5) | (hashCode >> 27) would wrap reasonably well and I AFAIK the jitter is smart enough to turn that into a single rotation operation.

JonHanna on 10 Jan 2018

hashCode = (hashCode << 5) | (hashCode >> 27)

Nice. Added Isn't (too) mystical to the list of requirements though :stuck_out_tongue_winking_eye:

MaStr11 on 10 Jan 2018

I was thinking of "isn't too mystical" as one of the benefits of the above. That the jitter treats it well is perhaps a little mystical, but not something one has to know. I'd expect someone who got the point of a shift to see that two shifts could make sense, even if not quite getting the point of a particular combination.

JonHanna on 10 Jan 2018

@JonHanna Agreed. I added readily understandable and self-explanatory to be more precise and raise the bar a bit.

MaStr11 on 10 Jan 2018

I think it's enough that someone could be able to look a the code and say "that should certainly mix up the bits, I hope it mixes them well" rather than being able to understand every nuance. (Even the very well known multiply-by-31 isn't fully understood AFAIK having seen a study [alas I can't find it] that found it worked better in practice than multiplying by other Mersenne primes, but the authors couldn't explain why).

JonHanna on 10 Jan 2018

👍1

@JonHanna given your experience in this field and your general reputation your voice carries far more weight than mine. So I think it would be valuable to have one or more concrete recommendations from you. May I ask you what your preferred generated code would be based on what's generated today:
C# public override int GetHashCode() { var hashCode = -246786518; hashCode = hashCode * -1521134295 + EqualityComparer<PropertyClass1>.Default.GetHashCode(Property1); hashCode = hashCode * -1521134295 + EqualityComparer<PropertyClass2>.Default.GetHashCode(Property2); return hashCode; }

📋

The requirements above should be met (especially: don't rely on overflow)
xor and shift are just uneducated guesses by me so any other algorithm is welcome
Maybe a local helper function to combine two hashes is acceptable, but that's beyond me to decide.

MaStr11 on 10 Jan 2018

given your experience in this field

Does many years of flip-flopping really count as experience :grin: (In my defence, there's no perfect balance between simplicity, speed of execution and quality of hash, especially in general cases that code generation has to deal with, so I don't feel too bad for moving between those).

As far as I can reason, I think you're right about xor and shift.

Using System.HashCode.Combine where possible would IMO be ideal, but that of course isn't always available.

I'd consider:

C# public override int GetHashCode() { var hashCode = -246786518; hashCode ^= EqualityComparer<PropertyClass1>.Default.GetHashCode(Property1); hashCode = (hashCode << 5) | (hashCode >> 27); // 32 - 5 hashCode ^= EqualityComparer<PropertyClass2>.Default.GetHashCode(Property2); return hashCode; }

This make rotating the bits an intermediary step, between each addition of an affecting property, so it avoids both the issues around XOR alone and around losing bits due to shifting in just one direction.

There might be better ideas to be gleaned from the discussions that happened at https://github.com/dotnet/corefx/issues/14354 and https://github.com/dotnet/corefx/issues/8034 (though they didn't have the no-overflow requirement). And better input from those who contributed to that work.

JonHanna on 10 Jan 2018

👍1

It may be worth considering adjusting the amount rotated to be 32 / propertyCount so that if there are e.g. four properties and then tend to have small values (relatively common) they end up distributed throughout the whole 32-bit range. This would perform badly on factor-of-two–based hash tables, though (rotation does particularly less well than multiplication generally, for that matter).

JonHanna on 10 Jan 2018

I dont' think we have to overthink things greatly. The purpose of the existing feature is to just generally give a decent hashing starting point that is better htan the a + b + c ... or a ^ b^ c ... approach that people often take. I used the existing algorithm simply because it's what the C# and VB compilers already produce for anonymous types, and that's been sufficient for now.

I would not deeply overthink the situation here. For c# we can just add the unchecked block if the project has checked mode on. For VB we can add a comment. Users can then change things accordingly.

CyrusNajmabadi on 10 Jan 2018

For VB we can add a comment. Users can then change things accordingly.

Do you mean something like what I proposed a few comments above:

' The following implementation relies on integer overflows. Make sure to
' compile your assembly with /removeintchecks+ compiler option
' https://docs.microsoft.com/en-us/dotnet/visual-basic/reference/command-line-compiler/removeintchecks
' current MethodStatements

MaStr11 on 10 Jan 2018

Yes. We would add this if we saw that the code was compiled with overflow checks on.

CyrusNajmabadi on 10 Jan 2018

Note:

I would be fine going with something like:

c# hashCode ^= EqualityComparer<PropertyClass1>.Default.GetHashCode(Property1); hashCode = (hashCode << 5) | (hashCode >> 27)

As long as its semi reasonable and produces something better than " + + + + + ". My major point was that we don't consider it important to produce an "ultra amazing hash that solves all hashing problems out there". If you have such a need, use a library. If you just want to make your simple data types not exhibit common pathological problems, use our output.

CyrusNajmabadi on 11 Jan 2018

Ok. Whipped up the fix.

CyrusNajmabadi on 11 Jan 2018

If you just want to make your simple data types not exhibit common pathological problems, use our output.

Or if you just want to shut up FxCop's CA2218. 😄

yaakov-h on 11 Jan 2018

That too :)

CyrusNajmabadi on 11 Jan 2018

Hey guys/ @JonHanna , @sharwell asked the following https://github.com/dotnet/roslyn/pull/24161#pullrequestreview-88326987:

❓ Did you mean for this to be a rotation? If so, the right shift needs to be unsigned.

What's the appropriate thing to do here? Should i just do everything unsigned, and then cast to signed int back on the result? Or should i used signed throughout, but do an explict (int)((uint)hashCode >> 27) to ensure an unsigned right shift. Or are any changes here unnecessary?

Thanks!

CyrusNajmabadi on 12 Jan 2018

I don't know. All I can say is that @sharwell seems to be right with that uint needs to be used given this example from MSDN:
```C#
public int ShiftAndWrap(int value, int positions)
{
positions = positions & 0x1F;

// Save the existing bit pattern, but interpret it as an unsigned integer.
uint number = BitConverter.ToUInt32(BitConverter.GetBytes(value), 0);
// Preserve the bits to be discarded.
uint wrapped = number >> (32 - positions);
// Shift and wrap the discarded bits.
return BitConverter.ToInt32(BitConverter.GetBytes((number << positions) | wrapped), 0);

}
```
Source Object.GetHashCode

MaStr11 on 12 Jan 2018

Ok. That's certainly annoying as it will very much clutter up the code. But i guess it's necessary :)

CyrusNajmabadi on 12 Jan 2018

as it will very much clutter up the code

What about using language dependent implementations as C# has the unchecked so we could keep the current implementation? The current implementation has the following advantages (taken from this comment above):

It is compact (one line per member, plus one line for init of a seed and the return statement)
It is easy to exclude members from the hash by deleting the line
~It is language independent (does not rely on C# only or VB only language features)~
It has reasonable hashing quality (whatever that means)
Is readily understandable and self-explanatory without being too mystical
Avoids NREs on member access

MaStr11 on 12 Jan 2018

What about using language dependent implementations as C# has the unchecked so we could keep the current implementation?

But then waht about VB? I'd far prefer to have one single approach for both languages.

It is compact (one line per member, plus one line for init of a seed and the return statement)

Enh. if you want compact, go use HashCode.Combine when it's available. THis is generated code, as long as it's not insane, i'm not sure there's a problem.

It is easy to exclude members from the hash by deleting the line

Enh. Just delete two lines.

Avoids NREs on member access

Same with the new one.

CyrusNajmabadi on 12 Jan 2018

Or should i used signed throughout, but do an explict (int)((uint)hashCode >> 27) to ensure an unsigned right shift.

Probably the better. (The cast is nicely a nop as far as the CIL goes).

JonHanna on 12 Jan 2018

The conversion fails also (in checked environments). See #24161.

MaStr11 on 12 Jan 2018

So what's a good algorithm that can work in checked conditions? Need something that's reasonable for VB when compiled with checked-mode on.

Note: if there isn't one, i'm fine with having C# just wrap the original code with an unchecked{} block, and having a comment emitted at the top of VB that this could be a problem if checked-mode is on.

CyrusNajmabadi on 12 Jan 2018

The conversion fails also (in checked environments).

D'oh!

i'm fine with having C# just wrap the original code with an unchecked{}

I'm inclined to think this is the best approach here.

JonHanna on 12 Jan 2018

PR has been updated to try out this approach. Feedback welcome.

CyrusNajmabadi on 13 Jan 2018

So what's a good algorithm that can work in checked conditions?

After I slept over it, I found two options that might gone work.

Option 1 - Shift left by 1

```C#
hashCode = (hashCode << 1) + (hashCode < 0 ? 1 : 0);


Shift left by 1 is basically an overflow save multiplication by two. Only the sign gets lost in the operation and is therefore wrapped *by hand*.

## Option 2 – Use ^ the instead of |

```C#
hashCode = (hashCode << 5) ^ (hashCode >> 27)

This isn’t a rotation anymore, but it solves the problem with int being signed:

If the first operand is an int or long, the right-shift is an arithmetic shift (high-order empty bits are set to the sign bit).

Source MSDN

Examples (bits inserted by shift emphased):

This would also work with an |.

637.770.668
Orginal      : 00100110000000111001101110101100
ShiftLeft 5  : 110000000111001101110101100*00000*
ShiftRight 27: *000000000000000000000000000*00100
Final  xor   : 11000000011100110111010110000100

But this would have discarded 27 bits if | is used:


-637.770.668
Orginal      : 11011001111111000110010001010100
ShiftLeft 5  : 001111111000110010001010100*00000*
ShiftRight 27: *111111111111111111111111111*11011
Final xor    : 11000000011100110111010101111011

Not being a rotation isn’t a problem because we only need something that shuffles around the bits (see next section)

Option n – Find another mapping function

What hashCode = (hashCode << 5) ^ (hashCode >> 27) needs to solve is making the sequence of a.GetHashcode() ^ b.GetHashcode() matter (a=5 and b=7 gives the same result as a=7 and b=5). This is done by mapping the value of a.GetHashcode() to another value before the xor is applied.
Therefore, we are looking for an operation that maps every possible value of int (int.MinValue to int.MaxValue) to another value of int, so that:

There is a 1:1 relationship between the original and the mapped value (The original is mapped to a unique value, operation can be reversed)
The original and the mapped value are shuffled around (successive original values are no
longer successive (or even close to the original) in the mapped set). This is needed to avoid collisions if GetHashcode() returns values biased to a certain range (as e.g. int.GetHashcode() does by returning its identity and that is likely in the lower positive numbers range).
Option 1 and 2 should fulfill the first condition but option 2 better fulfills the second.

I'm pretty sure there are other shuffeling algorithms that can do that (without relying on overflows). One example is to xor with a constant (e.g. 0b10101010101010101010101010101010).

Disclaimer: None of the two options are thoughtfully tested.

MaStr11 on 13 Jan 2018

@MaStr11 I have asked the question on https://github.com/dotnet/corefx/issues/14354 to see if any of the participants there have a good idea about what may be a good idea here.

CyrusNajmabadi on 15 Jan 2018

Was this page helpful?

0 / 5 - 0 ratings