Version Used:
Microsoft Visual Studio Enterprise 2017
Version 15.5.2
Steps to Reproduce:
class Foo
{
object bar;
string baz;
}
FooFoo object and call GetHashCode()Full code:
using System.Collections.Generic;
namespace ConsoleApp1
{
class Program
{
static void Main(string[] args)
{
new Foo().GetHashCode();
}
}
class Foo
{
object bar;
string baz;
public override bool Equals(object obj)
{
var foo = obj as Foo;
return foo != null &&
EqualityComparer<object>.Default.Equals(bar, foo.bar) &&
baz == foo.baz;
}
public override int GetHashCode()
{
var hashCode = -1438245972;
hashCode = hashCode * -1521134295 + EqualityComparer<object>.Default.GetHashCode(bar);
hashCode = hashCode * -1521134295 + EqualityComparer<string>.Default.GetHashCode(baz);
return hashCode;
}
}
}
Expected Behavior:
The compiled program exits with code 0, printing no output.
Actual Behavior:
The compiled program crashes with a System.OverflowException:
System.OverflowException
HResult=0x80131516
Message=Arithmetic operation resulted in an overflow.
Source=ConsoleApp1
StackTrace:
at ConsoleApp1.Foo.GetHashCode() in C:\temp\ConsoleApp1\ConsoleApp1\Program.cs:line 29
at ConsoleApp1.Program.Main(String[] args) in C:\temp\ConsoleApp1\ConsoleApp1\Program.cs:line 9
I expect that the IDE code-fix should wrap the method body in an unchecked block when the assembly is checked.
Duplicate of #24093
@MaStr11 I went the other way around with the duplicates, but thanks for pointing it out 👍
Comment copied from #24093
This is the method than needs to be changed:
ICodeDefinitionFactoryExtensions_CreateGetHashCodeMethod.cs
This method currently doesn't mentions overflows in checked environments at all. ~@CyrusNajmabadi did most of the changes in commit f6eaea66.~ The history of this goes back until before Roslyn was open sourced. ~He may can answer why that problem wasn't taken into account.~
Albeit there is a problem with VB to consider: It doesn't support checked/unchecked (see the stackoverflow thread Overriding GetHashCode in VB without checked/unchecked keyword support? for details). I think that's also why there aren't any Checked or Unchecked methods on SyntaxGenerator.
IMHO this is best solved by providing language specific implementations. And I fear that the VB version would be quite different from the C# version and needs to be well designed before it can be implemented here. There are some implementations given in the stackoverflow thread above but some of them can't be used (need an extra struct), are using a long as intermediate value or might be not well balanced.
He may can answer why that problem wasn't taken into account.
I didn't think of it when i wrote it originally :)
Related dotnet/corefx#25013. The fixer could generate calls to System.HashCode.Combine on supported (future) platforms.
Yes. The corefx bug was originally pushed on (by us) for this precise reason :)
As far as I can tell, that only addresses the issue when targeting >= .NET Core 2.1 and/or some hypothetical future version of .NET Framework / Mono / Xamarin.
Is there a (reasonable) way to fix this for the runtimes that are in use today?
Is there a (reasonable) way to fix this for the runtimes that are in use today?
I see two possible approaches:
This would require language specific additions to the currently generated MethodStatements:
C#
```C#
unchecked
{
// current MethodStatements
}
VB
```VB
' The following implementation relies on integer overflows. Make sure to
' compile your assembly with /removeintchecks+ compiler option
' https://docs.microsoft.com/en-us/dotnet/visual-basic/reference/command-line-compiler/removeintchecks
' current MethodStatements
This will likely be some combination of shift and xor. Given the discussions in the corefx repo (dotnet/corefx#8034 Adding a HashCode type to help with combining hash codes, and the follow up dotnet/corefx#14354) with more than 350 comments it seems difficult to find a consensuses what a good algorithm might look like.
Just to get started. This would be my list of requirements (derived from the currently generated code)
I'm by no means an hash algorithm expert but I looked around at stackoverflow to find an algorithm that fits to the requirements above and it seems xor and shift might be a fit. That is based on a comment by @JonHanna that is somewhat buried in a longer thread (emphases by me):
A simple shift on one of the values before xoring is only marginal extra cost for a good reduction in collision. Prime-number multiplication increasing both time and quality again. Which is better between shift or mult is hence debatable. Plain xor though very often has a lot of collisions on real data and is best avoided
But there is a catch (Source MSDN Object.GetHashCode):
A second alternative solution involves weighting the individual hash codes by left-shifting the hash codes of successive fields by two or more bits. Optimally, instead of being discarded, bits shifted beyond bit 31 should wrap around rather than be discarded. Since bits are discarded by the left-shift operators in both C# and Visual Basic, this requires creating a left shift-and-wrap method like the following:
A rotation through something like hashCode = (hashCode << 5) | (hashCode >> 27) would wrap reasonably well and I AFAIK the jitter is smart enough to turn that into a single rotation operation.
hashCode = (hashCode << 5) | (hashCode >> 27)
Nice. Added Isn't (too) mystical to the list of requirements though :stuck_out_tongue_winking_eye:
I was thinking of "isn't too mystical" as one of the benefits of the above. That the jitter treats it well is perhaps a little mystical, but not something one has to know. I'd expect someone who got the point of a shift to see that two shifts could make sense, even if not quite getting the point of a particular combination.
@JonHanna Agreed. I added readily understandable and self-explanatory to be more precise and raise the bar a bit.
I think it's enough that someone could be able to look a the code and say "that should certainly mix up the bits, I hope it mixes them well" rather than being able to understand every nuance. (Even the very well known multiply-by-31 isn't fully understood AFAIK having seen a study [alas I can't find it] that found it worked better in practice than multiplying by other Mersenne primes, but the authors couldn't explain why).
@JonHanna given your experience in this field and your general reputation your voice carries far more weight than mine. So I think it would be valuable to have one or more concrete recommendations from you. May I ask you what your preferred generated code would be based on what's generated today:
C#
public override int GetHashCode()
{
var hashCode = -246786518;
hashCode = hashCode * -1521134295 + EqualityComparer<PropertyClass1>.Default.GetHashCode(Property1);
hashCode = hashCode * -1521134295 + EqualityComparer<PropertyClass2>.Default.GetHashCode(Property2);
return hashCode;
}
📋
given your experience in this field
Does many years of flip-flopping really count as experience :grin: (In my defence, there's no perfect balance between simplicity, speed of execution and quality of hash, especially in general cases that code generation has to deal with, so I don't feel too bad for moving between those).
As far as I can reason, I think you're right about xor and shift.
Using System.HashCode.Combine where possible would IMO be ideal, but that of course isn't always available.
I'd consider:
C#
public override int GetHashCode()
{
var hashCode = -246786518;
hashCode ^= EqualityComparer<PropertyClass1>.Default.GetHashCode(Property1);
hashCode = (hashCode << 5) | (hashCode >> 27); // 32 - 5
hashCode ^= EqualityComparer<PropertyClass2>.Default.GetHashCode(Property2);
return hashCode;
}
This make rotating the bits an intermediary step, between each addition of an affecting property, so it avoids both the issues around XOR alone and around losing bits due to shifting in just one direction.
There might be better ideas to be gleaned from the discussions that happened at https://github.com/dotnet/corefx/issues/14354 and https://github.com/dotnet/corefx/issues/8034 (though they didn't have the no-overflow requirement). And better input from those who contributed to that work.
It may be worth considering adjusting the amount rotated to be 32 / propertyCount so that if there are e.g. four properties and then tend to have small values (relatively common) they end up distributed throughout the whole 32-bit range. This would perform badly on factor-of-two–based hash tables, though (rotation does particularly less well than multiplication generally, for that matter).
I dont' think we have to overthink things greatly. The purpose of the existing feature is to just generally give a decent hashing starting point that is better htan the a + b + c ... or a ^ b^ c ... approach that people often take. I used the existing algorithm simply because it's what the C# and VB compilers already produce for anonymous types, and that's been sufficient for now.
I would not deeply overthink the situation here. For c# we can just add the unchecked block if the project has checked mode on. For VB we can add a comment. Users can then change things accordingly.
For VB we can add a comment. Users can then change things accordingly.
Do you mean something like what I proposed a few comments above:
' The following implementation relies on integer overflows. Make sure to
' compile your assembly with /removeintchecks+ compiler option
' https://docs.microsoft.com/en-us/dotnet/visual-basic/reference/command-line-compiler/removeintchecks
' current MethodStatements
Yes. We would add this if we saw that the code was compiled with overflow checks on.
Note:
I would be fine going with something like:
c#
hashCode ^= EqualityComparer<PropertyClass1>.Default.GetHashCode(Property1);
hashCode = (hashCode << 5) | (hashCode >> 27)
As long as its semi reasonable and produces something better than " + + + + + ". My major point was that we don't consider it important to produce an "ultra amazing hash that solves all hashing problems out there". If you have such a need, use a library. If you just want to make your simple data types not exhibit common pathological problems, use our output.
Ok. Whipped up the fix.
If you just want to make your simple data types not exhibit common pathological problems, use our output.
Or if you just want to shut up FxCop's CA2218. 😄
That too :)
Hey guys/ @JonHanna , @sharwell asked the following https://github.com/dotnet/roslyn/pull/24161#pullrequestreview-88326987:
❓ Did you mean for this to be a rotation? If so, the right shift needs to be unsigned.
What's the appropriate thing to do here? Should i just do everything unsigned, and then cast to signed int back on the result? Or should i used signed throughout, but do an explict (int)((uint)hashCode >> 27) to ensure an unsigned right shift. Or are any changes here unnecessary?
Thanks!
I don't know. All I can say is that @sharwell seems to be right with that uint needs to be used given this example from MSDN:
```C#
public int ShiftAndWrap(int value, int positions)
{
positions = positions & 0x1F;
// Save the existing bit pattern, but interpret it as an unsigned integer.
uint number = BitConverter.ToUInt32(BitConverter.GetBytes(value), 0);
// Preserve the bits to be discarded.
uint wrapped = number >> (32 - positions);
// Shift and wrap the discarded bits.
return BitConverter.ToInt32(BitConverter.GetBytes((number << positions) | wrapped), 0);
}
```
Source Object.GetHashCode
Ok. That's certainly annoying as it will very much clutter up the code. But i guess it's necessary :)
as it will very much clutter up the code
What about using language dependent implementations as C# has the unchecked so we could keep the current implementation? The current implementation has the following advantages (taken from this comment above):
What about using language dependent implementations as C# has the unchecked so we could keep the current implementation?
But then waht about VB? I'd far prefer to have one single approach for both languages.
It is compact (one line per member, plus one line for init of a seed and the return statement)
Enh. if you want compact, go use HashCode.Combine when it's available. THis is generated code, as long as it's not insane, i'm not sure there's a problem.
It is easy to exclude members from the hash by deleting the line
Enh. Just delete two lines.
Avoids NREs on member access
Same with the new one.
Or should i used signed throughout, but do an explict
(int)((uint)hashCode >> 27)to ensure an unsigned right shift.
Probably the better. (The cast is nicely a nop as far as the CIL goes).
The conversion fails also (in checked environments). See #24161.
So what's a good algorithm that can work in checked conditions? Need something that's reasonable for VB when compiled with checked-mode on.
--
Note: if there isn't one, i'm fine with having C# just wrap the original code with an unchecked{} block, and having a comment emitted at the top of VB that this could be a problem if checked-mode is on.
The conversion fails also (in checked environments).
D'oh!
i'm fine with having C# just wrap the original code with an unchecked{}
I'm inclined to think this is the best approach here.
PR has been updated to try out this approach. Feedback welcome.
So what's a good algorithm that can work in checked conditions?
After I slept over it, I found two options that might gone work.
```C#
hashCode = (hashCode << 1) + (hashCode < 0 ? 1 : 0);
Shift left by 1 is basically an overflow save multiplication by two. Only the sign gets lost in the operation and is therefore wrapped *by hand*.
## Option 2 – Use ^ the instead of |
```C#
hashCode = (hashCode << 5) ^ (hashCode >> 27)
This isn’t a rotation anymore, but it solves the problem with int being signed:
If the first operand is an int or long, the right-shift is an arithmetic shift (high-order empty bits are set to the sign bit).
Examples (bits inserted by shift emphased):
This would also work with an |.
637.770.668
Orginal : 00100110000000111001101110101100
ShiftLeft 5 : 110000000111001101110101100*00000*
ShiftRight 27: *000000000000000000000000000*00100
Final xor : 11000000011100110111010110000100
But this would have discarded 27 bits if | is used:
-637.770.668
Orginal : 11011001111111000110010001010100
ShiftLeft 5 : 001111111000110010001010100*00000*
ShiftRight 27: *111111111111111111111111111*11011
Final xor : 11000000011100110111010101111011
Not being a rotation isn’t a problem because we only need something that shuffles around the bits (see next section)
What hashCode = (hashCode << 5) ^ (hashCode >> 27) needs to solve is making the sequence of a.GetHashcode() ^ b.GetHashcode() matter (a=5 and b=7 gives the same result as a=7 and b=5). This is done by mapping the value of a.GetHashcode() to another value before the xor is applied.
Therefore, we are looking for an operation that maps every possible value of int (int.MinValue to int.MaxValue) to another value of int, so that:
I'm pretty sure there are other shuffeling algorithms that can do that (without relying on overflows). One example is to xor with a constant (e.g. 0b10101010101010101010101010101010).
Disclaimer: None of the two options are thoughtfully tested.
@MaStr11 I have asked the question on https://github.com/dotnet/corefx/issues/14354 to see if any of the participants there have a good idea about what may be a good idea here.