We've run into an issue where we get lots of InconsistentStateExceptions, such as this:

I have created a project to reproduce this issue; it is the OrleansInconsistentStateException folder from my experimental BitBucket repository.
Change the connection string in the silo project's OrleansConfiguration.xml.
The client program repeatedly calls an async void method. This is intentional; it simulates the case we ran into where event handlers from a third-party library fire and we can't await them.
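Roughly, the pattern looks like this (an illustrative sketch; the actual code is in the linked repository):

// Illustrative only: the event-handler name is made up; IPerson and SayHelloAsync are from the sample.
for (int i = 0; i < 1000; i++)
{
    OnThirdPartyEvent(i); // async void, so the loop never awaits it
}

// Simulates a third-party event handler firing: each call starts a grain
// request that overlaps with all the others.
static async void OnThirdPartyEvent(int i)
{
    var friend = GrainClient.GrainFactory.GetGrain<IPerson>(i);
    await friend.SayHelloAsync("Hello!");
}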
It is possible that you'll get database-level deadlocks instead of InconsistentStateExceptions at times. Rerun it a few times.
Both the InconsistentStateExceptions and database-level deadlocks are problematic. The load isn't particularly large here.
@dandago Thanks for persevering and working on the issue. I'll link what seems to be the conclusion of the relevant discussion on Gitter for the benefit of others: https://gitter.im/dotnet/orleans?at=581f59e12d4796175f4f637a.
This is a particularly troublesome scenario, because the storage assumes SELECT and UPDATE operations dominate and tries to minimise space, memory usage and index maintenance, and now there's a flood of INSERT operations in what is a mixed-load scenario. Could you try the change that fixed the issue for @danvanderboom?
Basically remove the HOLDLOCK hint from WriteToStorageKey, run ALTER TABLE Storage SET (LOCK_ESCALATION = DISABLE); and then modify both WriteToStorageKey and ClearStorageKey to have the UPDATE like this: UPDATE Storage WITH (ROWLOCK)?
@veikkoeeva I just tried it, it still happens.
I have managed to replicate the InconsistentStateExceptions with as few as 20 iterations of the loop (with the modified script).
I'm attaching the modified script I used according to your instructions.
@dandago I'll try to check the code today. Just looking at the code I see that
var friend = GrainClient.GrainFactory.GetGrain<IPerson>(i);
await friend.SayHelloAsync("Hello!"); // <<<<-------- !!!!!
is called in an async void function that is called in a for loop here. It instantly creates multiple activations.
As a more general note, as you can see in the script, an InconsistentStateException means that if the grain state version (ETag) was null, an insert was attempted, and if the insert didn't change the version number, the insert condition failed and it is reported as an InconsistentStateException. Likewise for the update branch (taken when the state version is not null), which is lighter on locking.
The call to client initialization should be outside the while loop, and the call to say hello should be awaited or waited on.
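Roughly what I mean (an illustrative sketch, not the exact code from the sample):

// Initialize the client once, outside the loop, and await (or wait on) each call
// so the grain requests don't pile up concurrently.
GrainClient.Initialize();

for (int i = 0; i < 1000; i++)
{
    var friend = GrainClient.GrainFactory.GetGrain<IPerson>(i);
    friend.SayHelloAsync("Hello!").Wait(); // or await it from an async method
}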
@dandago (and @danvanderboom) I can confirm there is a problem, and it basically boils down to either having deadlocks, or not having them but having other problems. The example from @dandago seems to work if a slight delay is introduced in the loop (I had to increase the loop counter to reproduce reliably, but I believe I have the hang of it now). I'll see what I can do about this.
@gabikliot It appears the initialization is in the loop to wait for the silo to come up alive. Can you tell if there should be a problem in feeding the client with what amounts to multiple concurrent calls? I modified the database script a bit to remove deadlocking when inserting the state for the first time, say for about 10 000 calls. What I find weird is that when running a loop like this, some grains end up in the database with a Jenkins hash different from the others, say about 150 of the just-mentioned 10 000 calls. Then making another round of calls fails these 150 with InconsistentStateException, since no version can be found for them.
This hash thing is rather puzzling, since the hash is taken consistently (i.e. here), and JenkinsHash looks to be deterministic too.
I also modified the example to use reliable storage and gateway from the DB.
@gabikliot It appears Encoding.UTF8 isn't thread safe and that causes the problem. Man...
Just confirmed that the way the ADO.NET storage uses the JenkinsHash function is problematic. In this specific case, calling 10 000 times in a tight loop introduces about 100–150 objects with a badly hashed ID. I just wrote a small test for myself, like this:
const int TestGrainHash = 1491221312;
var grainType = "Grains.PersonGrain";
Parallel.For(0, 10000, i =>
{
    // Each iteration gets its own generator, so no state is shared between threads.
    var jenkins = JenkinsHash.Factory.GetHashGenerator(threadLocal: true);
    int grainTypeHash = unchecked((int)jenkins.ComputeHash(new UTF8Encoding().GetBytes(grainType)));
    Assert.Equal(TestGrainHash, grainTypeHash);
});
that seems to confirm this finding. Having the generator created outside the loop will likely fail. (Note that in that example the thread-local variable isn't actually needed.)
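For comparison, the variant I would expect to trip things up shares a single generator across the parallel loop (a sketch along the lines of the test above):

// Sharing one generator instance across threads: the generator keeps mutable state,
// so concurrent ComputeHash calls can interfere with each other.
var jenkins = JenkinsHash.Factory.GetHashGenerator(threadLocal: false);
Parallel.For(0, 10000, i =>
{
    int grainTypeHash = unchecked((int)jenkins.ComputeHash(new UTF8Encoding().GetBytes(grainType)));
    Assert.Equal(TestGrainHash, grainTypeHash); // may fail intermittently under contention
});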
I think I can provide fixes already today. Also for the logging issue where the ID in the failure case is concatenated twice in the row.
<edit: This code snippet has caused a great deal of confusion. For the benefit of anyone reading this, let me try to be clear: this snippet was the end result of something that did not crash, irrespective of the threadLocal: true parameter. This was discussed in chat and I tried to provide a few failure modes; here is the one that tripped me: https://gitter.im/dotnet/orleans?at=58237b87e462097a302ba4ab.>
Wow!!!! Amazing! We had a similar bug years ago, when we discovered that Random was not thread safe. It tends to always return zero if used in a non-thread-safe manner.
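A minimal illustration of that Random failure mode (illustrative sketch; the corruption is timing-dependent, so it may take several runs to observe):

// Sharing one Random instance across threads can corrupt its internal state;
// once that happens it keeps returning zero.
var random = new Random();
int zeros = 0;
Parallel.For(0, 1000000, i =>
{
    if (random.Next() == 0)
        Interlocked.Increment(ref zeros);
});
Console.WriteLine("Zeros returned: " + zeros); // a large count indicates corrupted state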
We are still testing 1.3.1. There were some perf concerns we are checking. If the fix is available, we could include it in 1.3.1.
Just to clarify, the bug is only in the AdoNetStorageProvider, right?
All I can say is it's not in the MemoryStorageProvider. I haven't tried any others.
@sergeybykov Well, I managed to replicate the problem in tests and fix the one about thread-safe hashing and the other one about deadlocking. For some reason I'm currently unable to get @dandago's solution to run (it complains about the JSON serializer), even though I managed to run it earlier. So, in that respect I'm not currently 100% sure all is good. I can open a PR shortly if that is OK.
@dandago and @danvanderboom I managed to replicate the problem consistently in tests and fix it according to those tests. I hit a problem testing with @dandago's solution (it complains about JSON.NET) and I have to postpone a retry until later (I hope before Friday, but it depends on how quickly I get it solved and how confident you are that this fixes things). If you want to give the fixes a try, they're in the linked branch.
I managed to try the example solution and, given a large enough loop counter, the hash corruption bug still manifests itself. The culprit is to be found in these spots: https://github.com/dotnet/orleans/pull/2395/files#diff-3513be85e1d850f44d469c131ef1bfb1R231. It puzzles me what happens between the string grainType parameter and the resulting grainTypeHash that, for a small number of calls, produces a different hash than for the others. I'll try to investigate tomorrow, but if anyone can spot anything, all the better. :) I also introduced a test for this case: https://github.com/dotnet/orleans/pull/2395/files#diff-8800f3c30eebdc4e28b5d0891a7d3fadR19 (sadly, it passes).
This can be detected, e.g. by running the linked sample program and then querying:
SELECT COUNT(*) FROM Storage; (check there is the same number of rows as iterations of the counter loop)
SELECT * FROM Storage ORDER BY GrainIdN1; (all the grains)
SELECT * FROM Storage WHERE GrainTypeHash <> 1491221312 ORDER BY GrainIdN1; (the bad grains).
As an added note, I've considered adding an (unbounded) cache for the hashes, which may also mitigate this problem, although I'd first like to know what is actually happening here.
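A rough sketch of what such a cache might look like (illustrative only; the names are mine, not the provider's):

// Hypothetical memoization of grain type hashes: each distinct grain type name is
// hashed once, so concurrent writers never race on a shared hash generator.
static readonly ConcurrentDictionary<string, int> GrainTypeHashCache = new ConcurrentDictionary<string, int>();

static int GetGrainTypeHash(string grainType)
{
    return GrainTypeHashCache.GetOrAdd(grainType, type =>
    {
        var jenkins = JenkinsHash.Factory.GetHashGenerator(threadLocal: true);
        return unchecked((int)jenkins.ComputeHash(Encoding.UTF8.GetBytes(type)));
    });
}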
@sergeybykov, @dandago, @danvanderboom The trouble in testing seems to be located between the chair and the keyboard. I did a new run by linking the provided solution directly to Orleans, fixing the references and running it, and the fixes seem to work. I ran loops of 100 000, first inserting, then updating, shutting down the program, and then running it again. No deadlocks, no InconsistentStateExceptions, version numbers as they should be. This should confirm that I could consistently replicate the trouble in tests and that the fix resolved it in practice, according to my understanding of the situation.
I will supply some additional comments on some parts of the code later today, but otherwise I would dare to claim this problem has been solved.
We are still narrowing down a perf regression in the 1.3.1 candidate. So if you submit a PR, we should be able to take it for 1.3.1, I think.