We've been getting this exception for a while in various grains, particularly one that is activated in a StartupTask and should never be deactivated. If I've understood correctly, this can happen when there are duplicate activations of a grain (perhaps during deployments), so the question is: what do you do when that happens? Do you have to catch it and call DeactivateOnIdle? Do you call ReadStateAsync and lose that data? We had a grain that, for the lifetime of the activation, could never again write to state, so I'm wondering how you recover from it. I couldn't find info on this in the docs, but maybe I missed it.
@martinothamar Just to be clear, was the situation such that it started upon the first failure and then kept repeating, or would you like to avoid this situation in this particular grain that should never be deactivated? I'm afraid grains are automatically deactivated upon InconsistentStateException, see https://github.com/dotnet/orleans/issues/1609.
Just in case, can you share the database vendor? This exception means there was neither an UPDATE nor an INSERT, which means, if there's state already, that the WHERE condition failed for one reason or another. A version (ETag) violation is likely the most common reason.
Let's page @sergeybykov too. I haven't in fact had problems with this and hence haven't paid attention. I should. :)
Yes, it kept repeating. The thing is that I already had a try/catch around the block in which I call WriteStateAsync, so the activation is never deactivated. I just log the error and move on (the rest of the call chain moves on). Wondering what the best practice is here.
Database vendor is MSSQL
@martinothamar Did you try to re-read the state to have the version updated?
No, right now it just logs and forgets that error - I wasn't expecting to run into this as often as I do. Is there any way to find the root cause of how I get into that state?
If there's a duplicate activation, how can I find out and handle it? What other causes are there?
Recently I had to deal with this issue as well.
For me the trouble came from interleaving code in stateful grains.
In particular, Orleans timers can be a source of this problem.
As soon as two calls end up in WriteStateAsync() at the same time, there's a good chance this error will happen.
In my opinion, ETags should prevent different activations from writing to the same state, but the current design doesn't even allow the same activation to have two writes in flight at once.
Right, one of the grains I've experienced this the most with is in fact triggered by a timer quite often.
Another grain I'm now experiencing this with subscribes to multiple streams (5+) and will write to state upon every message from every stream (maybe this is wrong).
Wouldn't expect interleaving in either case though?
Stacktrace:
Exc level 0: Orleans.Storage.InconsistentStateException: Version conflict (WriteState): ServiceId=68e89d8e-60d5-43ba-992e-4759d46921345 ProviderName=SqlServerState GrainType=Grains.SomeGrain GrainId=980726 ETag=46.
Timers execute interleaved. Not sure about streams.
I also want to add that this issue has been discussed before. #2565
A possible workaround via self-invocation for timers was posted here: #2574
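Roughly, the workaround is to do no state work in the timer callback itself; instead the callback re-enters the grain through its own grain reference, so the actual work runs as a regular, non-interleaving turn. A minimal sketch of the idea (illustrative names, not the exact code from #2574):

using System;
using System.Threading.Tasks;
using Orleans;

public interface ITickingGrain : IGrainWithGuidKey
{
    Task Tick();
}

public class TickingState
{
    public int Counter { get; set; }
}

public class TickingGrain : Grain<TickingState>, ITickingGrain
{
    public override Task OnActivateAsync()
    {
        // The timer callback interleaves, so it must not touch state directly;
        // it re-enters the grain via a grain reference, and that call is
        // queued like any other non-reentrant request.
        RegisterTimer(
            _ => this.AsReference<ITickingGrain>().Tick(),
            null,
            TimeSpan.FromSeconds(5),
            TimeSpan.FromSeconds(5));
        return base.OnActivateAsync();
    }

    public async Task Tick()
    {
        // Runs as a normal turn, so it cannot interleave with other
        // WriteStateAsync calls in this grain.
        State.Counter++;
        await WriteStateAsync();
    }
}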
Interesting, that definitely solves one of my issues, thanks. I would think stream messages are queued like normal; if not, I have some refactoring to do.
I would think stream messages are queued like normal
They are. Timers are the only interleaving bug that has become a feature that some people now depend on.
Then I'm not sure what could go wrong. What could cause this exception for a grain functioning roughly like the one below? Is it always due to double activations?
[StorageProvider(ProviderName = "SqlServerState")]
[ImplicitStreamSubscription("Namespace")]
[ImplicitStreamSubscription("Namespace2")]
[ImplicitStreamSubscription("Namespace3")]
[ImplicitStreamSubscription("Namespace4")]
[ImplicitStreamSubscription("Namespace5")]
public class SomeGrain : Grain<SomeGrainState>, ISomeGrain
{
    public override async Task OnActivateAsync()
    {
        var sp = GetStreamProvider("SMSProvider");
        var key = this.GetPrimaryKey();
        await Task.WhenAll(
            sp.GetStream<Thing>(key, "Namespace").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing2>(key, "Namespace2").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing3>(key, "Namespace3").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing4>(key, "Namespace4").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing5>(key, "Namespace5").SubscribeAsync((msg, _) => DoStuff())
        );
    }

    public async Task DoStuff()
    {
        try
        {
            ....
            await Task.WhenAll(someOtherStream.OnNextAsync(...), WriteStateAsync()); // throws InconsistentStateException
        }
        catch (Exception ex)
        {
            // Log exception
        }
    }
}
Hmm, looking at it now I'm wondering: since we pass delegates which call DoStuff in that way, do I get interleaving here after all? There's a bit of work awaiting db calls, other grain calls and such in the .... inside the try/catch block.
It does look like you'll get multiple WriteStateAsync() calls interleaving because they are at the tails of fan-out calls that get joined via Task.WhenAll().
What, does the order of the calls in Task.WhenAll matter? (SomeGrain does not subscribe to someOtherStream if that's what you mean) How do I make DoStuff not interleave?
Sorry, I think I might have misread the code. If the grain is non-reentrant (the default), then an InconsistentStateException could only be thrown if there's more than one WriteStateAsync() call in this try block that can interleave.
try {
....
await Task.WhenAll(someOtherStream.OnNextAsync(...), WriteStateAsync()); // throws InconsistentStateException
}
Unless, of course, I'm missing something here.
That's the thing, WriteStateAsync is only called right there, and the grain is non-reentrant
Hmm... that's puzzling. I wonder if @jason-bragg or @xiazen have some thoughts.
To answer your original question (one of them):
so the question is: what do you do when that happens? Do you have to catch it and call DeactivateOnIdle? Do you call ReadStateAsync and lose that data?
If you let InconsistentStateException escape, the grain should get automatically deactivated. That's the simplest solution, essentially equivalent to calling DeactivateOnIdle.
Alternatively, you can choose to refresh state via ReadStateAsync. Just make sure there aren't any private variables that would conflict with that state.
If you're catching the InconsistentStateException and not rethrowing it, the grain will not deactivate, so it will stay in an inconsistent state forever. You either need to reload the state or kill the grain on InconsistentStateException. Otherwise, the pattern looks fine, as far as I can tell.
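To make both options concrete, here's a minimal sketch of the write path inside a Grain<TState> method (assuming it's acceptable to drop the conflicting write):

try
{
    await WriteStateAsync();
}
catch (Orleans.Storage.InconsistentStateException)
{
    // Option 1: refresh the in-memory state (and its ETag) from storage,
    // then decide whether to re-apply the change and write again.
    await ReadStateAsync();

    // Option 2, instead of reloading: give up this activation so the next
    // request gets a fresh one with a current ETag.
    // DeactivateOnIdle();
}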
What stream provider are you using? It shouldn't affect the WriteStateAsync behavior but different stream providers support different recoverability patterns.
The SMSProvider currently - what is the recommended persistent stream these days?
That's kind of what I thought in terms of deactivate or read state, but then I still don't know why it happens in the first place, and the state in that grain is pretty important. It's supposed to hold and store some last known numerical values; history isn't important. I'd be tempted to just override the ETag stuff, but that's probably not possible in the provider as is.
What is the recommended persistent stream these days?
The Event Hub stream provider is the most widely used. If recoverability is desired, but data loss is acceptable during silo crashes, the memory stream provider could be considered as well.
I still don't know why it happens in the first place
If it happens before the silo joins the cluster it's more likely that duplicate activations will occur. When is the bootstrap logic called in the lifecycle?
Some services have used the subscription manager to set up subscriptions without activating the grain, allowing activity on the stream to activate the instance. Using implicit subscriptions would also allow such behavior.
Taking a second look at the activation logic, subscriptions are created on every activation, which could lead to duplicate subscriptions. I'd advocate checking for existing subscriptions prior to creating new ones. If subscriptions already exist, resuming processing of the stream from the existing subscriptions is recommended, or at minimum removing the old subscriptions.
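For explicit subscriptions, that check could look roughly like this (a sketch; the provider, namespace, and types are the illustrative ones from above):

// On activation: resume existing subscriptions instead of creating duplicates.
var stream = GetStreamProvider("SMSProvider")
    .GetStream<Thing>(this.GetPrimaryKey(), "Namespace");

var handles = await stream.GetAllSubscriptionHandles();
if (handles.Count > 0)
{
    // Re-attach this activation's observer to each existing subscription.
    foreach (var handle in handles)
        await handle.ResumeAsync((msg, token) => DoStuff());
}
else
{
    await stream.SubscribeAsync((msg, token) => DoStuff());
}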
What storage provider are you using for your pubsub state?
Are streams being generated or consumed from orleans clients, or is this all silo-to-silo streaming?
@martinothamar I would be curious to know the reason too. Can you estimate the update frequency?
Regardless of the reason, it seems the grain will be stuck once it gets a wrong state version, and that can happen in some rare situations. Would it be more robust to deactivate or reload in a problem situation in any case?
@jason-bragg
The Event Hub stream provider is the most widely used. If recoverability is desired, but data loss is acceptable during silo crashes, the memory stream provider could be considered as well.
Thanks! Are there any examples on how to configure the Memory stream provider? I tried looking once but didn't find it.
If it happens before the silo joins the cluster it's more likely that duplicate activations will occur.
I thought of this too, but this happened when the cluster had already been up for hours; I'm pretty sure the activation was relatively new (minutes).
I'll refactor to resume existing subscriptions.
Using memory grain storage for the stream provider with
sOpts.FireAndForgetDelivery = true;
sOpts.OptimizeForImmutableData = true;
It's all silo-to-silo communication.
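For reference, that corresponds roughly to this silo configuration (a sketch trimmed to the stream-related parts, using the Orleans 2.x hosting API):

var silo = new SiloHostBuilder()
    // SMS stream provider with the options shown above.
    .AddSimpleMessageStreamProvider("SMSProvider", sOpts =>
    {
        sOpts.FireAndForgetDelivery = true;
        sOpts.OptimizeForImmutableData = true;
    })
    // PubSub state kept in memory grain storage.
    .AddMemoryGrainStorage("PubSubStore")
    .Build();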
@veikkoeeva
This grain has to do with sports events (think football/soccer), so it is called through streams every time something happens in a sporting event.
Updates can therefore arrive less than a second apart at times - every 5 seconds, for example, is typical.
I'm not sure yet whether deactivating or reloading would be best. Maybe read, set the values, and then write again - I know this can be a little unsafe.
Are there any examples on how to configure the Memory stream provider?
No examples. :/ Only test code. https://github.com/dotnet/orleans/blob/master/test/Tester/StreamingTests/MemoryStreamProviderClientTests.cs
The memory streams were developed for dev/test use, allowing developers of persistent streams to run and test their logic locally without each developer having to have their own configured persistent queue (which can get cumbersome and expensive). I am unaware of any team using them in production, but there is no reason they can't be used in production, assuming one understands that all the events are stored in memory so data loss can occur during silo failure.
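Going by that test code, configuration looks roughly like this (a sketch, not from official docs; the provider name is illustrative):

var silo = new SiloHostBuilder()
    // Memory streams take a message-body serializer type parameter and
    // need a storage provider for the pubsub state.
    .AddMemoryStreams<DefaultMemoryMessageBodySerializer>("MemoryStreamProvider")
    .AddMemoryGrainStorage("PubSubStore")
    .Build();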
Recoverability
For some insights into recovering from failures using recoverable streams, you may want to check out our stream recoverability test harness and some of the error faults generated in the grains. All of the cases we cover in the tests come from cases found by teams running production services where recovery was needed (using EventHub).
Test Harness:
https://github.com/dotnet/orleans/blob/master/test/Tester/StreamingTests/ImplicitSubscritionRecoverableStreamTestRunner.cs
EventHub tests using the stream recoverability test harness:
https://github.com/dotnet/orleans/blob/master/test/Extensions/ServiceBus.Tests/Streaming/EHImplicitSubscriptionStreamRecoveryTests.cs
Grains simulating recovery error cases:
https://github.com/dotnet/orleans/blob/master/test/Grains/TestGrains/ImplicitSubscription_NonTransientError_RecoverableStream_CollectorGrain.cs
https://github.com/dotnet/orleans/blob/master/test/Grains/TestGrains/ImplicitSubscription_RecoverableStream_CollectorGrain.cs
https://github.com/dotnet/orleans/blob/master/test/Grains/TestGrains/ImplicitSubscription_TransientError_RecoverableStream_CollectorGrain.cs
The above grains and tests use implicit subscriptions, so they may not always check for existing subscriptions prior to subscribing (they should, but may not), but you'll need to if you're using explicit subscriptions. The error cases may help you think through some possible error scenarios prior to going into production, especially those related to deduping events.
For recoverable streams, events are delivered from (inclusive) the sequence token provided during subscribe (or resume). Any error returned by OnNext will trigger redelivery of the same event until the operation succeeds or the retry logic gives up, in which case OnError will receive an error telling the grain that the event could not be delivered. The event will then be skipped and the stream will move to the next event, unless the grain takes some action (like subscribing to an earlier point, unsubscribing, .. ?). This allows a grain to accumulate results for some period, persisting the accumulated state and the sequence token periodically (rather than on each call). This is possible because any error which causes a deactivation of the grain will rewind the stream to the previous token, allowing reprocessing of the events since the last checkpoint.
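A rough sketch of that accumulate-and-checkpoint pattern (types and names are illustrative; assumes the provider's sequence tokens can be persisted in grain state):

using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

public interface ICollectorGrain : IGrainWithGuidKey
{
}

public class CollectorState
{
    public int Sum { get; set; }
    public StreamSequenceToken LastToken { get; set; }
}

[ImplicitStreamSubscription("Namespace")]
public class CollectorGrain : Grain<CollectorState>, ICollectorGrain
{
    private int _sinceCheckpoint;

    public override async Task OnActivateAsync()
    {
        var stream = GetStreamProvider("EventHubProvider")
            .GetStream<int>(this.GetPrimaryKey(), "Namespace");

        // Resume from the last checkpointed token; null means "from now".
        await stream.SubscribeAsync(OnNextAsync, OnErrorAsync, State.LastToken);
    }

    private async Task OnNextAsync(int item, StreamSequenceToken token)
    {
        State.Sum += item;
        State.LastToken = token;

        // Checkpoint every 100 events rather than on each call; an error
        // that deactivates the grain rewinds the stream to the token that
        // was last persisted, so events since then are redelivered.
        if (++_sinceCheckpoint >= 100)
        {
            await WriteStateAsync();
            _sinceCheckpoint = 0;
        }
    }

    private Task OnErrorAsync(Exception ex) => Task.CompletedTask;
}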
@martinothamar, ack.. sometimes my brain just doesn't work.. you're using implicit subscriptions. For some reason I got it into my head that you were using explicit subscriptions. Please ignore my comments regarding resuming from existing subscriptions and the subscription manager. Implicit subscriptions always have only a single subscription, so your code is fine. I apologize for any confusion. :/