We've been getting this exception for a while in various grains, particularly one that is activated in a StartupTask and should never be deactivated. If I've understood correctly, this can happen when there are duplicate activations of a grain (perhaps during deployments), so the question is: what do you do when that happens? Do you have to catch it and call DeactivateOnIdle? Do you call ReadStateAsync and lose that data? We had a grain that, for the lifetime of the activation, could never again write to state, so I'm wondering how you recover from it. I couldn't find info on this in the docs, but maybe I missed it.
@martinothamar Just to be clear, was the situation such that it started upon the first failure and then kept repeating, or would you like to avoid this situation in this particular grain that should never be deactivated? I'm afraid grains are automatically deactivated upon InconsistentStateException, see https://github.com/dotnet/orleans/issues/1609.
Just in case, can you share the database vendor? This exception means there was neither an UPDATE nor an INSERT, which means, if there's state already, that the WHERE condition failed for one reason or another. A version (ETag) violation is likely the most common reason.
Let's page @sergeybykov too. I haven't in fact had problems with this and hence haven't paid attention. I should. :)
Yes, it kept repeating. The thing is that I already had a try/catch around the block in which I call WriteStateAsync, so the activation is never deactivated. I just log the error and move on (the rest of the call chain moves on). Wondering what the best practice is here.
Database vendor is MSSQL
@martinothamar Did you try to re-read the state to have the version updated?
No, right now it just logs and forgets that error - I wasn't expecting to run into this as often as I do. Is there any way to find the root cause of how I get into that state?
If there's a duplicate activation, how can I find out and handle it? What other causes are there?
Recently I had to deal with this issue as well.
For me the trouble came from interleaving code in stateful grains.
In particular, Orleans timers can be a source of this problem.
As soon as two calls end up in WriteStateAsync() at the same time, there's a good chance this error will happen.
In my opinion, ETags should prevent different activations from writing to the same state, but the current design doesn't even allow the same activation to have two writes in flight at once.
Right, one of the grains I've experienced this the most with is in fact triggered by a timer quite often.
Another grain I'm now experiencing this with subscribes to multiple streams (5+) and will write to state upon every message from every stream (maybe this is wrong).
Wouldn't expect interleaving in either case though?
Stacktrace:
Exc level 0: Orleans.Storage.InconsistentStateException: Version conflict (WriteState): ServiceId=68e89d8e-60d5-43ba-992e-4759d46921345 ProviderName=SqlServerState GrainType=Grains.SomeGrain GrainId=980726 ETag=46.
Timers execute interleaved. Not sure about streams.
I also want to add that this issue has been discussed before. #2565
A possible workaround via self-invocation for timers was posted here: #2574
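Roughly, the workaround is to do no state work in the timer callback itself; instead the callback re-enters the grain through its own grain reference, so the actual work runs as a regular, non-interleaving turn. A minimal sketch of the idea (illustrative names, not the exact code from #2574):

using System;
using System.Threading.Tasks;
using Orleans;

public interface ITickingGrain : IGrainWithGuidKey
{
    Task Tick();
}

public class TickingState
{
    public int Counter { get; set; }
}

public class TickingGrain : Grain<TickingState>, ITickingGrain
{
    public override Task OnActivateAsync()
    {
        // The timer callback interleaves, so it must not touch state directly;
        // it re-enters the grain via a grain reference, and that call is
        // queued like any other non-reentrant request.
        RegisterTimer(
            _ => this.AsReference<ITickingGrain>().Tick(),
            null,
            TimeSpan.FromSeconds(5),
            TimeSpan.FromSeconds(5));
        return base.OnActivateAsync();
    }

    public async Task Tick()
    {
        // Runs as a normal turn, so it cannot interleave with other
        // WriteStateAsync calls in this grain.
        State.Counter++;
        await WriteStateAsync();
    }
}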
Interesting, that definitely solves one of my issues, thanks. I would think stream messages are queued like normal; if not, I have some refactoring to do.
I would think stream messages are queued like normal
They are. Timers are the only interleaving bug that has become a feature that some people now depend on.
Then I'm not sure what could go wrong. What could cause this exception for a grain functioning roughly like the one below? Is it always due to double activations?
[StorageProvider(ProviderName = "SqlServerState")]
[ImplicitStreamSubscription("Namespace")]
[ImplicitStreamSubscription("Namespace2")]
[ImplicitStreamSubscription("Namespace3")]
[ImplicitStreamSubscription("Namespace4")]
[ImplicitStreamSubscription("Namespace5")]
public class SomeGrain : Grain<SomeGrainState>, ISomeGrain
{
    public override async Task OnActivateAsync()
    {
        var sp = GetStreamProvider("SMSProvider");
        var key = this.GetPrimaryKey();
        await Task.WhenAll(
            sp.GetStream<Thing>(key, "Namespace").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing2>(key, "Namespace2").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing3>(key, "Namespace3").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing4>(key, "Namespace4").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing5>(key, "Namespace5").SubscribeAsync((msg, _) => DoStuff())
        );
    }

    public async Task DoStuff()
    {
        try
        {
            ....
            await Task.WhenAll(someOtherStream.OnNextAsync(...), WriteStateAsync()); // throws InconsistentStateException
        }
        catch (Exception ex)
        {
            // Log exception
        }
    }
}
Hmm, looking at it now I'm wondering: since we pass delegates which call DoStuff in that way, do I get interleaving here after all? There's a bit of work awaiting db calls, other grain calls and such in the .... inside the try/catch block.
It does look like you'll get multiple WriteStateAsync() calls interleaving because they are at the tails of fan-out calls that get joined via Task.WhenAll().
What, does the order of the calls in Task.WhenAll matter? (SomeGrain does not subscribe to someOtherStream if that's what you mean) How do I make DoStuff not interleave?
Sorry, I think I might have misread the code. If the grain is non-reentrant (the default), then an InconsistentStateException could only be thrown if there's more than one WriteStateAsync() call in this try block that can interleave.
try {
....
await Task.WhenAll(someOtherStream.OnNextAsync(...), WriteStateAsync()); // throws InconsistentStateException
}
Unless, of course, I'm missing something here.
That's the thing, WriteStateAsync is only called right there, and the grain is non-reentrant
Hmm... that's puzzling. I wonder if @jason-bragg or @xiazen have some thoughts.
To answer your original question (one of them):
so the question is: what do you do when that happens? Do you have to catch it and call DeactivateOnIdle? Do you call ReadStateAsync and lose that data?
If you let InconsistentStateException escape, the grain should get automatically deactivated. That's the simplest solution, essentially equivalent to calling DeactivateOnIdle.
Alternatively, you can choose to refresh state via ReadStateAsync. Just make sure there aren't any private variables that would conflict with that state.
If you're catching the InconsistentStateException and not rethrowing it, the grain will not deactivate, so it will stay in an inconsistent state forever. You either need to reload the state or kill the grain on InconsistentStateException. Otherwise, the pattern looks fine, as far as I can tell.
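To make both options concrete, here's a minimal sketch of the write path inside a Grain<TState> method (assuming it's acceptable to drop the conflicting write):

try
{
    await WriteStateAsync();
}
catch (Orleans.Storage.InconsistentStateException)
{
    // Option 1: refresh the in-memory state (and its ETag) from storage,
    // then decide whether to re-apply the change and write again.
    await ReadStateAsync();

    // Option 2, instead of reloading: give up this activation so the next
    // request gets a fresh one with a current ETag.
    // DeactivateOnIdle();
}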
What stream provider are you using? It shouldn't affect the WriteStateAsync behavior but different stream providers support different recoverability patterns.
The SMSProvider currently - what is the recommended persistent stream these days?
That's kind of what I thought in terms of deactivate or read state, but then I still don't know why it happens in the first place, and the state in that grain is pretty important. It's supposed to hold and store some last known numerical values; history isn't important. I'd be tempted to just override the ETag stuff, but that's probably not possible in the provider as is.
What is the recommended persistent stream these days?
The Event Hub stream provider is the most widely used. If recoverability is desired, but data loss is acceptable during silo crashes, the memory stream provider could be considered as well.
I still don't know why it happens in the first place
If it happens before the silo joins the cluster it's more likely that duplicate activations will occur. When is the bootstrap logic called in the lifecycle?
Some services have used the subscription manager to set up subscriptions without activating the grain, allowing activity on the stream to activate the instance. Using implicit subscriptions would also allow such behavior.
Taking a second look at the activation logic, subscriptions are created on every activation, which could lead to duplicate subscriptions. I'd advocate checking for existing subscriptions prior to creating new ones. If subscriptions already exist, resuming processing of the stream from the existing subscriptions is recommended, or at minimum removing the old subscriptions.
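For explicit subscriptions, that check could look roughly like this (a sketch; the provider, namespace, and types are the illustrative ones from above):

// On activation: resume existing subscriptions instead of creating duplicates.
var stream = GetStreamProvider("SMSProvider")
    .GetStream<Thing>(this.GetPrimaryKey(), "Namespace");

var handles = await stream.GetAllSubscriptionHandles();
if (handles.Count > 0)
{
    // Re-attach this activation's observer to each existing subscription.
    foreach (var handle in handles)
        await handle.ResumeAsync((msg, token) => DoStuff());
}
else
{
    await stream.SubscribeAsync((msg, token) => DoStuff());
}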
What storage provider are you using for your pubsub state?
Are streams being generated or consumed from orleans clients, or is this all silo-to-silo streaming?
@martinothamar I would be curious to know the reason too. Can you estimate the update frequency?
Regardless of the reason, it seems the grain will be stuck once it gets a wrong state version, and that can happen in some rare situations. Would it be more robust to deactivate or reload in a problem situation in any case?
@jason-bragg
The Event Hub stream provider is the most widely used. If recoverability is desired, but data loss is acceptable during silo crashes, the memory stream provider could be considered as well.
Thanks! Are there any examples on how to configure the Memory stream provider? I tried looking once but didn't find it.
If it happens before the silo joins the cluster it's more likely that duplicate activations will occur.
I thought of this too, but this happened when the cluster had already been up for hours; I'm pretty sure the activation was relatively new (minutes).
I'll refactor to resume existing subscriptions.
Using memory grain storage for the stream provider with
sOpts.FireAndForgetDelivery = true;
sOpts.OptimizeForImmutableData = true;
It's all silo-to-silo communication.
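For reference, that corresponds roughly to this silo configuration (a sketch trimmed to the stream-related parts, using the Orleans 2.x hosting API):

var silo = new SiloHostBuilder()
    // SMS stream provider with the options shown above.
    .AddSimpleMessageStreamProvider("SMSProvider", sOpts =>
    {
        sOpts.FireAndForgetDelivery = true;
        sOpts.OptimizeForImmutableData = true;
    })
    // PubSub state kept in memory grain storage.
    .AddMemoryGrainStorage("PubSubStore")
    .Build();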
@veikkoeeva
This grain has to do with sports events (think football/soccer), so it is called through streams every time something happens in a sporting event.
Updates can therefore arrive less than a second apart at times - every 5 seconds, for example, is typical.
I'm not sure yet whether deactivating or reloading would be best. Maybe read, set the values, and then write again - I know this can be a little unsafe.
Are there any examples on how to configure the Memory stream provider?
No examples. :/ Only test code. https://github.com/dotnet/orleans/blob/master/test/Tester/StreamingTests/MemoryStreamProviderClientTests.cs
The memory streams were developed for dev/test use, allowing developers of persistent streams to run and test their logic locally without each developer having to have their own configured persistent queue (which can get cumbersome and expensive). I am unaware of any team using them in production, but there is no reason they can't be used in production, assuming one understands that all the events are stored in memory so data loss can occur during silo failure.
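Going by that test code, configuration looks roughly like this (a sketch, not from official docs; the provider name is illustrative):

var silo = new SiloHostBuilder()
    // Memory streams take a message-body serializer type parameter and
    // need a storage provider for the pubsub state.
    .AddMemoryStreams<DefaultMemoryMessageBodySerializer>("MemoryStreamProvider")
    .AddMemoryGrainStorage("PubSubStore")
    .Build();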
Recoverability
For some insights into recovering from failures using recoverable streams, you may want to check out our stream recoverability test harness and some of the error faults generated in the grains. All of the cases we cover in the tests come from cases found by teams running production services where recovery was needed (using EventHub).
Test Harness:
https://github.com/dotnet/orleans/blob/master/test/Tester/StreamingTests/ImplicitSubscritionRecoverableStreamTestRunner.cs
EventHub tests using the stream recoverability test harness:
https://github.com/dotnet/orleans/blob/master/test/Extensions/ServiceBus.Tests/Streaming/EHImplicitSubscriptionStreamRecoveryTests.cs
Grains simulating recovery error cases:
https://github.com/dotnet/orleans/blob/master/test/Grains/TestGrains/ImplicitSubscription_NonTransientError_RecoverableStream_CollectorGrain.cs
https://github.com/dotnet/orleans/blob/master/test/Grains/TestGrains/ImplicitSubscription_RecoverableStream_CollectorGrain.cs
https://github.com/dotnet/orleans/blob/master/test/Grains/TestGrains/ImplicitSubscription_TransientError_RecoverableStream_CollectorGrain.cs
The above grains and tests use implicit subscriptions, so they may not always check for existing subscriptions prior to subscribing (they should, but may not), but you'll need to if you're using explicit subscriptions. The error cases may help you think through some possible error scenarios prior to going into production, especially those related to deduping events.
For recoverable streams, events are delivered from (inclusive) the sequence token provided during subscribe (or resume). Any error returned by OnNext will trigger redelivery of the same event until the operation succeeds or the retry logic gives up, in which case OnError will receive an error telling the grain that the event could not be delivered. The event will then be skipped and the stream will move to the next event, unless the grain takes some action (like subscribing to an earlier point, unsubscribing, .. ?). This allows a grain to accumulate results for some period, persisting the accumulated state and the sequence token periodically (rather than on each call). This is possible because any error which causes a deactivation of the grain will rewind the stream to the previous token, allowing reprocessing of the events since the last checkpoint.
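A rough sketch of that accumulate-and-checkpoint pattern (types and names are illustrative; assumes the provider's sequence tokens can be persisted in grain state):

using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

public interface ICollectorGrain : IGrainWithGuidKey
{
}

public class CollectorState
{
    public int Sum { get; set; }
    public StreamSequenceToken LastToken { get; set; }
}

[ImplicitStreamSubscription("Namespace")]
public class CollectorGrain : Grain<CollectorState>, ICollectorGrain
{
    private int _sinceCheckpoint;

    public override async Task OnActivateAsync()
    {
        var stream = GetStreamProvider("EventHubProvider")
            .GetStream<int>(this.GetPrimaryKey(), "Namespace");

        // Resume from the last checkpointed token; null means "from now".
        await stream.SubscribeAsync(OnNextAsync, OnErrorAsync, State.LastToken);
    }

    private async Task OnNextAsync(int item, StreamSequenceToken token)
    {
        State.Sum += item;
        State.LastToken = token;

        // Checkpoint every 100 events rather than on each call; an error
        // that deactivates the grain rewinds the stream to the token that
        // was last persisted, so events since then are redelivered.
        if (++_sinceCheckpoint >= 100)
        {
            await WriteStateAsync();
            _sinceCheckpoint = 0;
        }
    }

    private Task OnErrorAsync(Exception ex) => Task.CompletedTask;
}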
@martinothamar, ack.. sometimes my brain just doesn't work.. you're using implicit subscriptions. For some reason I got it into my head that you were using explicit subscriptions. Please ignore my comments regarding resuming from existing subscriptions and the subscription manager. Implicit subscriptions always have only a single subscription, so your code is fine. I apologize for any confusion. :/