Every time we redeploy silos, clients cannot resubscribe to the stream, so we're forced to redeploy them as well. We do resubscribe once we receive the ClusterConnectionLost event (we retry infinitely). The weird thing is that the client actually reconnects and can communicate with actors; it's only the streams that are broken.
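Roughly, the resubscribe loop looks like this (a simplified sketch with illustrative names and placeholder types, not the actual HubClient code):
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

// Simplified sketch only - illustrative names, not the actual HubClient code.
class HubClientSketch
{
    private readonly IClusterClient client;

    public HubClientSketch(IClusterClient client) => this.client = client;

    // Called from the ClusterConnectionLost handler; retries forever.
    public async Task Resubscribe(Guid streamId)
    {
        var stream = client.GetStreamProvider("sms") // provider name is a placeholder
                           .GetStream<string>(streamId, "udp_theme.task.notifications");
        while (true)
        {
            try
            {
                await stream.SubscribeAsync((item, token) =>
                {
                    Console.WriteLine($"Received {item}");
                    return Task.CompletedTask;
                });
                return; // subscribed successfully
            }
            catch (Exception ex) // in practice this is the TimeoutException shown below
            {
                Console.WriteLine($"Resubscribe failed: {ex.Message}");
                await Task.Delay(TimeSpan.FromSeconds(5));
            }
        }
    }
}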
This is what we see in the logs:
System.TimeoutException: Response did not arrive on time in 00:00:10 for message: Request *cli/e1fa0799@64d32ea2->*grn/716E8E94/00000000+udp_theme.task.notifications #6691: global::Orleans.Streams.IPubSubRendezvousGrain:RegisterConsumer(). Target History is: <*grn/716E8E94/00000000+udp_theme.task.notifications:>.
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Orleans.Streams.StreamConsumer`1.<SubscribeAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Orleankka.StreamRef.<Subscribe>d__9.MoveNext() in C:\Work\OSS\Orleankka\Source\Orleankka\StreamRef.cs:line 70
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Youscan.Web.Api.HubClient.<DoSubscribe>d__11.MoveNext() in D:\TeamCity\buildAgent02\work\58c475d9a27a23a9\Source\Web.Api\HubClient.cs:line 143
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Youscan.Web.Api.HubClient.<Resubscribe>d__10.MoveNext() in D:\TeamCity\buildAgent02\work\58c475d9a27a23a9\Source\Web.Api\HubClient.cs:line 127
Maybe you have an integration test for this scenario somewhere?
what version of Orleans?
@jason-bragg 1.5.1
@yevhen - see also #2226.
After the connection to the cluster is lost, clients will automatically re-establish their connections to the cluster, but won't resubscribe to streams.
If you don't manage to resubscribe at all, and it's the SMSProvider, and your client also produces events - then it may also be related to #982.
@shlomiw ye, I'm aware of this. I'm trying to re-subscribe to the stream once the connection to the cluster is lost, but it fails with a timeout. It's an SMS stream.
@yevhen - does your client produce events to the SMS stream? Do you gracefully shut down when restarting?
If your client does produce events, and you restart the client without shutting it down gracefully, then the PubSubRendezvousGrain will still hold the 'old' producer (the client before the restart) and try to notify it about the new subscriber (the new client). In this case the PubSubRendezvousGrain will hang and you'll get a timeout... this is what I saw (see #982).
@shlomiw no, the client only receives events from the stream; events are produced by grains. I don't shut down the client at all. I do shut down the server (sometimes gracefully, sometimes not), and then alive clients cannot resubscribe to the stream.
Oh sorry, I see. Thanks, will keep track. Reliability of the SMS streams is a major concern to us.
@yevhen
I'd like to make sure I understand your scenario:
You're using Orleans 1.5.1.
You're using SMS streams.
I assume you're using explicit subscriptions that are persisted to a storage provider (not in-memory).
What membership service is configured?
On redeploy of cluster, clients get ClusterConnectionLost notification, and can eventually send grain calls, but can't re-subscribe to streams? All subscribe calls timeout? Even with retries. Is this correct?
You're using Orleans 1.5.1.
Yes
You're using SMS streams.
Yes
I assume you're using explicit subscriptions that are persisted to a storage provider (not in-memory).
No. In-memory only. I know that it won't survive silo death, but that hasn't been a problem. The problem is on the opposite side.
What membership service is configured?
AzureTable
On redeploy of cluster, clients get ClusterConnectionLost notification, and can eventually send grain calls, but can't re-subscribe to streams?
Yes. Yes.
All subscribe calls timeout? Even with retries. Is this correct?
Yes. The problem is that it fails at random. I can do 3 deploys of the server in a row and everything will be fine, but then on the 4th the client will fail and won't resubscribe, forcing us to redeploy the client as well.
I just want to clarify that the client is an outside-silo client, not a grain inside a silo.
We tried with an AzureTable PubSubStore and with other providers (AQP2) - no luck. We're periodically pushing a test message to this stream (from the silo) and log transmission/receipt on both sides. Everything works as expected unless we redeploy silos without redeploying clients too.
Even if the client re-subscribes successfully, it then doesn't receive the ping message. That's the mystery. We thought it could be a serialization issue - but there are no warnings/errors, and without changing anything, if we redeploy the client it begins to work as expected.
At that point I gave up on streams and returned back to observers. I hope the issue will be reproduced and fixed soon so I can try streams again.
Jason, why aren't we resubscribing the client subscribers automatically? Without the client code needing to do anything?
@yevhen
Sorry for the delayed reply. I tried to repro this with a simple single client/silo setup, but couldn't reproduce it.
I'll try again with multi-client/multi-silo when time allows.
@gabikliot
Automatic re-subscribe is non-trivial and we're (slowly) adding the features needed to support this, like cluster connection lost events (and cluster connected events). At this time I'm trying to identify and reproduce an issue with the currently expected behavior. The advanced scenarios like automatic re-subscribe won't work if these foundational systems are not solid.
@jason-bragg could you pls share the gist of the code you are using when resubscribing (how you handle the cluster connection lost event)? Or just share a repo with a test app.
Sure.
This is super primitive.. but here goes :)
using Orleans 1.5.1
I run the client and silo from the console in different windows. When I kill the silo, the client writes errors to the console while trying to reconnect. Once the silo is restarted, the client reconnects (without a restart) and streaming begins again. I can kill the silo as often as I want and the behavior never seems to change.
Client
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime.Configuration;
using Orleans.Streams;

namespace StreamingClient
{
    class Program
    {
        private static bool connected;

        static void Main(string[] args)
        {
            // Then configure and connect a client.
            var clientConfig = ClientConfiguration.LocalhostSilo();
            clientConfig.DeploymentId = "blarg";
            clientConfig.GatewayProvider = ClientConfiguration.GatewayProviderType.AzureTable;
            clientConfig.DataConnectionString = "UseYourOwn"; // replace with your own storage connection string
            clientConfig.AddSimpleMessageStreamProvider("SMS");
            var client = new ClientBuilder()
                .UseConfiguration(clientConfig)
                .AddClusterConnectionLostHandler(ConnectionToClusterLostHandler)
                .Build();
            Run(client).Wait();
        }

        private static async Task Run(IClusterClient client)
        {
            await client.Connect();
            while (true) await Stream(client);
        }

        private static async Task Stream(IClusterClient client)
        {
            connected = true;
            var stream = client.GetStreamProvider("SMS").GetStream<int>(new Guid(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), "Blarg");

            // Retry subscribing until it succeeds or the connection is lost again.
            StreamSubscriptionHandle<int> handle = null;
            do
            {
                try
                {
                    handle = await stream.SubscribeAsync(OnNext);
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"Subscribe failed with error: {ex}.");
                    handle = null;
                    await Task.Delay(100); // prevent free spin
                }
            } while (connected && handle == null);
            Console.WriteLine("Subscribed to stream.");

            // Produce events until the connection is lost.
            int i = 0;
            while (connected)
            {
                await Task.Delay(100);
                try
                {
                    await stream.OnNextAsync(i++);
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"Send event failed with error: {ex}.");
                }
            }
        }

        private static void ConnectionToClusterLostHandler(object sender, EventArgs e)
        {
            Console.WriteLine("Lost connection to cluster.");
            connected = false;
        }

        private static Task OnNext(int i, StreamSequenceToken token)
        {
            Console.WriteLine($"Received {i}.");
            return Task.CompletedTask;
        }
    }
}
Silo
using System;
using Orleans.Runtime.Configuration;
using Orleans.Runtime.Host;

namespace StreamingSiloHost
{
    class Program
    {
        static void Main(string[] args)
        {
            var siloConfig = ClusterConfiguration.LocalhostPrimarySilo();
            siloConfig.Globals.DeploymentId = "blarg";
            siloConfig.Globals.LivenessEnabled = true;
            siloConfig.Globals.LivenessType = GlobalConfiguration.LivenessProviderType.AzureTable;
            siloConfig.Globals.DataConnectionString = "UseYourOwn"; // replace with your own storage connection string
            siloConfig.AddSimpleMessageStreamProvider("SMS");
            siloConfig.AddMemoryStorageProvider("PubSubStore");
            //siloConfig.Globals.ClientDropTimeout = TimeSpan.FromSeconds(5);

            var silo = new SiloHost("StreamingSilo", siloConfig);
            silo.InitializeOrleansSilo();
            silo.StartOrleansSilo();
            Console.WriteLine("Silo started.");
            Console.WriteLine("\nPress Enter to terminate...");
            Console.ReadLine();

            // Shut down
            silo.ShutdownOrleansSilo();
        }
    }
}
_Differences from your service pattern._
Producer is the client, not a grain. I need to create a grain producer to mimic your service pattern.
Single silo and client. I need at least two silos to mimic your service pattern.
Any other differences you see?
Trying to help Shlomi and Evgeny here.
Jason: in your repro you are not resubscribing from the ConnectionToClusterLostHandler, while they probably do. That could be one difference.
More importantly: that ConnectionToClusterLostHandler cannot be used to reliably detect cluster disconnections. Due to how it is implemented (races between various gateway lists on the client), the client may never receive this event (disconnects and reconnects happen in various orders). I am pretty sure I explained that when this feature was implemented. This event is useful for diagnostics, but not for reliable resubscription. You need to look at the whole problem holistically, end to end, and come up with an algorithm that makes sure we always resubscribe, no matter the order of silo disconnects and reconnects.
I have multiple ideas of how this could be done. Here is one:
1) The client caches all subscriptions.
2) Every time a client connects to a new gateway, after connecting to it, it resubscribes them all. To make this work correctly, you need to detect duplicate subscriptions in the PubSub and just ignore duplicates. I think this already happens now, since we have subscription ids and store them in a map.
So far you've got a 100% always-correct implementation, I think. Let's convince ourselves it indeed is. Let's say it is.
There is still a problem: this solution is not efficient, since you will be resubscribing the same stuff lots of times. It will work for a small number of subscriptions, but not for a large one.
So first you implement this and unblock customers (makes me very sad to read "At that point I gave up on streams and returned back to observers" and "Reliability of the SMS streams is a major concern to us").
Next we think about how to optimize this. We can optimize it pretty heavily, I think, and I have more ideas on that.
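A rough sketch of what (1) and (2) could look like from the client's side (illustrative only; it assumes the PubSub-side duplicate detection described above actually holds, which is exactly the part to verify):
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

// Illustrative only: cache every subscription on the client and replay them all
// after each (re)connect. Correctness relies on PubSub ignoring duplicate
// subscriptions, per the proposal above.
class ClientSubscriptionCache
{
    private readonly ConcurrentBag<(Guid StreamId, string Namespace, Func<string, StreamSequenceToken, Task> OnNext)> cache
        = new ConcurrentBag<(Guid, string, Func<string, StreamSequenceToken, Task>)>();

    // 1) the client caches all subscriptions
    public async Task Subscribe(IClusterClient client, Guid streamId, string ns,
                                Func<string, StreamSequenceToken, Task> onNext)
    {
        cache.Add((streamId, ns, onNext));
        await client.GetStreamProvider("SMS").GetStream<string>(streamId, ns).SubscribeAsync(onNext);
    }

    // 2) after (re)connecting to a gateway, resubscribe them all
    public async Task ResubscribeAll(IClusterClient client)
    {
        foreach (var (streamId, ns, onNext) in cache)
        {
            var stream = client.GetStreamProvider("SMS").GetStream<string>(streamId, ns);
            await stream.SubscribeAsync(onNext); // duplicates assumed to be ignored by PubSub
        }
    }
}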
I am surprised I still remember all that stuff.
Oops. hit close instead of comment. :/ Reopening
Trying to help Shlomi and Evgeny here.
Thank you. Very much appreciated.
not resubscribing from the ConnectionToClusterLostHandler
Fair point. I should use that. Though that doesn't seem to be at the heart of @yevhen's issue. What I'm trying to determine is: why is streaming failing once the client has reconnected?
I am surprised I still remember all that stuff.
I'm not :)
@gabikliot
Not resubscribing from the ConnectionToClusterLostHandler
I wrote that code last week and didn't really recall how it all worked, but it does use ConnectionToClusterLostHandler to trigger the re-subscribe logic. The handler sets connected to false, which breaks out of the send loop and triggers the re-subscribe logic.
@jason-bragg @gabikliot Could you guys please summarise whether or not ConnectionToClusterLostHandler can be reliably used as a handler for re-subscribing?
@gabikliot in your comment you're saying it can't. Could you please elaborate on why or share a link to where that was discussed?
every time a client connects to a new gateway
This sounds like a good solution, but I cannot see a way to know on the client when that happens. Would you be able to share a rough example of how that could be achieved?
This issue is hitting us in production very badly. I'm doing something similar to @yevhen:
Streams are configured with Azure Queue provider and PubSub storage is in Azure BLOB Storage.
Under some circumstances that I've been trying to recreate, but not very successfully, I run into the same issue: the client never receives a message from the stream and is continuously resubscribing. Interestingly, heartbeats start to pile up in the queue so that the queue is never drained.
What I've observed:
I've been trying to reproduce it in a local environment and was able to do it in 2 out of ~20 attempts. What I was doing:
There seems to be some kind of race-ish thing going on, i.e. it depends on at what point of silo initialisation re-subscription to an alive silo happens.
I wonder if this is a manifestation of #1069.
Interestingly, heartbeats start to pile up in the queue so that the queue is never drained.
This would seem to indicate that the pulling agent is either continually failing to deliver those events or there is no pulling agent running for that queue.
When this issue occurs, are you seeing errors in the logs, and if so, what are they?
@sergeybykov I agree looks very related, although I use Azure Table for Membership. I'd say it's closer to #1031 (referenced in #1069) as I can see similar errors in logs.
@jason-bragg I see a lot of stuff happening in logs, but not sure to what extent those errors are related to the issue. Please find below those I think are related:
There's usually a number of these in silo logs:
2018-02-04T12:13:52.3485426Z [Runtime.Messaging.IncomingMessageAgent/System] DEBUG production orleans-silo MessageFactory {LoggerType=Runtime, log4net:UserName=OS2PROD\admin, IPEndPoint=172.21.4.55:11111, log4net:Identity=, EventCode=100000, log4net:HostName=OS2PROD} - Creating Unrecoverable rejection with info 'SystemTarget PullingAgentSystemTarget/FF/79c6218e not active on this silo. Msg=Response S172.21.4.55:11111:255083799*cli/46896996@59e71d72->S172.21.4.55:11111:255083799PullingAgentSystemTarget/FF/79c6218e@S79c6218e #205: global::Orleans.Streams.IStreamConsumerExtension:GetSequenceToken()' for Orleans.Runtime.MessageFactory at:
at Orleans.Runtime.MessageFactory.CreateRejectionResponse(Message request, RejectionTypes type, String info, OrleansException ex)
at Orleans.Runtime.Messaging.IncomingMessageAgent.ReceiveMessage(Message msg)
at Orleans.Runtime.Messaging.IncomingMessageAgent.Run()
at Orleans.Runtime.AsynchAgent.AgentThreadProc(Object obj)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart(Object obj)
and these:
2018-02-04T12:13:52.3515470Z [Runtime.Messaging.IncomingMessageAgent/System] WARN production orleans-silo Runtime.Messaging.IncomingMessageAgent/System {LoggerType=Runtime, log4net:UserName=OS2PROD\admin, IPEndPoint=172.21.4.55:11111, log4net:Identity=, EventCode=101009, log4net:HostName=OS2PROD} - Received a message Response S172.21.4.55:11111:255083799*cli/46896996@59e71d72->S172.21.4.55:11111:255083799PullingAgentSystemTarget/FF/79c6218e@S79c6218e #205: global::Orleans.Streams.IStreamConsumerExtension:GetSequenceToken() for an unknown SystemTarget: S172.21.4.55:11111:255083799PullingAgentSystemTarget/FF/79c6218e@S79c6218e
also these on the client:
2018-02-04T12:13:52.3675833Z [7] WARN production my-client OutsideRuntimeClient {log4net:HostName=II1PROD, IPEndPoint=172.21.4.34:0, LoggerType=Runtime, log4net:Identity=, log4net:UserName=II1PROD\admin, EventCode=100011} - No callback for response message: Unrecoverable Rejection (info: SystemTarget PullingAgentSystemTarget/FF/79c6218e not active on this silo. Msg=Response S172.21.4.55:11111:255083799*cli/46896996@59e71d72->S172.21.4.55:11111:255083799PullingAgentSystemTarget/FF/79c6218e@S79c6218e #205: global::Orleans.Streams.IStreamConsumerExtension:GetSequenceToken()) Response S172.21.4.55:30000:0PullingAgentSystemTarget/FF/79c6218e@S79c6218e->*cli/46896996@59e71d72 #205: global::Orleans.Streams.IStreamConsumerExtension:GetSequenceToken()
@ilyalukyanov Thanks for the log lines; those lines support Sergey's suspicions. Calls to the pulling agent do not make it to the silo where it lives because it's a system target. If you add a gateway to all silos, does the issue resolve itself?
@jason-bragg I agree, looks like my issue is at least related if not the same. However, all my silos are identical and all have a gateway already.
I'm only doing one non-standard thing. As a step in gracefully terminating a silo, I remove its records from the Azure membership table directly (after the silo is shut down). I needed that with an older version of Kubernetes (before Stateful Sets were added) to prevent the Azure membership table from bloating.
That behaviour is no longer needed, so I'm planning to try without it, but I'm pretty sure I was reproducing the issue just by simultaneously starting silos and clients, with and without Docker.
@jason-bragg I've removed all the non-standard stuff and, as I thought, managed to reproduce the issue again.
What I was doing was:
- Restarted the first silo (Silo1). Silo1 successfully restarted and clients successfully reconnected to the stream.
- Restarted the second silo (Silo2). Silo2 successfully restarted, clients reconnected, and the issue happened.
I was doing the testing in Kubernetes with the Azure Table membership provider. Silos start one after another. When a silo is restarted it may or may not get a different IP. Thus a restart is essentially: destroy the old silo and create a new one.
What may be interesting, or may as well be completely random: when I tried a second time, the record in the membership table for the previous instance of Silo2 was still showing ShuttingDown status when the record for its new instance was added with status Alive. It transitioned to Dead shortly after.
Next, what I was experimenting with was how to recover from the issue:
- Restarted Silo2 again. Made no difference.
- Restarted Silo1, and that resolved the issue and streaming got back to normal.
I have two theories from that observation:
I'll try to do more experiments in these two directions.
@ilyalukyanov
Thanks for the information. I've a theory on this, but I've not had the time to investigate. Maybe you can help confirm.
_[Edit] - The theories below are flawed. When pulling agents are moved, they get new IDs, so any references to an old ID will fail, but all new operations should use the new IDs and work._
_Theory_
When clients make calls to the cluster they pick a gateway for each target (grain or system target), and this mapping does not change unless the gateway selected goes away. When using a persistent stream provider (like Azure queue) each silo must be specified as a gateway because messages to system targets are not routed. So if a pulling agent is moved (as may happen when a new silo joins the cluster) any clients that were talking to it will continue to send messages to the gateway where the silo used to be. These messages will not be routed and will cause streaming to fail.
_Workaround_
Try using a static queue balancer. StaticClusterConfigDeploymentBalancer should work. The behavior you should see with this is that while a silo is down, the streams from its queues will not be delivered, but when it comes back, they should begin to work again.
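For reference, a rough sketch of how the balancer might be selected via the 1.5-era programmatic config. Treat this as an assumption rather than the exact API: the property key and enum value names are from memory and may differ between versions, so check PersistentStreamProviderConfig for your build.
using System.Collections.Generic;
using Orleans.Providers.Streams.AzureQueue;
using Orleans.Runtime.Configuration;
using Orleans.Streams;

// Assumption: the persistent stream provider reads a "QueueBalancerType" property;
// verify the exact key and enum names against your Orleans version.
var siloConfig = ClusterConfiguration.LocalhostPrimarySilo();
siloConfig.Globals.RegisterStreamProvider<AzureQueueStreamProvider>(
    "AzureQueueProvider",
    new Dictionary<string, string>
    {
        { "DataConnectionString", "UseYourOwn" }, // your storage connection string
        { "QueueBalancerType", StreamQueueBalancerType.StaticClusterConfigDeploymentBalancer.ToString() }
    });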
@jason-bragg Thanks for your advice, I'll definitely investigate it.
So far I've tried something slightly different. I switched the queue balancer to DynamicClusterConfigDeploymentBalancer, which didn't fix the problem (which aligns with your theory), but brought some improvement.
I understand how StaticClusterConfigDeploymentBalancer would work around the issue, but ultimately it doesn't suit our application, as it runs in Kubernetes, which is inherently dynamic. When a silo dies, it gets replaced with a brand new silo. The dead one remains dead permanently. So I thought, since both balancers are subclasses of DeploymentBasedQueueBalancer, I'd try DynamicClusterConfigDeploymentBalancer first, which would have to fail if your theory is right.
It did fail, but a little better than with ConsistentRingQueueBalancer. I was seeing the same log messages, but:
Thus, it's already an improvement, as instead of fully losing streaming, we're now losing occasional messages (but regularly).
In order to continue the experiment I'm still going to try StaticClusterConfigDeploymentBalancer locally to fully validate your theory. I'll share the results here.
If the theory is confirmed, do you consider that behaviour as something to be improved in Orleans, or something that should be taken care of outside of its codebase? I'd be happy to contribute, but would need some support with designing the fix as I'm not super familiar with the code.
Another workaround I'm going to try is to dispose instances of the Orleans client and reconnect before resubscribing to streams, so that clients don't carry over any references to dead gateways and pulling agents. Do you think that could work? Can you think of any side effects of doing that?
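Roughly what I have in mind (an illustrative sketch only, assuming the 2.0-style ClientBuilder/IClusterClient API with Connect/Close/Dispose):
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime.Configuration;

// Illustrative sketch: drop the old client (and any stale gateway / pulling-agent
// references it may hold), build a fresh one, then resubscribe to the streams.
static class ClientRecycler
{
    public static async Task<IClusterClient> RebuildAndResubscribe(
        IClusterClient oldClient,
        ClientConfiguration config,
        Func<IClusterClient, Task> resubscribe)
    {
        try { await oldClient.Close(); } catch { /* the client may already be faulted */ }
        oldClient.Dispose();

        var client = new ClientBuilder()
            .UseConfiguration(config)
            .Build();
        await client.Connect();

        await resubscribe(client);
        return client;
    }
}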
@jason-bragg I've managed to get some time to test with StaticClusterConfigDeploymentBalancer and unfortunately it didn't make any difference, which is a bit surprising.
I've tested on a local machine: 2 silos + 2 clients. Silos were listening on different ports. Same scenario: restart one of the silos and observe what's happening in logs and the queue.
@ilyalukyanov That's surprising. With the static queue balancer, the queues should always be on the same silos. Maybe there's something I'm not understanding in how the client handles adding/removing gateways.
do you consider that behavior as something to be improved in Orleans, or something that should be taken care of outside of its codebase?
This is the result of a bug in our routing of messages from clients to system targets. A more detailed description can be found in "Orleans clients using PersistentStreamProviders must have configured gateways for every silo in the cluster. #1069". This is definitely something we should fix; it's just been rare enough that it's never taken priority.
I'd be happy to contribute but would need some support with designing the fix as I'm not super familiar with the code.
Understood. Your assistance would be welcome. Relatively few people are familiar with how client messages are routed, especially for system targets. I'd suggest taking a look at the comments in #1069 from @gabikliot and myself.
I was hoping we could find a workaround that unblocked you, but I'm not optimistic of that outcome at this point. To use the streaming feature as you're attempting (which is well within what is expected), I think #1069 will need to be addressed.
@jason-bragg
That's surprising. With the static queue balancer, the queues should always be on the same silos. Maybe there's something I'm not understanding in how the client handles adding/removing gateways.
I haven't explored the code deeply enough to answer this myself, but what if the issue isn't in the inability of gateways to route as such, but in how SystemTargets get their IDs? I.e. in the address S172.21.4.55:11111:255083799PullingAgentSystemTarget/FF/79c6218e@S79c6218e the last alphanumeric part feels like something that might not be constant. What if, when a silo restarts and a pulling agent reactivates, that part changes? Or is it some kind of consistent hash of something? If yes, then a hash of what, and could that change?
Another guess is that since the value of Generation changes after a restart and the address contains the old value, it can no longer be matched by that address, i.e. the lookup doesn't tolerate a change of Generation.
Those are just blind guesses and I'm going to validate them in the code. Just brain-dumping them here for now in case you know the answers off the top of your head.
This is the result of a bug in our routing of messages from clients to system targets. A more detailed description can be found in "Orleans clients using PersistentStreamProviders must have configured gateways for every silo in the cluster. #1069". This is definitely something we should fix; it's just been rare enough that it's never taken priority.
Thanks for confirming and giving me a direction. I think I looked through that issue before, but will explore it carefully in the new context.
What I'm also curious about is why this is rare. As it looks to me, everybody who's using persistent streams on more than one silo in a production system should suffer from it. Azure Queues, EventHub and SQL seem to be quite popular and probably together account for more than 50% of all stream providers in use. Where I'm leading is: maybe there's a way to run our cluster that's not vulnerable to this issue that I'm just not aware of?
I was hoping we could find a workaround that unblocked you, but I'm not optimistic of that outcome at this point. To use the streaming feature as you're attempting (which is well within what is expected), I think #1069 will need to be addressed.
I very much appreciate all your help and the work you guys are doing. We recently migrated our production system to 2.0.0-beta3 and have been running it for a couple of weeks so far. It's very good, a significant improvement, and it helped to solve so many issues we were having on 1.5.3. Streaming is really the last thing to resolve.
What I'm thinking of as a workaround is to drop persistent streams and use the Simple Message Stream Provider. After reviewing our code recently, I don't think we really benefit from persistent streams. The only thing is back pressure, which we would need to work around, but that seems like the lesser of two evils.
I've quickly tried to validate one of my random theories by just reading the code and think I might have found something.
This is where a SystemTarget is registered - PersistentStreamPullingManager.AddNewQueues:
var agentId = GrainId.NewSystemTargetGrainIdByTypeCode(Constants.PULLING_AGENT_SYSTEM_TARGET_TYPE_CODE);
var agent = new PersistentStreamPullingAgent(agentId, streamProviderName, providerRuntime, this.loggerFactory, pubSub, queueId, this.options);
providerRuntime.RegisterSystemTarget(agent);
There, agentId is a function of a random GUID - GrainId.NewSystemTargetGrainIdByTypeCode:
internal static GrainId NewSystemTargetGrainIdByTypeCode(int typeData)
{
    return FindOrCreateGrainId(UniqueKey.NewSystemTargetKey(Guid.NewGuid(), typeData));
}
That's our GrainId. Then in IncomingMessageAgent.ReceiveMessage, where the issue is logged:
SystemTarget target = directory.FindSystemTarget(msg.TargetActivation);
if (target == null)
{
    MessagingStatisticsGroup.OnRejectedMessage(msg);
    Message response = this.messageFactory.CreateRejectionResponse(msg, Message.RejectionTypes.Unrecoverable, String.Format("SystemTarget {0} not active on this silo. Msg={1}", msg.TargetGrain, msg));
    messageCenter.SendMessage(response);
    Log.Warn(ErrorCode.MessagingMessageFromUnknownActivation, "Received a message {0} for an unknown SystemTarget: {1}", msg, msg.TargetAddress);
    return;
}
It's looked up by TargetActivation, which should match the id it was registered with - ActivationDirectory.RecordNewSystemTarget:
systemTargets.TryAdd(target.ActivationId, target);
However, for SystemTargets the ActivationId is pretty much the GrainId, according to ActivationId.GetSystemActivation:
public static ActivationId GetSystemActivation(GrainId grain, SiloAddress location)
{
    if (!grain.IsSystemTarget)
        throw new ArgumentException("System activation IDs can only be created for system grains");
    return FindOrCreate(grain.Key);
}
Thus, since IDs are only stored in memory, the silo always starts with none of them, and a new random id gets generated for the SystemTarget. Then, when messages come in addressed to the old ID, we get the error, as the agent is now registered under a new ID.
This leads me to the following two questions:
@ilyalukyanov
The idea that this may be related to id management for the pulling agent system targets rather than a routing issue is interesting, as, at least to my knowledge, we've never demonstrated that this is necessarily a routing issue. That is just a theory at this point. What has been demonstrated is that persistent streams do not work unless there is a gateway configured on all the silos in the cluster. This shortcoming exists for regular use, let alone the recoverability scenarios you're exploring. It was this observation that led to the suspicion that there is a routing problem. I am of the opinion that this does strongly point to a routing problem, but that does not mean that the routing issue is the only problem.
Regarding pulling agent id management, the grain-based pubsub system does keep track of the IDs of stream publishers, which the pulling agent is, so the grains do store the pulling agent IDs. However, if a pulling agent is not active, the pubsub grains should recognize this and clean up the dead producers without the error propagating to the application layer. I encourage independent examination of this logic, so please don't take my word that this works as intended. The logic that is intended to handle this is in PubSubRendezvousGrain.ExecuteProducerTask. OrleansMessageRejectionException is the error we expect to receive when a pulling agent is no longer available.
Looking at that error case, it looks like we may be too strict: we not only check whether the faulty producer is a system target, but also check that the silo it was on is no longer available. The silo availability check would only be valid if pulling agents were only shut down when their silos were, which is not the case. I suspect this is left over from earlier expectations, when queues did not move at runtime. This looks wrong, but I need to think about it a bit. Since producer IDs are stored, it's possible for the grain to be rehydrated on a silo that is not yet fully connected to the cluster, in which case valid pulling agents may not be reachable. I'm not saying this is the case, only that the original check was likely added for a reason, and I'm hesitant to remove it without a clearer understanding of its original purpose. Maybe @gabikliot can shed some light here.