SignalR: With Redis scale-out, once the Send() method runs into an error, the stream will never be opened again until Redis is restarted

Created on 19 Jun 2014 · 27 Comments · Source: SignalR/SignalR

With Redis scale-out, once the Send() method runs into an error, if the instance doesn't receive any message, the stream in that instance is never opened again until Redis is restarted. No connection in the instance can send a message; every send returns the same error. Restarting Redis fixes the issue.

Repro steps:

1). Take the ASP.NET SignalR sample and update it to use Redis scale-out, with just one web app using Redis. Start Redis.
2). Since the issue is not easy to repro, I added the code below to the RedisMessageBus.Send() method so that it throws when the message contains "throw":

        protected override Task Send(int streamIndex, IList<Message> messages)
        {
            // Repro hook: fail on purpose when the first message contains "throw".
            var message = messages[0].GetString();
            if (message.Contains("throw"))
            {
                throw new Exception("Verify Throwing an exception from Send()");
            }
            . . . . . .
        }

3). Browse to the Hubs/HubConnectionAPI/Default.aspx page.
4). Enter "throw" in the "To Everybody" textbox and click "Broadcast". This causes the Send method to throw.
5). Enter "aaaa" in the "To Everybody" textbox and click "Broadcast". Observe "Error: Verify Throwing an exception from Send()".
6). Wait for 1 minute or more, then click "Broadcast" again. Repeat this step several times.

Expected result:

As with Service Bus (SB) scale-out and SqlServer scale-out, "Broadcast" should send the messages successfully.

Actual result:

"Error: Verify Throwing an exception from Send()" was returned every time in step (6), even though those messages should not cause the Send method to throw.

Bug

Xiaohongt

Most helpful comment

We're hitting this issue at least once a week in our production scenario deployed to 4 Azure webroles when a scaling operation creates new webroles. In our case restarting the webrole (not the Redis server) addresses the issue.

Is there a workaround to prevent this issue from happening with the current SignalR release (2.2.0)? Or could this issue be addressed in the next release?

reymarx on 6 Sep 2015

👍2

All 27 comments

@Xiaohongt try reproing this by throwing an error in the Send method in RedisMessageBus and see if the subsequent sends reach that point. I did that and it seems to work.

abnanda1 on 20 Jun 2014

Note:

1). This repros with one instance, and it also repros in the old Redis scale-out.
2). This doesn't repro in SB scale-out / SqlServer scale-out with one instance.
E.g. in SB scale-out, when sending a message fails and no message can be received, the SB scale-out receiver waits 60 seconds and then opens the stream.

Xiaohongt on 25 Jun 2014

@Xiaohongt can you also try it by returning a faulted task in Send, rather than throwing directly? I'd be interested if the behavior is the same.
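For readers comparing the two failure modes: they differ in when the caller sees the exception. The following is a minimal, self-contained sketch; `SendThrowing` and `SendFaulted` are hypothetical names made up for illustration and are not part of SignalR.

```csharp
using System;
using System.Threading.Tasks;

static class Program
{
    // Throws before any Task exists; the caller sees the exception at the call site.
    static Task SendThrowing()
    {
        throw new InvalidOperationException("thrown directly");
    }

    // Returns a Task that is already faulted; the caller sees the exception
    // only when it awaits, waits on, or inspects the Task.
    static Task SendFaulted()
    {
        var tcs = new TaskCompletionSource<object>();
        tcs.SetException(new InvalidOperationException("faulted task"));
        return tcs.Task;
    }

    static void Main()
    {
        try
        {
            SendThrowing();
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine("caught at call: " + ex.Message);
        }

        Task faulted = SendFaulted();   // no exception is raised here
        Console.WriteLine("IsFaulted: " + faulted.IsFaulted);
        Console.WriteLine("inner: " + faulted.Exception.InnerException.Message);
    }
}
```

Whether the scale-out stream treats the two cases differently is exactly what the question above is asking; the sketch only shows the difference at the language level.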

DamianEdwards on 25 Jun 2014

Here is one actual exception from StackExchange.Redis that I saw once before; at that time, sending messages always returned this error (it is not easy to repro):

SignalR.ScaleoutMessageBus Error: 0 : Stream(0) - Send failed: System.AggregateException: One or more errors occurred. ---> System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'XIAOTA01-01'.
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteAsyncImpl[T](Message message, ResultProcessor`1 processor, Object state, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 1629
   at StackExchange.Redis.RedisBase.ExecuteAsync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 72
   at StackExchange.Redis.RedisDatabase.ScriptEvaluateAsync(String script, RedisKey[] keys, RedisValue[] values, CommandFlags flags) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line 781
   at Microsoft.AspNet.SignalR.Redis.RedisMessageBus.Send(Int32 streamIndex, IList`1 messages) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBus.cs:line 62
   at Microsoft.AspNet.SignalR.Messaging.ScaleoutStreamManager.Send(Object state) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Core\Messaging\ScaleoutStreamManager.cs:line 86
   at Microsoft.AspNet.SignalR.Messaging.ScaleoutStreamManager.<Send>b__0(Object state) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Core\Messaging\ScaleoutStreamManager.cs:line 71
   at Microsoft.AspNet.SignalR.Messaging.ScaleoutStream.SendContext.InvokeSend() in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Core\Messaging\ScaleoutStream.cs:line 320
   --- End of inner exception stack trace ---
---> (Inner Exception #0) System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'XIAOTA01-01'.
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteAsyncImpl[T](Message message, ResultProcessor`1 processor, Object state, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 1629
   at StackExchange.Redis.RedisBase.ExecuteAsync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 72
   at StackExchange.Redis.RedisDatabase.ScriptEvaluateAsync(String script, RedisKey[] keys, RedisValue[] values, CommandFlags flags) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line 781
   at Microsoft.AspNet.SignalR.Redis.RedisMessageBus.Send(Int32 streamIndex, IList`1 messages) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBus.cs:line 62
   at Microsoft.AspNet.SignalR.Messaging.ScaleoutStreamManager.Send(Object state) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Core\Messaging\ScaleoutStreamManager.cs:line 86
   at Microsoft.AspNet.SignalR.Messaging.ScaleoutStreamManager.<Send>b__0(Object state) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Core\Messaging\ScaleoutStreamManager.cs:line 71
   at Microsoft.AspNet.SignalR.Messaging.ScaleoutStream.SendContext.InvokeSend() in d:\dd\SignalR_Dev_

Xiaohongt on 25 Jun 2014

Update: @Xiaohongt, you should not catch the exception if you want to repro the faulted task scenario. How did you repro this issue in SB and SQL scale-out? Were you throwing an exception directly, or were you returning a faulted task?

abnanda1 on 26 Jun 2014

Right. Below is the updated Send() method in RedisMessageBus, which returns a faulted task when the message contains "faulted".

The behavior is the same; the subsequent sends return the same error:

1). Enter "faulted" in the "To Everybody" textbox and click "Broadcast". The client receives the error "Verify faulted Task from Send()".
2). Enter "aaaa" in the "To Everybody" textbox and click "Broadcast". The client receives the same error.

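        protected override Task Send(int streamIndex, IList<Message> messages)
        {
            var message = messages[0].GetString();
            if (message.Contains("throw"))
            {
                // Throws synchronously, before any Task is returned.
                throw new Exception("Verify Throwing an exception from Send()");
            }

            if (message.Contains("faulted"))
            {
                // Returns a Task that faults when it runs. The method is not
                // marked async so that the Task can be returned directly; the
                // explicit Action avoids overload ambiguity for a throw-only lambda.
                return Task.Run(new Action(() =>
                {
                    throw new Exception("Verify faulted Task from Send()");
                }));
            }
            else
            {
                . . . . .
            }
        }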

Xiaohongt on 27 Jun 2014

In SB and SQL scale-out, I throw an exception directly in Send(). For SB scale-out, subsequent sends succeed after 60 seconds; for SQL scale-out, subsequent sends succeed after less than 60 seconds.

Xiaohongt on 27 Jun 2014

I have tried repro-ing this issue for Redis and SB. As @Xiaohongt mentioned earlier, in Redis, once we set the error after an exception, we don't set it back to null. As a result, all subsequent sends for that particular stream fail. For SB, on the other hand, OnMessage is triggered after an exception, and that re-opens the faulted stream, allowing subsequent sends to succeed. (I am still unsure why OnMessage is triggered in SB after an exception.)
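The behavior described here can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the SignalR source: `StickyStream` is a hypothetical stand-in for a scale-out stream that caches the last send failure and only clears it when the stream is reopened.

```csharp
using System;

// Hypothetical stand-in for a scale-out stream; NOT the SignalR ScaleoutStream source.
class StickyStream
{
    private Exception _error;   // cached failure; null means the stream is usable

    // Rethrows the cached error if there is one; otherwise runs the send
    // and caches any failure it produces.
    public void Send(Action send)
    {
        if (_error != null)
        {
            throw _error;
        }

        try
        {
            send();
        }
        catch (Exception ex)
        {
            _error = ex;
            throw;
        }
    }

    // Reopening the stream is the only thing that clears the cached error.
    public void Open()
    {
        _error = null;
    }
}

static class Program
{
    static string TrySend(StickyStream stream, Action send)
    {
        try
        {
            stream.Send(send);
            return "ok";
        }
        catch (Exception ex)
        {
            return "Error: " + ex.Message;
        }
    }

    static void Main()
    {
        var stream = new StickyStream();
        Console.WriteLine(TrySend(stream, () => { throw new Exception("boom"); })); // Error: boom
        Console.WriteLine(TrySend(stream, () => { }));                              // Error: boom
        stream.Open();                                                              // simulated reopen
        Console.WriteLine(TrySend(stream, () => { }));                              // ok
    }
}
```

In this model, the SB case corresponds to something calling `Open()` after the failure, and the Redis case corresponds to nothing ever calling it.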

abnanda1 on 27 Jun 2014

After discussion, we plan to fix this in v3 as well as removing QueuingBehavior.Always in v3.

Xiaohongt on 5 Aug 2014

I can also see the error with SignalR.
My other connections to SignalR work.

(edited, the first investigation was wrong)

MosheL on 8 Nov 2015

I also am hoping to see a 2.2.0 workaround or some way to mitigate this.

circuitrider on 10 Dec 2015

I think a workaround could be a try/catch in each PersistentConnection, so that it tries not to fail.
For example, send data with MVC/ASHX and not within SignalR.

MosheL on 13 Dec 2015

Had this issue this morning on 2.2. No hope of a fix? This is a pretty detrimental issue.

AnQueth on 23 Jun 2016

👍1

The problem comes from never clearing the _error variable if the StreamState doesn't change. I'm assuming there is a race condition between the setting and clearing of this _error variable and the state change. I ran StackExchange.Redis in verbose mode to make sure it was reconnecting, and it is. I then went into the Open method of ScaleoutStream.cs and moved the _error = null to before the ChangeState if statement, and it now appears to work correctly when StackExchange.Redis reconnects.

Is there a reason this approach wouldn't work or be accepted as a fix for this rather big issue in SignalR that is really affecting production environments?
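The change described in this comment can be sketched as follows. This is an illustrative model only: `StreamModel`, its `StreamState` enum, and the `resetBeforeStateCheck` switch are assumptions made for the example, and the real Open method contains more than is shown here.

```csharp
using System;

enum StreamState { Initial, Open, Closed }

// Illustrative model only; this is not the SignalR ScaleoutStream source.
class StreamModel
{
    private readonly bool _resetBeforeStateCheck;
    private StreamState _state = StreamState.Initial;
    private Exception _error;

    public StreamModel(bool resetBeforeStateCheck)
    {
        _resetBeforeStateCheck = resetBeforeStateCheck;
    }

    public bool HasError
    {
        get { return _error != null; }
    }

    // Records a send failure without changing the state, which is the
    // situation the comment above suspects.
    public void SetError(Exception error)
    {
        _error = error;
    }

    // Returns true only when the state actually changes.
    private bool ChangeState(StreamState newState)
    {
        if (_state == newState)
        {
            return false;
        }
        _state = newState;
        return true;
    }

    public void Open()
    {
        if (_resetBeforeStateCheck)
        {
            _error = null;   // moved placement: always clear on Open
        }

        if (ChangeState(StreamState.Open))
        {
            _error = null;   // placement as described: clear only on a real transition
        }
    }
}

static class Program
{
    static bool ErrorSurvivesReopen(bool resetBeforeStateCheck)
    {
        var stream = new StreamModel(resetBeforeStateCheck);
        stream.Open();                                   // Initial -> Open
        stream.SetError(new Exception("SocketFailure"));
        stream.Open();                                   // already Open, no transition
        return stream.HasError;
    }

    static void Main()
    {
        Console.WriteLine("reset inside state check: " + ErrorSurvivesReopen(false)); // True
        Console.WriteLine("reset before state check: " + ErrorSurvivesReopen(true));  // False
    }
}
```

The model only shows why the placement matters when Open is called on a stream that is already open; it says nothing about other side effects the moved line might have in the real class.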

AnQueth on 13 Sep 2016

There seems to be a race condition. The theory is that an error triggered a reconnect to Redis. The reconnect happens on one thread and clears the _error variable. However, a failing Send on a different thread that used the old connection (i.e. from before the successful reconnect) is not aware of the reconnect and sets _error back to the error that triggered the reconnect. At this point we have an open and working connection with _error set. Since the connection is open and working, nothing is going to clear _error (it is only cleared when a connection is opened), so the error is rethrown each time Send is called.
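The interleaving in this theory can be replayed deterministically. The sketch below is a model under stated assumptions, not the SignalR source: `RacyStream` is hypothetical, and a `TaskCompletionSource` stands in for a Redis call that is still in flight on the old connection when the reconnect completes.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical model of the theory above; NOT the SignalR source.
class RacyStream
{
    private readonly object _lock = new object();
    private Exception _error;

    // Called when a connection is (re)opened; the only place the error is cleared.
    public void Open()
    {
        lock (_lock) { _error = null; }
    }

    public Task Send(Func<Task> send)
    {
        lock (_lock)
        {
            if (_error != null)
            {
                // A cached error short-circuits every later send.
                var failed = new TaskCompletionSource<object>();
                failed.SetException(_error);
                return failed.Task;
            }
        }

        // When the underlying call fails, cache the error, however late that is.
        return send().ContinueWith(t =>
        {
            if (t.IsFaulted)
            {
                lock (_lock) { _error = t.Exception.InnerException; }
                throw t.Exception.InnerException;
            }
        }, TaskContinuationOptions.ExecuteSynchronously);
    }
}

static class Program
{
    static void Main()
    {
        var stream = new RacyStream();
        var oldConnectionCall = new TaskCompletionSource<object>();

        // 1. A send starts on the old connection and is still in flight.
        Task inFlight = stream.Send(() => oldConnectionCall.Task);

        // 2. The reconnect completes and clears the error.
        stream.Open();

        // 3. The stale send fails late and caches its error again.
        oldConnectionCall.SetException(new Exception("SocketFailure on EVAL"));
        try { inFlight.Wait(); } catch (AggregateException) { }

        // 4. A send on the healthy connection now fails with the stale error.
        Task next = stream.Send(() => Task.FromResult(0));
        Console.WriteLine("IsFaulted: " + next.IsFaulted);
        Console.WriteLine("error: " + next.Exception.InnerException.Message);
    }
}
```

After step 3 nothing in the model calls `Open()` again, so every later send fails even though the delegate passed in step 4 would have succeeded.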

moozzyk on 24 Sep 2016

I no longer think it's possible for RedisMessageBus.ConnectWithRetry() to dispose the IRedisConnection (and therefore the ConnectionMultiplexer which is the source of the ODE) without the RedisMessageBus moving to the disposed state itself.

My new theory is that RedisMessageBus.ConnectToRedisAsync() succeeds (therefore preventing a retry), but the OnConnectionRestored callback is called before ConnectToRedisAsync() returns, thereby moving the RedisMessageBus into the Connected state prematurely. This unexpected state would cause ConnectWithRetry() to call Shutdown() without any call to Dispose() and exit the retry loop. This in turn would dispose the ConnectionMultiplexer, explaining the ODE.

It's tough to verify this without a repro. If we could get a memory dump of an app in the failing state, we could at least verify the RedisMessageBus is in the Disposed state but not the DefaultDependencyResolver indicating that RedisMessageBus.Dispose() was never called.

halter73 on 24 Sep 2016

👍1

There are 2 scenarios: one with the object disposed, which is very rare (I've only seen it once), and the more common one, where _error is not cleared but the connection multiplexer says everything is connected and working. I've done a lot of WinDbg analysis on this issue. The best way to recreate it is to put a huge load on the server (which can start affecting TCP connections and such) and then attach WinDbg when the issue is happening.

AnQueth on 26 Sep 2016

@Quethrosar - any chance you could share a dump?

moozzyk on 26 Sep 2016

I guess it would be possible for the _error field not to get reset anytime there is an error originating from ScriptEvaluateAsync without a subsequent invocation of the OnConnectionRestored callback. This could be due to a race, or even just a misunderstanding of the behavior of the ConnectionMultiplexer API.

@Quethrosar Since the ODE is rare, what is the more common exception(s) not cleared from _error? A dump would also be super helpful of course.

halter73 on 26 Sep 2016

ScriptEvaluateAsync called from OnConnectionRestored sometimes failed before 2.2.1 when there was no value in the cache (https://github.com/SignalR/SignalR/issues/3436)

moozzyk on 26 Sep 2016

I'll try to get a memory dump for you, but here's the usual call stack from our logging.

In our case it seems to be SocketFailure on EVAL. I have verified that StackExchange.Redis reconnects. I also verified that killing the client connection from Redis itself fixes the issue, although this is not acceptable.

StackExchange.Redis.RedisConnectionException: SocketFailure on EVAL
   at Microsoft.AspNet.SignalR.Messaging.ScaleoutStream.Send(Func`2 send, Object state)
   at Microsoft.AspNet.SignalR.Infrastructure.Connection.Send(ConnectionMessage message)
   at Microsoft.AspNet.SignalR.Transports.TransportConnectionExtensions.SendCommand(ITransportConnection connection, String connectionId, CommandType commandType)
   at Microsoft.AspNet.SignalR.Transports.ForeverTransport.<>c__DisplayClass1f.<ProcessReceiveRequest>b__1c()
   at Microsoft.AspNet.SignalR.Transports.ForeverTransport.ProcessMessages(ITransportConnection connection, Func`1 initialize)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Owin.Mapping.MapMiddleware.<Invoke>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Owin.Security.Infrastructure.AuthenticationMiddleware`1.<Invoke>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Owin.Security.Infrastructure.AuthenticationMiddleware`1.<Invoke>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContextStage.<RunApp>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Owin.Security.Infrastructure.AuthenticationMiddleware`1.<Invoke>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContextStage.<RunApp>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContext.<DoFinalWork>d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContext.EndFinalWork(IAsyncResult ar)
   at System.Web.HttpApplication.AsyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
   at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)

AnQueth on 26 Sep 2016

I've also faced this issue over the last few days. I'm running my website on AWS and I'm using an ElastiCache node to store session data and as a backplane for SignalR. I found that creating a whole new ElastiCache node solved the issue. Obviously this isn't ideal, though.

I have faced a few instances where ElastiCache nodes have become unusable and were not remedied by a simple reboot of the node.

coultonluke on 30 Sep 2016

Update: I looked at a server with the problem a few times and can see some more information.

I am also using StackExchange.Redis outside of SignalR, for our cache and more. The same errors were received from other uses of the same Redis server/connections. Maybe the problem started with a small network disconnect.

IISRESET (or Recycle) "solves" the problem.

I think the main problem may be StackExchange.Redis itself, not something specific to SignalR.

MosheL on 2 Oct 2016

moozzyk on 19 Jan 2017

Maybe we are hitting this: https://github.com/StackExchange/StackExchange.Redis/issues/38. While the issue is not closed, there were several fixes in this area in StackExchange.Redis, which we will get after we update to a newer version of StackExchange.Redis.
/cc @anurse

moozzyk on 27 Jan 2017

I've identified a place where we are not properly resetting the _error variable and will have a fix out soon. I can't be 100% sure it will solve all cases where this happens since I haven't been able to repro the specific cases described in here.

The scenario I've been able to fix is this:

1). Redis is down, the app starts up and tries to connect
2). The app fails to connect and caches the error, but some of our internal state believes that the connection is active
3). Redis is brought up and the app reconnects
4). The app detects that the connection has been restored but because of the corrupted state in 2 above, we don't reset the connection

I've managed to figure out why we don't detect the connection failure and add some code to fix that. I've also added additional tracing to help identify these issues in the future.