With Redis scale-out, once the Send() method runs into an error and the instance doesn't receive any message, the stream in that instance is never opened again until Redis is restarted. No connection in the instance can send a message; every send returns the same error. Restarting Redis fixes the issue.
1). Take the ASP.NET SignalR sample, update it to use Redis scale-out with just one web app using Redis, and start Redis.
2). Since the issue is not easy to repro, I added the code below to the RedisMessageBus.Send() method so that it throws when the message contains “throw”:
protected override Task Send(int streamIndex, IList<Message> messages)
{
    var message = messages[0].GetString();
    if (message.Contains("throw"))
    {
        throw new Exception("Verify Throwing an exception from Send()");
    }
    . . . . . .
}
3). Browse to Hubs/HubConnectionAPI/Default.aspx page
4). Enter "throw" in the "To Everybody" textbox and click the "Broadcast" button. This causes the Send method to throw.
5). Enter "aaaa" in the "To Everybody" textbox and click "Broadcast". Observe "Error: Verify Throwing an exception from Send()".
6). Wait for 1 minute or more, then click the "Broadcast" button again. Repeat this step several times.
Expected: As with SB scale-out / SqlServer scale-out, "Broadcast" sends the messages successfully.
Actual: "Error: Verify Throwing an exception from Send()" is always thrown in step (6), even for messages that should not cause the Send method to throw.
@Xiaohongt try reproing this by throwing an error in the Send method in RedisMessageBus and see if the subsequent sends reach that point. I did that and it seems to work.
1). This repros with one instance; it also repros in the old Redis scale-out.
2). This doesn’t repro in SB scale-out / SqlServer scale-out with one instance.
E.g. in SB scale-out, when sending a message fails and no message can be received, the SB scale-out receiver waits 60 seconds and then opens the stream.
@Xiaohongt can you also try it by returning a faulted task in Send, rather than throwing directly? I'd be interested if the behavior is the same.
Here is one actual exception from StackExchange.Redis that I saw once before, when every send returned this same error (it is not easy to repro):
SignalR.ScaleoutMessageBus Error: 0 : Stream(0) - Send failed: System.AggregateException: One or more errors occurred. ---> System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'XIAOTA01-01'.
at StackExchange.Redis.ConnectionMultiplexer.ExecuteAsyncImpl[T](Message message, ResultProcessor`1 processor, Object state, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 1629
at StackExchange.Redis.RedisBase.ExecuteAsync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 72
at StackExchange.Redis.RedisDatabase.ScriptEvaluateAsync(String script, RedisKey[] keys, RedisValue[] values, CommandFlags flags) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line 781
at Microsoft.AspNet.SignalR.Redis.RedisMessageBus.Send(Int32 streamIndex, IList`1 messages) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBus.cs:line 62
at Microsoft.AspNet.SignalR.Messaging.ScaleoutStreamManager.Send(Object state) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Core\Messaging\ScaleoutStreamManager.cs:line 86
at Microsoft.AspNet.SignalR.Messaging.ScaleoutStreamManager.<Send>b__0(Object state) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Core\Messaging\ScaleoutStreamManager.cs:line 71
at Microsoft.AspNet.SignalR.Messaging.ScaleoutStream.SendContext.InvokeSend() in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Core\Messaging\ScaleoutStream.cs:line 320
--- End of inner exception stack trace ---
---> (Inner Exception #0) System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'XIAOTA01-01'.
at StackExchange.Redis.ConnectionMultiplexer.ExecuteAsyncImpl[T](Message message, ResultProcessor`1 processor, Object state, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 1629
at StackExchange.Redis.RedisBase.ExecuteAsync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 72
at StackExchange.Redis.RedisDatabase.ScriptEvaluateAsync(String script, RedisKey[] keys, RedisValue[] values, CommandFlags flags) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line 781
at Microsoft.AspNet.SignalR.Redis.RedisMessageBus.Send(Int32 streamIndex, IList`1 messages) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBus.cs:line 62
at Microsoft.AspNet.SignalR.Messaging.ScaleoutStreamManager.Send(Object state) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Core\Messaging\ScaleoutStreamManager.cs:line 86
at Microsoft.AspNet.SignalR.Messaging.ScaleoutStreamManager.<Send>b__0(Object state) in d:\dd\SignalR_Dev_2.1\SignalR\src\Microsoft.AspNet.SignalR.Core\Messaging\ScaleoutStreamManager.cs:line 71
at Microsoft.AspNet.SignalR.Messaging.ScaleoutStream.SendContext.InvokeSend() in d:\dd\SignalR_Dev_
*update
@Xiaohongt so you should not catch the exception if you want to repro the faulted task scenario. How did you repro this issue in SB and SQL scale-out? Were you throwing an exception directly, or were you returning a faulted task?
Right. Below is the updated Send() method in RedisMessageBus, which returns a faulted task when the message contains “faulted”.
The behavior is the same; the subsequent sends return the same error:
1). Enter "faulted" in the "To Everybody" textbox and click "Broadcast". The client receives the error "Verify faulted Task from Send()".
2). Enter "aaaa" in the "To Everybody" textbox and click "Broadcast". The client receives the same error.
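protected override Task Send(int streamIndex, IList<Message> messages)
{
    var message = messages[0].GetString();

    // Scenario 1: throw synchronously from Send().
    if (message.Contains("throw"))
    {
        throw new Exception("Verify Throwing an exception from Send()");
    }

    // Scenario 2: return a task that faults. The method is not marked async,
    // because an async Task method cannot return a Task expression.
    if (message.Contains("faulted"))
    {
        return Task.Run(() =>
        {
            throw new Exception("Verify faulted Task from Send()");
        });
    }
    else
    {
        . . . . .
    }
}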
In SB and SQL scale-out, I throw an exception directly in Send(). For SB scale-out, the subsequent sends succeed after 60 seconds; for SQL scale-out, the subsequent sends succeed after less than 60 seconds.
I have tried repro-ing this issue for Redis and SB. As @Xiaohongt mentioned earlier, in Redis, once we set the error after an exception, we never set it back to null. As a result, all the subsequent sends for that particular stream fail. For SB, on the other hand, OnMessage is triggered after an exception, and that re-opens the faulted stream, allowing the subsequent sends to succeed. (I am still unsure why OnMessage is being triggered in SB after an exception.)
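The latching behavior described above can be illustrated with a small model. This is only a sketch under assumptions: `LatchingStream`, `TrySend`, and the "boom" message are hypothetical names made up for illustration, and the class is a simplification of the idea (remember the last send error until the stream is reopened), not the SignalR source.

```csharp
using System;

// Hypothetical, simplified model of a scale-out stream; not the SignalR source.
class LatchingStream
{
    // Last send failure, remembered until the stream is reopened.
    private Exception _error;

    // Reopening the stream is the only thing that clears the remembered error.
    public void Open()
    {
        _error = null;
    }

    public void Send(Action send)
    {
        // While an error is remembered, every send fails with that same error.
        if (_error != null)
        {
            throw _error;
        }

        try
        {
            send();
        }
        catch (Exception ex)
        {
            _error = ex;
            throw;
        }
    }
}

static class Program
{
    // Returns "ok" if the send succeeded, otherwise the exception message.
    public static string TrySend(LatchingStream stream, Action send)
    {
        try
        {
            stream.Send(send);
            return "ok";
        }
        catch (Exception ex)
        {
            return ex.Message;
        }
    }

    static void Main()
    {
        var stream = new LatchingStream();
        stream.Open();

        // The first send fails and latches the error.
        Console.WriteLine(TrySend(stream, () => { throw new Exception("boom"); })); // boom
        // A send that would succeed on its own still reports the latched error.
        Console.WriteLine(TrySend(stream, () => { }));                              // boom
        // Reopening (what the receive path does for SB, per the comment above) clears it.
        stream.Open();
        Console.WriteLine(TrySend(stream, () => { }));                              // ok
    }
}
```

In this model, a bus whose receive path never calls `Open()` again after the failure stays in the "boom" state forever, which matches the Redis symptom reported in this thread.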
After discussion, we plan to fix this in v3 as well as removing QueuingBehavior.Always in v3.
We're hitting this issue at least once a week in our production scenario deployed to 4 Azure webroles when a scaling operation creates new webroles. In our case restarting the webrole (not the Redis server) addresses the issue.
Is there a workaround to prevent this issue from happening with the current SignalR release (2.2.0)? Or could this issue be addressed in the next release?
I can see this error with SignalR as well.
My other connections to SignalR work.
(edited, the first investigation was wrong)
I also am hoping to see a 2.2.0 workaround or some way to mitigate this.
I think a workaround could be a try/catch in any PersistentConnection, so that it tries not to fail; for example, send the data through MVC/ASHX instead of through SignalR.
Had this issue this morning on 2.2. Is there no hope of a fix? This is a pretty detrimental issue.
The problem comes from never clearing the _error variable if the StreamState doesn't change. I'm assuming there is a race condition between the setting and clearing of this _error variable and the state change. I ran StackExchange.Redis in verbose mode to make sure it was reconnecting, and it is. I then went into the Open method in ScaleoutStream.cs and moved the _error = null to before the ChangeState if statement, and it now appears to work correctly when StackExchange.Redis reconnects.
Is there a reason this approach wouldn't work or be accepted as a fix for this rather big issue in SignalR that is really affecting production environments?
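To make the proposed change concrete, here is a minimal sketch of the two orderings. Everything in it is an assumption for illustration: `SketchStream`, `SetError`, and the `clearBeforeStateCheck` switch are hypothetical, and the `Open` body only mirrors the description in the comment above (clear the error only when the state changes, versus clear it before the state check); it is not the actual ScaleoutStream code.

```csharp
using System;

enum StreamState { Initial, Open, Closed }

// Hypothetical, simplified stand-in for the stream discussed above; not the SignalR source.
class SketchStream
{
    private readonly bool _clearBeforeStateCheck;

    public StreamState State { get; private set; }
    public Exception Error { get; private set; }

    public SketchStream(bool clearBeforeStateCheck)
    {
        _clearBeforeStateCheck = clearBeforeStateCheck;
        State = StreamState.Initial;
    }

    // Stands in for a failed send recording its exception.
    public void SetError(Exception error)
    {
        Error = error;
    }

    public void Open()
    {
        if (_clearBeforeStateCheck)
        {
            // Proposed ordering: clear the error on every Open call.
            Error = null;
        }

        if (State != StreamState.Open)
        {
            State = StreamState.Open;
            // Ordering as described in the comment: clear only on a state transition.
            Error = null;
        }
    }
}

static class Program
{
    // Opens the stream, records an error while it is open, opens it again,
    // and reports whether the error is still set afterwards.
    public static bool ErrorSurvivesReopen(bool clearBeforeStateCheck)
    {
        var stream = new SketchStream(clearBeforeStateCheck);
        stream.Open();
        stream.SetError(new Exception("SocketFailure on EVAL"));
        stream.Open(); // the state is already Open, so no transition happens here
        return stream.Error != null;
    }

    static void Main()
    {
        Console.WriteLine(ErrorSurvivesReopen(false)); // True: the error stays set
        Console.WriteLine(ErrorSurvivesReopen(true));  // False: the error is cleared
    }
}
```

The sketch only shows why the ordering matters when `Open` is called on a stream that is already open; whether clearing unconditionally is safe in the real class depends on details not covered here.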
There seems to be a race condition. The theory here is that an error triggered a reconnect to Redis. The reconnect happens on one thread and clears the _error variable. However, a failing Send on a different thread that used the old connection (i.e. from before the successful reconnect) is not aware of the reconnect and sets _error back to the error which triggered the reconnect. At this point we have an open and working connection with _error set. Since the connection is open and working, nothing is going to clear _error (it is only cleared when a connection is opened), and the error is rethrown each time Send is called.
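The interleaving in this theory can be replayed deterministically. The sketch below is an assumption-laden model, not SignalR code: `RaceReplay`, `Replay`, and the event strings are hypothetical, and the only rules modeled are the two stated in the theory (a reconnect clears the error, a failed send sets it).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical replay of the theorized interleaving; not the SignalR source.
static class RaceReplay
{
    // Applies the events in order. Only "reconnected" clears the error,
    // and any "send-failed:<reason>" event sets it.
    public static string Replay(IEnumerable<string> events)
    {
        const string failedPrefix = "send-failed:";
        string error = null;
        bool connected = false;

        foreach (var e in events)
        {
            if (e == "reconnected")
            {
                connected = true;
                error = null;
            }
            else if (e == "disconnected")
            {
                connected = false;
            }
            else if (e.StartsWith(failedPrefix))
            {
                error = e.Substring(failedPrefix.Length);
            }
        }

        return "connected=" + connected + ", error=" + (error ?? "none");
    }

    static void Main()
    {
        // The failure is observed before the reconnect: the reconnect clears it.
        Console.WriteLine(Replay(new[] { "disconnected", "send-failed:SocketFailure", "reconnected" }));
        // prints "connected=True, error=none"

        // The failure from the old connection is observed after the reconnect:
        // the connection is healthy, but the error is stuck.
        Console.WriteLine(Replay(new[] { "disconnected", "reconnected", "send-failed:SocketFailure" }));
        // prints "connected=True, error=SocketFailure"
    }
}
```

The second ordering ends in exactly the state reported in this thread: a connected multiplexer with a stale error that nothing will clear until another reconnect happens.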
I no longer think it's possible for RedisMessageBus.ConnectWithRetry() to dispose the IRedisConnection (and therefore the ConnectionMultiplexer which is the source of the ODE) without the RedisMessageBus moving to the disposed state itself.
My new theory is that RedisMessageBus.ConnectToRedisAsync() succeeds (therefore preventing a retry), but the OnConnectionRestored callback is called before ConnectToRedisAsync() returns, thereby moving the RedisMessageBus into the Connected state prematurely. This unexpected state would cause ConnectWithRetry() to call Shutdown() without any call to Dispose() and exit the retry loop. This in turn would dispose the ConnectionMultiplexer, explaining the ODE.
It's tough to verify this without a repro. If we could get a memory dump of an app in the failing state, we could at least verify the RedisMessageBus is in the Disposed state but not the DefaultDependencyResolver indicating that RedisMessageBus.Dispose() was never called.
There are 2 scenarios: one with the object disposed, which is very rare (I've only seen it once), and the more common one, where _error is not cleared but the connection multiplexer says everything is connected and working. I've done a lot of WinDbg analysis on this issue. The best way to recreate it is to put a huge load on the server (which can start affecting TCP connections and such) and then attach WinDbg while the issue is happening.
@Quethrosar - any chance you could share a dump?
I guess it would be possible for the _error field not to get reset anytime there is an error originating from ScriptEvaluateAsync without a subsequent invocation of the OnConnectionRestored callback. This could be due to a race, or even just a misunderstanding of the behavior of the ConnectionMultiplexer API.
@Quethrosar Since the ODE is rare, what is the more common exception(s) not cleared from _error? A dump would also be super helpful of course.
ScriptEvaluateAsync called from OnConnectionRestored sometimes failed before 2.2.1 when there was no value in the cache (https://github.com/SignalR/SignalR/issues/3436)
I'll try to get a memory dump for you, but here's the usual call stack from our logging.
In our case it seems to be SocketFailure on EVAL. I have verified that StackExchange.Redis reconnects. I also verified that killing the client connection from Redis itself fixes the issue, although this is not acceptable.
StackExchange.Redis.RedisConnectionException: SocketFailure on EVAL
at Microsoft.AspNet.SignalR.Messaging.ScaleoutStream.Send(Func`2 send, Object state)
at Microsoft.AspNet.SignalR.Infrastructure.Connection.Send(ConnectionMessage message)
at Microsoft.AspNet.SignalR.Transports.TransportConnectionExtensions.SendCommand(ITransportConnection connection, String connectionId, CommandType commandType)
at Microsoft.AspNet.SignalR.Transports.ForeverTransport.<>c__DisplayClass1f.<ProcessReceiveRequest>b__1c()
at Microsoft.AspNet.SignalR.Transports.ForeverTransport.ProcessMessages(ITransportConnection connection, Func`1 initialize)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Owin.Mapping.MapMiddleware.<Invoke>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Owin.Security.Infrastructure.AuthenticationMiddleware`1.<Invoke>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Owin.Security.Infrastructure.AuthenticationMiddleware`1.<Invoke>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContextStage.<RunApp>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Owin.Security.Infrastructure.AuthenticationMiddleware`1.<Invoke>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContextStage.<RunApp>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContext.<DoFinalWork>d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContext.EndFinalWork(IAsyncResult ar)
at System.Web.HttpApplication.AsyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
I've also faced this issue over the last few days. I'm running my website on AWS and I'm using an ElastiCache node to store session data and as a backplane for SignalR. I found that creating a whole new ElastiCache node solved the issue. Obviously this isn't ideal, though.
I have faced a few instances where ElastiCache nodes have become unusable and are not remedied by a simple reboot of the node.
Update: I looked a few times at a server with the problem and can see some more information.
I am also using StackExchange.Redis outside of SignalR, for our cache and more. The same errors were received from other uses of the same Redis server/connections. Maybe the problem started with a short network disconnect.
IISRESET (or Recycle) "solves" the problem.
I think the main problem may be StackExchange.Redis itself, not something specific to SignalR.
Seems related: https://github.com/SignalR/SignalR/issues/3653
Maybe we are hitting this: https://github.com/StackExchange/StackExchange.Redis/issues/38. While the issue is not closed, there were several fixes in this area in StackExchange.Redis, which we will get after we update to a newer version of StackExchange.Redis.
/cc @anurse
I've identified a place where we are not properly resetting the _error variable and will have a fix out soon. I can't be 100% sure it will solve all cases where this happens since I haven't been able to repro the specific cases described in here.
The scenario I've been able to fix is this:
I've managed to figure out why we don't detect the connection failure and add some code to fix that. I've also added additional tracing to help identify these issues in the future.