Orleans: -Failed to get ping responses from all 1 silos that are currently listed as Active in the Membership table

Created on 22 Mar 2019 · 5 comments · Source: dotnet/orleans

Orleans.Runtime.MembershipService.MembershipOracleData[100661]
-Failed to get ping responses from all 1 silos that are currently listed as Active in the Membership table. Newly joining silos validate connectivity with all pre-existing silos that are listed as Active in the table and have written I Am Alive in the table in the last 00:10:00 period, before they are allowed to join the cluster. Active silos are: [[SiloAddress=S10.0.75.1:11111:290933742 SiloName=Silo_048a5 Status=Active HostName=LiBo ProxyPort=30000 RoleName= UpdateZone=0 FaultZone=0 StartTime = 2019-03-22 06:55:43.000 GMT IAmAliveTime = 2019-03-22 07:00:51.000 GMT ]]
warn: Orleans.Runtime.Scheduler stg/15/0000000f.WorkItemGroup[101215]
Task [Id=1653, Status=Faulted] in WorkGroup [SystemTarget: S172.19.227.65:11111:290933946
stg/15/0000000f@S0000000f] took elapsed time 0:00:00.2714147 for execution, which is longer than 00:00:00.2000000. Running on thread System.Threading.Thread

All 5 comments

So what is the question here?

A wild guess (I may be wrong - not enough data) - you might be restarting a previously abruptly shut down cluster with the same cluster ID, and some silos are still listed in the table as Active while they are actually dead.

Closing due to inactivity. Feel free to reopen if needed.

cluster ID

That's right. We are working as a team, and everyone was using the same cluster ID while debugging locally. We later changed the setup so that each developer debugs with a separate cluster ID, and now everything works normally. The problem is solved.

you might be restarting a previously abruptly shut down cluster with the same cluster ID, and some silos are still listed in the table as Active while they are actually dead.

We have exactly the same issue in this case.

What can we do about the startup of the very first silo in the cluster? It always seems to fail with a timeout at start.

What can we do about the startup of the very first silo in the cluster? It always seems to fail with a timeout at start.

If this is a dev/test scenario, the recommendation is to generate a unique cluster ID for each run.
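The recommendation above can be sketched as follows. This is a minimal illustration, assuming the Orleans 3.x generic-host APIs (`Microsoft.Orleans.Server` with `Microsoft.Extensions.Hosting`); the service name `MyService` is a placeholder, and localhost clustering stands in for whatever membership provider you actually use:

```csharp
// Sketch: generate a fresh ClusterId per debug run so a new silo never
// tries to ping stale "Active" rows left behind by abruptly killed silos.
using System;
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(siloBuilder =>
    {
        siloBuilder
            .UseLocalhostClustering()
            .Configure<ClusterOptions>(options =>
            {
                // Unique per run: each developer/debug session gets its
                // own logical cluster in the membership table.
                options.ClusterId = $"dev-{Guid.NewGuid():N}";
                options.ServiceId = "MyService"; // placeholder name
            });
    })
    .Build();

await host.RunAsync();
```

Keeping `ServiceId` stable while varying `ClusterId` preserves grain persistence across runs while isolating each run's membership, which is why only the cluster ID is randomized here.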

