Orleans: Orleans on Kubernetes

Created on 21 Nov 2017 · 25 Comments · Source: dotnet/orleans

I could not find a good existing issue that matches, although there might be some overlap with other issues.

Some weeks ago I started to migrate some parts of my application to Orleans, and yesterday I deployed it to a test stage in Kubernetes. At the beginning it worked well. I created a setup with one node and added several nodes later by increasing the replicas. The new nodes joined the cluster. When I scaled down, it became weird:

  1. After several rounds of scaling up and down I started to get multiple instances of a grain. The grain has a reminder to stay alive. I have no idea when exactly this happened, but it took a long time to recover from that state. Can you explain to me what happened here and how to prevent it?

  2. I scaled down my deployment to 0 nodes and then scaled up to one node. My membership table (MongoDB) was now full of entries with a lot of old IPs (the new nodes also got new IPs). The new node tried to contact some of them and failed. I am co-hosting ASP.NET Core in the same process: I start the silo first and then ASP.NET Core [1] (a minimal sketch of that startup order follows this list). I guess the silo tries to join the cluster first, and it takes time for it to be sure that the cluster is not alive and that it should start anyway. Why are nodes not removed from the membership table when they leave the cluster?
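
For context, here is a minimal sketch of that silo-first startup order, assuming the Orleans 1.5 SiloHost API and ASP.NET Core's WebHostBuilder. The names are illustrative (the real code is in the linked Program.cs [1]), and Startup stands for a hypothetical ASP.NET Core startup class:

    using System.Net;
    using Microsoft.AspNetCore.Hosting;
    using Orleans.Runtime.Host;

    public static class Program
    {
        public static void Main(string[] args)
        {
            // Start the silo first; StartOrleansSilo returns once the silo
            // has joined (or failed to join) the cluster.
            var silo = new SiloHost(Dns.GetHostName());
            silo.InitializeOrleansSilo();
            silo.StartOrleansSilo();

            // Then run ASP.NET Core in the same process; Run() blocks.
            new WebHostBuilder()
                .UseKestrel()
                .UseStartup<Startup>()
                .Build()
                .Run();

            // Stop the silo gracefully on the way out so its membership
            // entry is marked dead rather than lingering as Active.
            silo.StopOrleansSilo();
            silo.Dispose();
        }
    }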

I guess it would be better to deploy Orleans as a StatefulSet and use the (static) host names for communication. Do you have experience with Kubernetes and best practices?

EDIT:

I found the exception:

Failed to get ping responses from all 2 silos that are currently listed as Active in the Membership table. Newly joining silos validate connectivity with all pre-existing silos that are listed as Active in the table and have written I Am Alive in the table in the last 00:10:00 period, before they are allowed to join the cluster. Active silos are: [[SiloAddress=S10.0.5.49:33333:248949710 SiloName=Silo_696f1 Status=Active HostName=squidex-orleans-1 ProxyPort=40000 RoleName=Squidex UpdateZone=0 FaultZone=0 StartTime = 2017-11-21 08:41:50.878 GMT IAmAliveTime = 2017-11-21 08:41:54.616 GMT Suspecters = [] SuspectTimes = []], [SiloAddress=S10.0.4.22:33333:248949708 SiloName=Silo_d4d2b Status=Active HostName=squidex-orleans-0 ProxyPort=40000 RoleName=Squidex UpdateZone=0 FaultZone=0 StartTime = 2017-11-21 08:41:49.108 GMT IAmAliveTime = 2017-11-21 08:41:59.684 GMT Suspecters = [] SuspectTimes = []]]

[1] https://github.com/Squidex/squidex/blob/orleans/src/Squidex/Program.cs

All 25 comments

I made several experiments, also on my Windows machine with very minimal examples, and I have never seen a really graceful shutdown, meaning one where the entry for the silo is updated in the membership table and marked as dead. This is a problem in any Docker environment, because it then takes at least 30 seconds for a new container to join the cluster. The only solution would be not to use rolling updates and to change the deployment ID with every release. But this sucks.
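
For reference, the per-release deployment ID workaround mentioned above would look roughly like this with the 1.5-era ClusterConfiguration API (the RELEASE_ID environment variable is made up for illustration):

    using System;
    using Orleans.Runtime.Configuration;

    // Give every release its own deployment id so new silos never try to
    // contact membership entries left over from the previous release.
    // The downside: no rolling updates across releases.
    var config = new ClusterConfiguration();
    config.Globals.DeploymentId = Environment.GetEnvironmentVariable("RELEASE_ID");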

And I have never seen a really graceful shutdown, meaning one where the entry for the silo is updated in the membership table and marked as dead

Yes, I was able to reproduce this issue using Docker and Docker Swarm: #3621. The content of that issue is outdated, but the end result is the same: no graceful shutdown when using containers...

Are you using Windows or Linux containers? If you are using Linux, FastKill should be working

I am using Linux containers. But I have seen another issue. I migrated the Mongo membership provider and tested it again with a very simple example:
https://github.com/OrleansContrib/Orleans.Providers.MongoDB/blob/vNext/Test/Host/Program.cs

I started the application and closed it by pressing the Enter key, which is indeed a graceful shutdown. But even in this scenario the entry in the membership table is never set to 6 (Dead), 4 (ShuttingDown), or 5 (Stopping). I am not sure whether this is on purpose, but it is strange.
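
For readers wondering what those numbers mean: they map to the Orleans SiloStatus enum, roughly as below (reconstructed from the 1.x source from memory; worth double-checking against your version):

    public enum SiloStatus
    {
        None = 0,
        Created = 1,       // silo created, not yet started
        Joining = 2,       // joining the cluster
        Active = 3,        // functioning member of the cluster
        ShuttingDown = 4,  // graceful shutdown in progress
        Stopping = 5,      // ungraceful stop in progress
        Dead = 6           // silo has left or been declared dead
    }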

I have no idea how to set up a stable production system right now (or in 2.0 RTM), and the impression that the Orleans team has no real experience with running it in Docker is a little bit frightening. I don't want to be the beta customer.

I am using Microsoft.Orleans.Server 1.5.2 and Microsoft.Orleans.OrleansSqlUtils 1.5.2, and it is working for me.

var config = new ClusterConfiguration();
...
config.Globals.LivenessType = GlobalConfiguration.LivenessProviderType.SqlServer;
config.Globals.ReminderServiceType = GlobalConfiguration.ReminderServiceProviderType.SqlServer;
...
m_SiloHost.StartOrleansSilo();

// In shutdown code
m_SiloHost?.StopOrleansSilo();
m_SiloHost?.Dispose();

Calling SELECT * FROM ORLEANS.dbo.OrleansMembershipTable on my production environment is showing lots of Status 6 silos.
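
For completeness, here is one way such shutdown code can actually get invoked inside a Linux container (my sketch, not part of the comment above): docker stop or Kubernetes pod deletion sends SIGTERM, which .NET Core surfaces as AssemblyLoadContext.Default.Unloading, while Ctrl+C raises Console.CancelKeyPress:

    using System;
    using System.Runtime.Loader;
    using System.Threading;

    var shutdownDone = new ManualResetEventSlim();

    void Shutdown()
    {
        // StopOrleansSilo marks this silo's membership entry as dead
        // instead of leaving it Active in the table.
        m_SiloHost?.StopOrleansSilo();
        m_SiloHost?.Dispose();
        shutdownDone.Set();
    }

    // SIGTERM (docker stop, Kubernetes pod deletion) on Linux:
    AssemblyLoadContext.Default.Unloading += context => Shutdown();

    // Ctrl+C in an interactive console:
    Console.CancelKeyPress += (sender, e) =>
    {
        e.Cancel = true; // keep the process alive until shutdown finishes
        Shutdown();
    };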

edit: I suppose it is just a matter of finding the discrepancies between 1.5.2 and 2.0 beta-1, or is it MongoDB vs SQL?

Could be MongoDB, of course. Let's say you scale down to 0 nodes. What statuses would you get?

Could be MongoDB, of course. Let's say you scale down to 0 nodes. What statuses would you get?

With an ungraceful shutdown, the last silos that get stopped (killed) will stay in the table as Active, because they won't mark themselves as Dead and there are no silos left in the cluster to detect their death.

Could be MongoDB, of course. Let's say you scale down to 0 nodes. What statuses would you get?

I don't use Kubernetes in particular. But I don't see how the environment would affect the membership table; a database is a database. I can scale from 0 to 15 to 0 without harm to the cluster. I haven't seen duplicates or any discrepancies in active/zombie silos.

edit: Assuming clean shutdown, all statuses are 6

I only have a couple reminders, but they never seem to have issues.

I also think the problem is not the membership table. It is the shutdown.

When I redeploy my application I get an old entry in my membership table, and nobody clears it up. Then the new container tries to connect to this node and waits 30 seconds. This makes the deployment a little bit annoying and slow.

The best solution would be to have a graceful shutdown. I guess it should also be fine to use host names as silo addresses. Then I could create a StatefulSet where the host names are static, and the new container would immediately detect that the old entry is an old version of itself. But it does not seem to work. I tried to use

clusterConfiguration.Defaults.HostNameOrIPAddress = Dns.GetHostName();

When I redeploy my application I get an old entry in my membership table, and nobody clears it up.

It eventually gets cleared up (in a few minutes).

The best solution would be to have a graceful shutdown.

Definitely. @benjaminpetit is looking at the Docker graceful shutdown issue - #3621.

You are right. It gets cleared up after ~30 seconds. But the deployment takes longer than it has to, 30 seconds per node, which is a little bit annoying with rolling updates.
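
If most of that delay comes from liveness probing, the 1.5-era GlobalConfiguration exposes the relevant knobs; a hedged sketch (property names from memory, worth verifying against your version):

    // Tighter probing gets crashed silos voted dead sooner, at the cost
    // of more false suspicions under heavy load or long GC pauses.
    config.Globals.ProbeTimeout = TimeSpan.FromSeconds(5);
    config.Globals.NumMissedProbesLimit = 2;        // missed probes before a silo is suspected
    config.Globals.NumVotesForDeathDeclaration = 1; // suspecting votes needed to declare death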

@SebastianStehle Did you try it with StatefulSets and maybe also headless services (https://kubernetes.io/docs/concepts/services-networking/service/#headless-services)? That should provide you with unique and routable FQDNs for your nodes.

@christianhuening I tried StatefulSets, but the pods still get random IPs => same problem. It would be solved if Orleans were fine with host names, but it always uses IP addresses for the membership table (I think there is an issue for that).

About headless services: I would need one Service per Pod. That abuses the concept of headless services, I think.

Is there any chance of first-class K8s support here, like an official Helm chart and/or operator for Orleans?

Any update? I would really like an Orleans-supported K8s deployment that covers how to safely scale up/down, etc...

I am going to try that soon, as I need that setup for my PhD. But from what @SebastianStehle wrote, it sounds like some action is required in the code to make this really work with Kubernetes. It sounds a bit like the problems a colleague had with running a larger Erlang deployment in K8s.

I have been working on a sample app on Azure AKS. Scaling up was not a problem, but scaling down did not work well. I believe I found a bug where FastKillOnCancelKeyPress=true does not signal the silo to shut down at all:

https://github.com/dotnet/orleans/issues/3933.

@jms69, that's great. Do you have a repo you can share? Thanks.

This repros in a console app as well. Set FastKillOnCancelKeyPress=true, start the silo, then wait for the silo to stop:

            // Poll until the silo's Stopped task completes.
            while (!silo.Stopped.IsCompleted)
            {
                await Task.Delay(1000);
            }

Hit Ctrl+C in the console window and you will see messages about waiting for the silo to terminate, but it never terminates.

This fix was just merged into master: https://github.com/dotnet/orleans/commit/4d76acb4a526a6c3c6ce6fc5d70eba0464fbbcd4

Keep an eye on https://github.com/OrleansContrib/Orleans.Clustering.Kubernetes

I'm finishing samples and will publish first preview packages to NuGet soon.
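
For anyone who wants to try that package, wiring it up looked roughly like this against the 2.x SiloHostBuilder API. UseKubeMembership is the package's extension method as I remember it from its README, so treat the exact names and options as assumptions:

    using Orleans.Configuration;
    using Orleans.Hosting;

    var silo = new SiloHostBuilder()
        .Configure<ClusterOptions>(options =>
        {
            options.ClusterId = "my-cluster";   // illustrative values
            options.ServiceId = "my-service";
        })
        // Stores membership data in Kubernetes custom resources instead
        // of an external database (Orleans.Clustering.Kubernetes).
        .UseKubeMembership()
        .Build();

    await silo.StartAsync();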

Any news?

Orleans works really well on Kubernetes; we have had it in production for a year or so.

Should we close this now?

What is the conclusion of this issue? How can I gracefully stop the silo when running on Kubernetes?

Keep an eye on https://github.com/OrleansContrib/Orleans.Clustering.Kubernetes

I'm finishing samples and will publish first preview packages to NuGet soon.

@GersonDias I've been using it since even before I released that package, and I know other companies that are using it, so far without issues. Even without that package you can use Orleans with another membership provider, and it should work as long as the membership table records the correct Kubernetes pod IP addresses of the silos and the clients can also reach them.

I'm looking, in the future, to create a Kubernetes controller for Orleans and replace the membership provider to make things easier deployment-wise, but it definitely isn't a requirement to make it work.

I guess this issue can be closed now, as there is not much to do in Orleans itself...

Well... some documentation and sample code with the recommended way to run Orleans on Kubernetes in the official repo/docs would be really nice. Maybe this issue can track that...
