I need to ensure that each task/replica of my Docker service runs on a separate node for HA (so that if one server fails or is shut down, it does not take my service offline). Currently, I am running 3 separate services with constraints allocating each service to a different node, but I'd really like to structure my image so that it can run as a single service with 3 replicas.
I can think of several ways to support such a feature, but the most general, straightforward way might be to simply add an option called --max-replicas-per-node=<n> to the docker service create command, where <n> is a positive integer. So --max-replicas-per-node=1 would ensure that the swarm scheduler places at most 1 replica on any node.
https://docs.docker.com/engine/reference/commandline/service_create/
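For concreteness, here is a sketch of how the proposed flag might look on the CLI. The flag name and syntax are part of this proposal, not an existing option at the time of writing (much later, Docker 19.03 shipped the same idea under the name --replicas-max-per-node):

```shell
# Sketch of the proposed flag (hypothetical syntax, not yet implemented):
docker service create \
  --name mysql \
  --replicas 3 \
  --max-replicas-per-node=1 \
  mysql:5.6

# Docker 19.03+ eventually implemented the equivalent as:
#   docker service create --replicas 3 --replicas-max-per-node 1 ...
```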
As a workaround, I think you can use "global" service instead of --max-replicas-per-node=1
Using --mode=global would be awkward, or does this mode also support --replicas? I have been thinking that --replicas applies only to --mode=replicated.
Also, can I use --mode=global --constraint= to control how many nodes I want running only 1 replica?
I've assumed that --mode=global is intended to run 1 task on all nodes in the cluster and that you couldn't control the number of replicas or constrain the scheduler.
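To make the suggested workaround concrete: a global service combined with a node-label constraint runs exactly one task on each matching node, so you control the replica count by controlling which nodes carry the label. A sketch (service name and label are illustrative):

```shell
# Label only the nodes that should run the service:
docker node update --label-add db_accepted=yes worker1

# A global service places one task on every node matching the constraint:
docker service create \
  --name db \
  --mode global \
  --constraint node.labels.db_accepted==yes \
  mysql:5.6
```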
@aluzzardi WDYT ?
Suppose we have --max-replicas-per-node=1 for a single service ourservice in a 2 node swarm
If we then were to do docker scale ourservice=3, what would the behavior be?
It's more like we want to say "distribute this service across nodes as evenly as possible" rather than specify a maximum value.
What happens if you apply a constraint to a service and there are no nodes available to deploy all replicas of the service?
I would think the replicas that currently have no available node would just fail to be scheduled on the cluster until a slot becomes available.
In this way, --max-replicas-per-node is just another scheduling constraint, the same as setting memory requirements for each replica. If no nodes are available to run the service at the requested scale, the unscheduled replicas should simply remain pending until a slot becomes available. Alternatively, the scheduler could issue a warning about insufficient cluster resources to scale fully at this time.
BTW, I don't want to just specify "distribute this service evenly across nodes" since I will be specifying other constraints in addition to --max-replicas-per-node in a production cluster (e.g., run only on the nodes with fast NVMe SSD storage).
/cc @aaronlehmann
So we actually have an "HA scheduler" coming up that will attempt to spread your replicas across different machines (see https://github.com/docker/swarmkit/issues/308).
That will be the default scheduling strategy in 1.13.
Long story short, if you scale a service to 3 replicas, all of them will end up on different machines if possible.
If you don't have enough machines, then new replicas will go on the node with the least replicas already running.
Would that solve your problem?
You should be able to test on a master binary soon enough
Would that solve your problem?
Not sure. I really need to guarantee that at most 1 replica is running on any one node. Not so much for cluster startup, but what happens if one of the nodes fails?
I am wanting to deploy a MySQL cluster of 3 replicas where the first replica configures itself as the RW master with the second and third replica being slaves of the master. I want to run a ProxySQL instance sitting in front of this MySQL service so the master gets all updates and the slaves can handle some of the read-only queries. I will require a monitor watching the service to promote one of the slaves to become the new master if the master node fails (for any reason).
I don't think your "HA scheduler" will be that useful for my MySQL cluster. First, I will have to apply a label to 3 nodes so that this scheduler only deploys my service to those 3 nodes. If one of the nodes fails (maybe just for a few minutes or an hour), I think your "HA scheduler" would attempt to create another replica of my service on one of the 2 remaining nodes. Each node will be provisioned for only 1 copy of the database, so running two replicas on one node wouldn't work for me.
I want the "HA scheduler" to wait patiently if there are fewer nodes available than requested replicas, and not start 2 replicas on one node. If a node fails, I want to manually provision another node (with a current snapshot of the cluster database) and only label it when I am ready for the "HA scheduler" to spin up a replacement replica on the new node. How the new replica "joins" my cluster is probably something I will have to do with an external script.
Maybe Docker services just aren't flexible enough for me to deploy an HA service with persistent local server storage and make all of this work with replication, etc. High availability is not just about service creation (picking where the initial replicas run), but also about service update (scaling the number of replicas up/down) and node/replica failure.
So, I think my original proposal is still needed. That is, allow the service creator to specify the maximum number of replicas per node and to provide a constraint on which nodes are allowed to run the service. If there are currently too few properly labelled nodes to deploy all the requested replicas, the scheduler will have to wait until more nodes are labelled before reaching the desired state.
Maybe rather than adding --max-replicas-per-node, you come up with a --constraint syntax that would allow "max replicas per node" to be one of the scheduling constraints just like node labels may be used as a constraint?
+1 here. I want to have swarm run three replicas of a service per node, but no more than three per node, due to resource utilization/performance.
Linking https://github.com/docker/docker/issues/24649, which is related / similar
Also related: https://github.com/docker/docker/issues/24115
@shane-axiom This would be achieved by setting the proper resource constraints, would it not?
Limiting a service to only have N number per node because of _resources_ seems backwards. It would be better to say "This service requires X RAM, Y CPU, Z DISK" and let the scheduler decide what is available resource wise.
...
Not to say there is no benefit at all of allowing at most N per node, but resources doesn't seem right.
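For reference, the resource-based approach described here maps onto the existing reservation/limit flags on docker service create; the values and image name below are illustrative:

```shell
# Declare what the service needs and let the scheduler place replicas
# wherever those reservations fit:
docker service create \
  --name api \
  --replicas 4 \
  --reserve-cpu 1 \
  --reserve-memory 512M \
  --limit-cpu 2 \
  --limit-memory 1G \
  myorg/api:latest
```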
@ktwalrus Scheduling of services like databases (or, I suppose, self-clustering services) will, I think, need special-casing. I don't think it would work with a "no more than N on a node" rule, nor with the HA scheduler that @aluzzardi mentioned, without specific tweaks for these kinds of services.
Scheduling of services like databases (or I suppose self-clustering services) I think will need some special case.
Yes. The current service scheduler needs more enhancements to support a single replicated MySQL database service. But the first enhancement needed, in my mind, is that all replicas run on separate nodes (for HA). I'm not sure how the service definition should evolve to support this use case, but one possible direction is to allow the specification of a user-defined proxy that is used to "load balance" the database service. Currently, a deployed proxy has no way that I know of to connect to individual replicas of the service (only the service name is exposed for the proxy to use).
In the meantime, database services will probably need to be deployed as multiple single-replica services, each constrained to a particular node, with a separate service definition for each instance in the replicated cluster. Maybe not defining services at all for persistent replicated local data, and falling back to unmanaged containers (using docker run), is the way to go. Unfortunately, in Docker 1.12 unmanaged containers cannot attach to the swarm overlay networks that the database proxy service would use to connect to the database containers. As I understand it, this is fixed in 1.13, so running the database in unmanaged containers will probably be the way to go until docker service is enhanced for this use case.
Related to high availability scheduling: https://github.com/docker/swarmkit/pull/1446
This changes the scheduler to run replicas on separate nodes whenever possible.
@shane-axiom This would be achieved by setting the proper resource constraints, would it not?
Limiting a service to only have N number per node because of resources seems backwards. It would be better to say "This service requires X RAM, Y CPU, Z DISK" and let the scheduler decide what is available resource wise.
...
Not to say there is no benefit at all of allowing at most N per node, but resources doesn't seem right
@cpuguy83 Thanks, just getting started with swarm so that's helpful. As you said I think there's still a place for saying no-more-than-x of a service per node, but my use case can be solved otherwise.
On the service create docs I see --reserve-cpu and --reserve-memory but no --reserve-disk. Any suggestions on how to handle disk space reservation? I don't see an obvious answer using constraints/labels.
I also see an engine containerslots label/filter in the standalone swarm docs. Does anyone know if this works/is recommended in the integrated swarm?
@shane-axiom can you open a new issue for that (disk reservation)?
Added new issue for --reserve-disk here: https://github.com/docker/docker/issues/27555
Trying to find a way to ensure my db containers do not run on the same node.
My cluster has 3 managers and 6 workers:
```
Pikwi > ENV = M01-TOR1
Pikwi > ENV = M02-SFO2
Pikwi > ENV = M03-AMS3
Pikwi > ENV = W11-TOR1
Pikwi > ENV = W12-SFO2
Pikwi > ENV = W13-AMS3
Pikwi > ENV = W14-TOR1
Pikwi > ENV = W15-SFO2
Pikwi > ENV = W16-AMS3
```
I want to have 3 db running on my three workers.
Workers W11, W12, and W13 are labelled with 'labels.db_accepted==yes'.
The logic is that W14, W15, and W16 won't accept the DB container.
When I run:
```shell
docker service create \
  --name mysql \
  --network db-net \
  --mode global \
  --constraint node.role!=manager \
  --constraint node.labels.db_accepted==yes \
  --restart-condition any \
  -p 3306:3306 \
  -e MYSQL_ROOT_PASSWORD=somepass1 \
  -e DISCOVERY_SERVICE=10.0.3.2:2379 \
  -e XTRABACKUP_PASSWORD=somepass2 \
  -e CLUSTER_NAME=galera \
  perconalab/percona-xtradb-cluster:5.6
```
It's «kind» of working:
```
docker service ps mysql

ID                         NAME      IMAGE                                  NODE      DESIRED STATE  CURRENT STATE            ERROR
0prfrm2v5y1xnf5ddpqclbfyk  mysql     perconalab/percona-xtradb-cluster:5.6  W13-AMS3  Running        Running 6 minutes ago
7wj3kkjrhu6pwtjbkamkow0jm  \_ mysql  perconalab/percona-xtradb-cluster:5.6  W12-SFO2  Running        Running 6 minutes ago
1h5lawh8lsgheeo17ou2p7n7j  \_ mysql  perconalab/percona-xtradb-cluster:5.6  W11-TOR1  Running        Running 6 minutes ago
d0jum8lt9kyo7rj1lulp6p66z  \_ mysql  perconalab/percona-xtradb-cluster:5.6  M03-AMS3  Running        Allocated 6 minutes ago
6ml7xd5lnl69llwv3byia938t  \_ mysql  perconalab/percona-xtradb-cluster:5.6  M02-SFO2  Running        Allocated 6 minutes ago
dsj2z13ykilwwpzwc2wj09qh3  \_ mysql  perconalab/percona-xtradb-cluster:5.6  M01-TOR1  Running        Allocated 6 minutes ago
```
The weird thing here is that Swarm tries to allocate the db to the M01, M02, and M03 nodes even though --constraint node.role!=manager AND --constraint node.labels.db_accepted==yes should prevent Swarm from doing so.
My question: why does Swarm try to allocate tasks on my manager nodes despite the constraints I set?
IMHO, it would rock to be able to define
--mode global WHERE node.labels.db_accepted==yes
or simply
--mode node.labels.db_accepted==yes
or
--mode limited
where the mode would respect constraints and affinity.
@ktwalrus Maybe I can connect with you about https://github.com/docker/docker/issues/26259#issuecomment-246850441. I have the exact same challenge, and I want a production-ready MySQL master/master HA setup. I'm almost there :)
twitter > @_pascalandy
or if you prefer > pascal _hat_ pascalandy _doot_ com
Cheers!
Hey @pascalandy - a lot of improvements on the scheduler were made in 1.13, could you give 1.13.0-rc1 a try?
For instance, global services in 1.13 do not create tasks anymore if the constraints are not met, which solves the feature request you were proposing.
Also, 1.13 has a new scheduling algorithm and will avoid placing more than one container per service on the same machine if possible.
So if you have 3 machines and scale MySQL to 2, each container will run on a different machine. Scale to 3 and each machine will run one container. Scale to 4 and, at that point, 2 machines will be running one and 1 machine will be running two (because there's no way to avoid that). Think of it as a "scheduling preference".
Would love to hear your feedback on 1.13!
/cc @aaronlehmann
Hi @aluzzardi. I've been running 1.13 for about 10 days now.
1) I'd be glad to give 1.13.0-rc1 a try. Please send me a link that explains how to upgrade to these RC / experimental releases; I have no idea :-p
So far I build my servers from a clean Ubuntu image and install everything from scratch via bash script(s).
2)
Scale to 3 and each machine will run one container.
The case - This is great in many cases, but there is a major concern here. Let's say I have 3 machines running docker service scale mysql=3. Swarm deploys one instance per machine and life is good.
The challenge - When 1 node goes down, the scheduler will deploy the third mysql instance on one of the two remaining nodes. This is bad for a database container! It could break the setup and introduce inconsistencies in the mysql cluster.
That's why a few folks believe we need a new flag. I see 2 ways of doing it:

```shell
--replicas 3 \
--max-instance-per-node 1 \
--constraint "node.role==worker" \
--constraint "node.labels.db_accepted==yes" \
```

or use the existing affinity with a negative condition:

```shell
--replicas 3 \
-e affinity:container!=~mysql* \
--constraint "node.role==worker" \
--constraint "node.labels.db_accepted==yes" \
```
Cheers!
Pascal
Rather than a "--max-replicas-per-node" constraint, it would actually be more practical to just have a "--replicas-per-node" constraint that, like "--replicas", is systematically treated as a "minimum" value... which is especially important in light of the following:
If "--replicas-per-node" is used in tandem with "--replicas", then a "scalable floor value" has been defined... and even if you never use it, flexibility is still good. Any new node(s) introduced to the system would be forced to respect the value as the HA and LB processes occur.
Consider what the core practical requirements for the intended purpose of the proposed constraint would be.
That's not too long of a list for what it achieves... but then again, I'm not the person who would actually have to go modify the Docker Swarm code-base to make it a reality either.
I'm here from https://github.com/docker/docker/issues/28787 to describe the use case I have for soft affinity (anti-affinity is not interesting for me).
In our situation we have groups of services which work together (think independent test environments). Each group consists of about 6-10 services which communicate among each other, but not between groups; each group has its own overlay network. As an optimisation, I'd like services that talk a lot with each other to be scheduled on the same node, but obviously things won't break if they aren't. There are some shared services, but I'm not concerned about those.
What I was looking at in the classic swarm was label affinity, so I could label each group and make each container have affinity with its own group.
Reading some of the earlier comments I understand the issue with respect to scheduling and asymmetric constraints, but it seems to me that this kind of affinity is a property of the swarm, not of the services; adding affinity to each service is the wrong approach. The structure I want is already declared in the overlay networks. If I could just say "prefer to schedule services on the same networks to the same nodes", I'd be happy.
Networks with many members could attract less than networks with few members. I say this because of the current limitation that you can't add/remove networks from existing services: every service that needs to talk to a shared service must also be on a single big overlay network, because you can't add a new overlay network to a shared service on the fly. This big network should not affect affinity. So perhaps it is a property of the network?
Hope this helps.
As a workaround, in Docker 1.13, you can define a dummy host port (13999 in this example); this will effectively limit the service to one replica per node:

```shell
docker service create \
  --name x \
  --replicas 2 \
  -p mode=host,published=13999,target=13999 \
  alpine:3.4 sleep 3000
```
This rocks @sundryp !! A little hacky but perfectly viable. You're smart 👍. Now I want to see native round robin between those containers :)
I'm still curious to know what the official way to manage this will be. I don't wanna mess with ports when there are hundreds of containers at stake :-p
So, I've also had the same use case as the one mentioned by @ktwalrus and @pascalandy.
I've tested the usage of a "global" deploy with per-node labels, as suggested by @aluzzardi, and it seems to work nicely in Docker 1.13.
In docker-compose:

```yaml
deploy:
  mode: global
  placement:
    constraints:
      - node.labels.db == true
```

Then:

```shell
docker node update --label-add db=true NODEID
```
So that should solve the use-case, unless I'm missing something.
I totally understand the reluctance (if any) to add such a --max-replicas-per-node flag. It'd add significant complexity both for the scheduler implementation and for users configuring services. You don't want to add flag after flag until you find yourself lost in a deep SAT problem :)
It would be more than just useful to have --max-replicas-per-node or --replicas-per-node flags for services.
In my case I'm running an HDFS cluster on swarm together with more microservices. I have several nodes in the swarm, and I want to run datanodes on only 2 of them, one on each. Initially it works, but sometimes, if a datanode fails for some reason, swarm runs a new container instead, and sometimes it ends up running 2 datanodes on the same swarm node, which is unacceptable for the application.
The workaround is to update the service with replicas=0 and then back to 2... As you can see, it's quite critical to have such an option.
When is this feature planned to be added?
@liyaka Have you seen the suggested workaround? For me this is an acceptable solution until there is a real one.
Using global with constraint worked fine for me
I was looking forward to migrating from Docker Swarm to Docker Swarm Mode, but without container anti-affinity it generates a lot of overhead for scaling and spreading containers across the cluster.
I need to spread containers so that, if a host becomes unavailable (crashed), there are still some containers on the other hosts to serve clients.
I can't use the workaround with node label constraints; it's too much work to maintain the labels if you already have ~100 different stacks. Allocating an extra dummy static port won't work if you have more container instances of a service than Docker hosts in the cluster.
If I use global mode, then I can't have more instances than hosts, and usually I don't want to depend on the size of my Swarm cluster.
I hope there will be a feature to spread the container instances of a stack across multiple hosts.
@rreinurm With swarm mode, containers have an implicit anti-affinity with themselves. It will put every instance of your service in different machines, if that's possible.
Would that work for you?
I don't think it covers this scenario:
The case - This is great in many cases, but there is a major concern here. Let's say I have 3 machines running docker service scale mysql=3. Swarm deploys one instance per machine and life is good.
The challenge - When 1 node goes down, the scheduler will deploy the third mysql instance on one of the two remaining nodes. This is bad for a database container! It could break the setup and introduce inconsistencies in the mysql cluster.
Here are my (personal) thoughts about that - I might be wrong but that's how I see it:
If you have more than 3 machines, then it won't go onto the same machine as another mysql.
If you only have 3 machines, and you're using proper resource limits / reservations, then the 2 mysqls running on the same machine won't step on each other.
I don't think two mysqls running on the same machine are a problem if properly contained.
In a cloud environment, where your actual machine is a VM, chances are you're not alone on that physical CPU/memory and you're probably running alongside another mysql anyway.
Resource limitation is key here.
WDYT?
@pascalandy What's the problem of using "global deploy" + "node constraints" as mentioned above?
@aluzzardi
I don't think two mysqls running on the same machine are a problem if properly contained.
I think it's not so good. Here is another example: I launch an HDFS cluster with 10 DataNodes. In my opinion, it's convenient to store stateful datanode data on the host (bind-mounted volume). But, of course, it's unacceptable to run 2 or more DataNode containers on the same host.
The problem with using "global deploy" + "node constraints" is that it doesn't enable us to have a failover system without specifying up front which nodes it can run on.
I want at least one instance of my service running at all times, even if I take down all but one of my nodes.
i.e. I want to run three nodes and two instances of each service.
In normal use I don't care how the services are spread over the nodes, but if I take out two nodes I don't want to lose any services.
And the problem with running two instances of my service on one node is twofold: it doesn't achieve failover, and it doesn't scale.
The host port suggestion should work, but would require me to manually allocate ports, which I don't want to do.
For my purposes, and I think most others', we don't actually need max-replicas-per-node to accept any value other than 1, so perhaps a permit-multiple-replicas-per-node flag, taking priority over the desired number of replicas, would be easier to implement and would satisfy the requirement.
@taliaga This is what I do, but it's not clean. The scheduler seems to have an issue deploying this service.
What's the problem of using "global deploy" + "node constraints" as mentioned above?
And the problem with running two instances of my service on one node is twofold:
It doesn't achieve failover and it doesn't scale, so it's pointless.
@Yaytay Wondering: do you define CPU limits (and/or reservations) on your services, or just let them use as much as is available? Without those, it would already be hard to translate "number of replicas" into actual scale (i.e., 2 replicas running on two nodes that each have 10% CPU available, versus 2 instances on a single node that has 100% CPU available for those two instances).
@thaJeztah That's a good point. I don't have (or need) CPU limitations, and with them, having two instances on a single node is not pointless (though it would be more efficient to increase the CPU limit than to have two instances on one node).
@sundryp I've just found an unpleasant side-effect of your solution: it keeps trying to start the service a second time on a node and failing, then doing it again, and again, and again.
This is from looking at the event REST endpoint, which gets a continuous stream of create/destroy messages.
Which is a shame, because it means we really don't have a way to say aim for 2 instances total and no more than 1 instance on a single node :(
Would all constraints fire events this way?
It makes it impossible for my current monitoring code to identify that the swarm is stable.
Continuing on https://github.com/docker/docker/issues/26259#issuecomment-280661652 ... this is why it's not clean. Well said @Yaytay:
This is from looking at the event REST endpoint, which gets a continuous stream of create/destroy messages.
Then all your logs and monitoring systems buzz because they think there's an issue.
Reading through this as a fresh swarm user, I started thinking of custom resources with a number of units defined per node and per task. Call them shares, maybe? Just define them like labels, on the fly.
To solve the mysql problem of 3 replicas, max 1 per node, and suddenly having only 2 nodes? Define each node to have 1 accept_mysql resource share and the mysql service to have a requirement of 1 accept_mysql.
This would (a) ensure that each node only ever has a maximum of 1 mysql, and (b) when the swarm is down to 2 nodes, the service would end up in a degraded state with 2/3 replicas up but unable to scale up to the specified level.
Other problems to solve through this:
In terms of implementation, it could be done for example through extending labels (a node label with no scalar amount = infinite shares available, a task constraint with no scalar = 1 share required).
Or by defining yet another resource type in parallel with cpu, memory, disk.. maybe even fold the current resource constraints into the same generalized system.
You could also show resources across the swarm in a single view and include the custom ones, showing that you have for example 2 accept_mysql shares available (not reserved by tasks) which tells you right away that you can lose 2 nodes without impacting the mysql replication level.
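As an aside, later Docker releases added "generic resources", which resemble this shares idea: a node advertises a countable resource and each task reserves units of it. A sketch under that assumption (the accept_mysql resource name is illustrative, and exact syntax may differ between versions):

```shell
# On each node that may host a mysql task, advertise one unit of a custom
# resource in /etc/docker/daemon.json, then restart the daemon:
#   { "node-generic-resources": ["accept_mysql=1"] }

# A service whose tasks each reserve one unit can then never be scheduled
# twice on the same node; surplus replicas stay pending:
docker service create \
  --name mysql \
  --replicas 3 \
  --generic-resource "accept_mysql=1" \
  mysql:5.6
```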
How do I redistribute a service in replicated mode after a node joins the swarm cluster?
I would try something like:
docker service scale nginx=3
Yes, it worked.
I am sorry, but @sundryp's workaround will not work with the routing mesh, and labelling will not ensure that at least a certain number of tasks are scheduled on different nodes. For better resilience, in case of maintenance of a host for example, I would like to be sure that some of the tasks belonging to a service stay alive so users can "continue working" (of course, the application should handle this). So, after reading all the posts, I still don't see how to ensure that a replicated service has its tasks scheduled on different hosts. Am I missing something? This situation reminds me of something I asked some time ago about having a fixed number of tasks for a service, avoiding any type of scaling.
As of 1.13(?), the swarm scheduler will spread a service across as many nodes as possible, preferring HA of the service above all else (given the nodes match the constraints).
There is currently no way to say "run at most N instances on a node", but the scheduler will do its best to run as few on a particular node as possible.
I didn't know that: "the scheduler will do its best to run as few on a particular node as possible". That's good for me in this situation, and I could re-schedule if needed. Cool, thanks @cpuguy83.
Yes, HA scheduling should be part of Docker 1.13 (see https://github.com/docker/swarmkit/pull/1446, https://github.com/docker/swarmkit/issues/308). It looks like it wasn't mentioned in the changelog though 😅
Is this still on the list of planned features? I personally would love to be able to constrain how many containers run on each node.
My usage example: I have 32 API containers distributed across 4 hosts. If one of the hosts dies, it adds 8 more containers to the 3 remaining hosts, which is a pretty bad performance hit in my case, worse than just using the 24 containers that were left. I want to be able to restrict nodes to 8 containers max, even if that means that some of the containers cannot be started.
@queicherius have you thought about other ways to structure it so your scenario fits within current swarm features? I have no idea of your infrastructure, or if this is 32 containers of a single service, but you could potentially:
Use global mode services, so they'll never run more than one task of each service on a node. Each new node that fits the placement constraint would get a new task of that service.
Use global mode above, but launch multiple, identical services (since global's limitation is one task per node per service). That way you could run X of the same image on each node, with service names like api-1, api-2.
Set up monitoring to kick off actions that change the service replicas based on node count. Use Prometheus and Alertmanager to monitor for node up/down events coming out of the swarm managers and adjust service replicas based on those events. You could do this with other tools, like AWS ASGs and Lambda events for failing nodes, but I don't know your infra.
@BretFisher , good options, but they are workarounds, not solutions.
For example, I have an Elasticsearch cluster and need 4 nodes that manage data. My swarm is 12 servers; I don't want 12 data nodes, so global is not the solution. But I also don't want the data nodes to run on the same host, because in that case I have 3 problems: it's useless, I consume more disk space than expected and may run out of space, and if the host goes down I lose my indexes because I lose too many shards at the same time (and this applies to many HA systems).
The workaround is to expose a host port that points to nothing, but workarounds are bad in general, and this one is bad also because you can only enforce 1 per host, not X per host.
A --max-replica-per-node option is the only correct solution; limiting resources is important but does not provide the same HA-oriented functionality.
The next step would be the same thing per label, to allow running a maximum of X containers per datacenter, for example.
How about using a combination of placement.preferences with spread: node.hostname, combined with e.g. placement.constraints? I haven't tested it, but it might give you some pseudo X-per-host behavior.
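The suggested combination would look roughly like this in a compose v3 deploy block (an untested sketch; the label is illustrative, and spread is a preference, not a hard guarantee):

```yaml
deploy:
  replicas: 3
  placement:
    constraints:
      - node.labels.db_accepted == yes
    preferences:
      - spread: node.hostname
```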
In this case it's not a preference, it's a constraint. I don't want this service to run twice on the same host, at any price; better not to run all the replicas than to run two on the same host.
I don't know of any constraint that allows you to do that. If you're speaking about the constraints I know of (memory, CPU), the hack is to reserve more than you need, e.g. available_cpu/2+1.
It's a hack, and you prevent other services from running on that machine because you "over" reserved resources you don't need. We can find many hacks (I already have the one with host networking and a random port that works), but these are just bad solutions to a legitimate problem.
@dcharbonnier But that is already possible. Simply create a label constraint and make the service global. Then you'll have one instance of the service running on each node that matches the label constraint.
Yes, that is something you can do, but it has nothing to do with the subject and does not provide a solution to the problem.
An example of this feature request is Nomad's distinct_hosts constraint. Like Swarm, by default the Nomad scheduler will attempt to spread replicas across the cluster. With this constraint enabled, if there are more replicas than available nodes, it will not schedule more than one container per node (I have tested this). This is important to prevent a cascading failure when a node goes down, particularly in a small cluster.
I would also find a --max-replicas-per-node option very useful. I'm using Docker Swarm Mode to distribute CPU-, memory-, and disk-I/O-intensive tasks across many nodes. For optimal performance, the task replicas should be distributed in a certain ratio (for example, on each node there should be 1 replica of the 1st task, 3 replicas of the 2nd, 4 replicas of the 3rd, etc.). If a node drops from the swarm, I don't want its services to be reallocated to the other nodes: if this happens, performance will be negatively impacted. I will probably try to use CPU/memory allocations to prevent this from happening, but that seems like a suboptimal solution, because it's an indirect way of solving the problem.
In my case, it would also work to use global and replicas together. Unfortunately, "replicas can only be used with replicated mode".
@peterstory,
If a node drops from the swarm, I don't want its services to be reallocated to the other nodes: if this happens, performance will be negatively impacted. I will probably try to use CPU/memory allocations to prevent this from happening, but that seems like a suboptimal solution, because its an indirect way of solving the problem.
If performance is negatively impacted, it sounds like you _do_ need to set CPU/memory reservations anyway. It sounds to me like a _more_ direct way of solving the problem... The problem isn't really "services can't coexist on the same hardware" but more "services require certain amounts of reserved resources".
Unfortunately reserving/limiting I/O is not totally possible with Swarm Mode yet. I thought the reasons for that were discussed in depth in some issue, but all I can find right now is https://github.com/docker/swarmkit/issues/211
It sounds to me like a more direct way of solving the problem... The problem isn't really "services can't coexist on the same hardware" but more "services require certain amounts of reserved resources".
Unfortunately, due to the specialized nature of the task, it is important to have the same services running on each worker node: each task depends on the previous task, so if a certain task in the pipeline is the rate-limiting step, the other tasks will only complete as quickly as that step. This means that if the rate-limiting step is over-provisioned on a certain set of nodes, the other nodes will be mostly idle. I experimented with using reserved resources today, and I observed a notable drop in throughput. I may have to switch to Kubernetes, or find some kind of hack to stick with Swarm (e.g., duplicating the services in my stack file and using global).
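The "duplicate the services and use global" hack mentioned above would look roughly like this in a stack file (service names, image, and the `group` label are all hypothetical): each copy is a global service pinned to a disjoint set of nodes via a label, so every node runs exactly one replica of each copy.

```yaml
version: "3.4"
services:
  # Copy A of the pipeline step, one task per node labelled group == a
  pipeline-step-a:
    image: mytask:latest
    deploy:
      mode: global
      placement:
        constraints:
          - node.labels.group == a

  # Copy B of the same step, one task per node labelled group == b
  pipeline-step-b:
    image: mytask:latest
    deploy:
      mode: global
      placement:
        constraints:
          - node.labels.group == b
```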
It would be really nice if you could plug in your own scheduler. Then people could actually experiment with their own schedulers to see if they help as much as people think they would. For example, I would like it if the scheduler counted the cost of extending an overlay network to a new node, but there's no easy way of testing this.
Docker Swarm's library SwarmKit is open source and has its own CLI, you could play with that rather than using the one that's built into the dockerd/cli.
As a workaround, in Docker 1.13, you can define a dummy host-mode port (13999 in this example); this will effectively limit the service to one replica per node:

```shell
docker service create \
  --name x \
  --replicas 2 \
  -p mode=host,published=13999,target=13999 \
  alpine:3.4 sleep 3000
```
The problem with using the "dummy host port" hack is that you cannot use start-first as the order for the update config (Docker cannot start a 2nd instance with the same port while the old instance is still running). I am now trying with host mode, but I am not sure I understand why I still need such a workaround even with the latest Docker version.
Is there any effort to create a pull request for this?
Do we know which projects it involves?
@thaJeztah @cpuguy83 @aluzzardi @aaronlehmann What are your thoughts: would you allow a --max-replicas-per-node parameter to be included in Docker if I create a PR for it?
We have a lot of small services where just one replica is enough to handle all the load, but because of availability requirements we want to have one copy running in each of two physically separated datacenters at all times, so they can survive a datacenter failure.
Deploying these services with the parameters --replicas 2 and --placement-pref 'spread=node.labels.datacenter' does the trick, but the issue is that when we do maintenance on one datacenter, the swarm scheduler will migrate the second replica to the other datacenter, and, what is even worse, it does not migrate it back to its original place when the server comes back online.
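The deployment described above, as a sketch (service name, image, and the `datacenter` label values are assumptions for illustration):

```shell
# Nodes are labelled with their datacenter beforehand, e.g.:
#   docker node update --label-add datacenter=dc1 node1
#   docker node update --label-add datacenter=dc2 node2

# Spread the two replicas evenly across the datacenter label values
docker service create \
  --name myservice \
  --replicas 2 \
  --placement-pref 'spread=node.labels.datacenter' \
  myimage:latest
```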
(yes, I have seen these workarounds but IMO this should be handled by swarm scheduler).
I think it's worthwhile. But this should be proposed on the swarmkit repo.
Sure. Good. Then I will start studying the swarmkit scheduler logic...
I got an important design question on my PR:
Here's a big question for you, though, that we'll need an answer to before we proceed: If the user is trying to do a rolling update, and they have the update set to Start First, but the maximum number of tasks is already present on every node, then the update will be unable to proceed. The user will have to first scale the service down, then do the rolling update. Is this expected and acceptable behavior?
but I'm not currently using that feature, so it does not matter as much for me; it would be better if everyone following this issue shared their thoughts about it.
So let's vote on it:
I will wait for your votes for one week and implement the other parts in the meantime.
EDIT: Looks like the hard limit got one more vote (4/3).
I think it should be possible to get both behaviors, depending on the update_config: order specified by the user.
The first choice you listed should occur when the user specifies: update_config: order: start-first, because the overlapping container behavior is what occurs with that option. So users should not be surprised that when they have chosen that option, _--max-replicas-per-node_ is temporarily exceeded.
The second choice you listed should occur when the user specifies update_config: order: stop-first, because in this case the overlapping behavior should not occur. In this case, a user would be surprised if _--max-replicas-per-node_ was even temporarily exceeded.
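In compose-file terms, the two behaviors discussed above correspond to the update_config order setting (service and image names are placeholders):

```yaml
version: "3.4"
services:
  myservice:
    image: myimage:latest
    deploy:
      replicas: 2
      update_config:
        # start-first: the new task starts before the old one stops, so a
        # per-node replica limit could be exceeded temporarily during updates
        order: start-first
        # stop-first (the default): the old task stops before the new one
        # starts, so the limit would never be exceeded
        # order: stop-first
```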
@peterstory sure, that extra logic would only be in use with update_config: order: start-first, but the question is more about what different people want to achieve with this limit.
Even a hard limit works fine as long as the user makes sure that, for example, --replicas 2 is paired with --max-replicas-per-node 1. But let's wait for the result of the voting...
@olljanat My use case for this is being able to deploy HA-services like cassandra, etcd, patroni (postgres HA) etc. These kinds of systems are designed to survive failure of nodes as long as a majority are still alive.
Without max-replicas-per-node this is dangerous, since on machine failure swarm might schedule two replicas on the same machine. This leads the HA service to believe it's replicated and can proceed, when in fact a single machine failure can bring down the majority.
Therefore I believe a hard-limit is preferable so that a degraded state is visible and can be handled accordingly.
Many HA-databases/caches replicate large amounts of data to newly started nodes. This can overload both the disk, network and ram if two tasks are started on the same machine, if the machine is sized for a single task.
@totalorder it was actually implemented, with the hard limit, and merged to swarmkit in https://github.com/docker/swarmkit/pull/2758
The next steps are to get it merged into moby via https://github.com/moby/moby/pull/37940 (ready)
and finally into the CLI via https://github.com/docker/cli/pull/1410
The target is to get them out as part of Docker CE 19.03, which is the next version where the API can change.
@olljanat Awesome! Great work! I'm a bit confused about the relationship between swarmkit and docker swarm so I missed that.
@totalorder in short:
When these are ready, Docker, Inc. employees will merge them into https://github.com/docker/docker-ce for the actual version release.
Was this actually released/included in the latest Docker release, 18.09 CE?
@si458 no, the daemon-side changes (https://github.com/moby/moby/pull/37940 and https://github.com/docker/swarmkit/pull/2758) were merged after the 18.09 branch was cut; the CLI pull request is not merged yet https://github.com/docker/cli/pull/1410
True. API version 1.40 and compose file format 3.8 will be needed for this. Hopefully we get it out as part of 19.03.
Anyway, I'm a bit out of ideas on how to get stack support for this working (https://github.com/docker/cli/pull/1410#issuecomment-451517493), so I would appreciate it if someone can help with that.
This feels like a placement constraint or placement preference thing to me. Maybe it should be implemented as a constraint or pref instead? Then we could utilize an existing mechanism and even support more complex placement requirements, e.g. --constraint 'replicas-per-node>=3' --constraint 'replicas-per-node<2'.
Current constraints and prefs are not very powerful IMO, and could use some upgrades.
@ushuz maybe, but it is too late for that feedback now. The swarmkit side changes in docker/swarmkit#2758 were merged 3 months ago and the moby changes in #37940 20 days ago, so I will not change the implementation anymore; I will only include the CLI-side implementation to support the current logic.
This is now implemented and it will be released as part of Docker 19.03.
You can see how it works with a stack in https://github.com/docker/cli/pull/1410 and without a stack (docker service ...) in https://github.com/docker/cli/pull/1612
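For readers arriving here later, the released feature looks like this (service and image names are placeholders). On the CLI:

```shell
docker service create \
  --name myservice \
  --replicas 3 \
  --max-replicas-per-node 1 \
  myimage:latest
```

And the compose file format 3.8 equivalent:

```yaml
version: "3.8"
services:
  myservice:
    image: myimage:latest
    deploy:
      replicas: 3
      placement:
        max_replicas_per_node: 1
```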
Most helpful comment
/cc @aaronlehmann
So we actually have an "HA scheduler" coming up that will attempt to spread your replicas across different machines (see https://github.com/docker/swarmkit/issues/308).
That will be the default scheduling strategy in 1.13.
Long story short, if you scale a service to 3 replicas, all of them will end up on different machines if possible.
If you don't have enough machines, then new replicas will go on the node with the least replicas already running.
Would that solve your problem?
You should be able to test on a master binary soon enough