Hi,
I see that Nomad enforces a strict memory allocation policy.
Is there a way to work around it? I would like containers launched from Nomad to run as regular Docker containers, without limits (just like in Docker Swarm, for example).
Also, using swap can act as a safety net (especially in containers running things like Spark, which has complicated memory allocation).
Is there a way to allow swap usage?
Enforcing resource allocation is a great feature, but it would be nice if it could be disabled for certain types of containers.
afaik oversubscription is planned for a later release - since Nomad does bin packing, oversubscription is incredibly hard to mix in with that :)
Today there is no workaround for this behavior, and afaik it's at least a release or two away.
So, does this mean that I cannot use swap in my infrastructure if I choose to work with Docker and Nomad?
For now, yes, I think so. I'll leave any additional comments and clarification to someone from HashiCorp :)
A.
If I understand correctly, oversubscription will not solve this issue - it is more like AWS spot instances than a real solution like Docker Swarm.
B.
I see that Mesos has the same behavior - looks like everyone copied from Google's Borg...
This design is great when you have a cluster of physical machines, but it has some disadvantages when working in the cloud.
In the cloud, where you can dedicate machines to each service, scheduling should be much simpler. Let the user choose which machine type to use for which service, and let the containers just run there at the Docker and OS level - there is no need for Nomad to allocate resources when the user has already decided on this. QoS is a nice feature in data centers; in the cloud it is just a limitation.
For example:
I have a Spark cluster, a Kafka cluster, etc.
For each cluster I chose a different instance type up front, and I just run one container on each machine. Why should I specify resources in Nomad? Why can't I use swap for Spark, where it keeps my app stable at peak moments? It doesn't make any sense...
Docker Swarm, for example, is much simpler and solves this out of the box - there is no hardware allocation, just constraints and affinities.
This approach is better for most cloud use cases.
I think adding a flag that removes all cgroup limitations would solve all of these problems.
This flag should generate a warning regarding QoS, and that's it - in this mode the user is in charge of QoS, not Nomad.
Hey @OferE,
Your use case is slightly different: static partitioning of nodes to types of jobs is not the design goal of Nomad. Nomad is designed to be run in a resource-agnostic way as much as possible. Jobs should declare what they need, and the decision of where to place them should be left to the scheduler. In order to guarantee both that there are enough resources on the chosen machine and that the placed jobs get the runtime performance they need, we do resource isolation and disable swap.
If the machine is swapping, the performance loss is significant, and in a system designed to be multi-tenant with bin-packed machines that is unacceptable.
If you would like finer-grained control, we provide the raw_exec driver, which allows you to make these decisions. In the future there will also be pluggable drivers, so you could build your own that is less restrictive. However, for the built-in drivers we won't be making that concession.
Thanks,
Alex
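For reference, a minimal raw_exec job along the lines Alex suggests might look roughly like the sketch below - the job name, script path, and image are placeholders, not anything from this thread. Note the resources block is still used for scheduling, even though raw_exec applies no cgroup isolation:

job "example" {
  datacenters = ["dc1"]

  group "app" {
    task "wrapper" {
      driver = "raw_exec"

      config {
        # placeholder path - a shell script that wraps docker run
        command = "/usr/local/bin/docker-wrapper.sh"
        args    = ["myimage:latest"]
      }

      # declared for scheduling only; raw_exec enforces no cgroup limits
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}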
Thanks for the info - I'll try raw_exec instead of the docker driver.
I urge you to rethink the "static partitioning" use case. Dynamically allocating resources across the cluster is not suitable for every workload - Kafka and Spark are great examples. It just won't work there.
You need dedicated machines for them, and there is no point in limiting their processes. You want to get the full efficiency of the cloud machine without worrying that you specified a resource incorrectly.
@OferE I am not sure why you think those require a dedicated machine? They require dedicated resources. In those cases I would specify large resource requirements so that they are guaranteed enough resources, and as such in most cases they won't be multi-tenant.
I agree that there are use cases that require whole machines (databases, for example). To support that case we will add a resource option in the future to reserve the whole node. But for most applications this is not the case.
Spark has internal logic and defaults of its own - for example, using all the cores of the machine.
Running in a cgroup environment will confuse it.
I understand your vision, and I also think that someday many major products will understand containerized environments and align their inner logic accordingly.
But tuning Spark (or worse, PySpark) memory allocation is not trivial, and making it work under cgroups is too much at this time.
Also - reserving an entire instance for just one container is not good for all cases either - there is always another container/process that needs to run on these instances: agents for log collection and monitoring, for example. In fact, it would be very nice to limit those agents and let the main container/process run wild :-)
One more thing I'd like to raise: there is also the matter of dev vs. production.
Dev environments are significantly weaker than production ones, since you want to save money.
Writing two versions of the Nomad job files (to allow two different resource isolation settings) is too limiting.
For development it would also be nice not to have to specify resources, since developers change the instance type all the time.
Nomad is a great project - I like it much more than Mesos/Swarm/Kubernetes.
If I found a magic fish that granted me a wish, I'd ask for the following types of resource isolation:
Minimal resource allocation - make sure my container runs in a strong env.
Maximal resource allocation - limit infra containers (monitoring/log collection).
Minimal + maximal - replicate my logic according to your design.
None - dev + static partitioning...
I would have declarative dev and production resource isolation.
This is how a perfect world looks :-)
The raw_exec driver is not working correctly: stopping the job doesn't kill the container - so it is not a good workaround.
Can you please share your job file and how the script is executed, if you use any shell to wrap it? Hard to help debug without any information :)
raw_exec does work just fine, so it's probably that you need to trap a signal to make sure docker stops the container :)
Thanks - I just realized that. I didn't trap it, lol.
BTW - which signal should I trap?
Per https://www.nomadproject.io/docs/operating-a-job/update-strategies/handling-signals.html it should be SIGINT :)
Thank you so much for your help on this!
I just did a test and I got SIGTERM though, better test for yourself :)
It seems like it doesn't work. I trapped SIGTERM and SIGINT, and my script never gets them.
When I send the signal myself, the script stops.
The trap never gets any signal from Nomad :-(
This is my script:
#!/bin/bash
# handler for the signals sent from nomad to stop the container
my_exit()
{
    echo "killing $CID"
    docker stop --time=5 "$CID"   # try to stop it gracefully
    docker rm -f "$CID"           # remove the stopped container
}
trap 'my_exit; exit' SIGHUP SIGTERM SIGINT

# build the docker run command from the task arguments
# (an array keeps arguments with spaces intact)
CMD=(docker run -d "$@")

echo "docker wrapper: the docker command that will run is: ${CMD[*]}"
echo "from here on it is the docker output:"
echo

# actually run the command and capture the container id
CID=$("${CMD[@]}")

# docker logs is printed in the background
docker logs -f "$CID" &

# wake up every 3 seconds so the trap can handle pending signals
while :
do
    sleep 3
done
I found the problem :-)
The problem is in my sleep and the grace period.
I have sleep 3, and the kill_timeout default is 5 - bash only delivers the trap after the foreground sleep finishes, so between the sleep and the time docker stop needs, my script is killed before the handler completes.
Changing my script's sleep to 1 solved the issue.
I think I will stay with sleep 3 and explicitly set kill_timeout to 45 seconds.
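For reference, kill_timeout is set per task in the job file. A minimal sketch, assuming the wrapper task from earlier in this thread (the task name and script path are placeholders):

task "wrapper" {
  driver = "raw_exec"

  # give the trap handler enough time to run docker stop / docker rm
  kill_timeout = "45s"

  config {
    command = "/usr/local/bin/docker-wrapper.sh"
  }
}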
If you don't mind building Nomad from source, there is a trivial patch for Nomad 0.5.1.
It adds a "memory_mb" docker driver option which, if set to non-zero, overrides the memory limit specified in the task resources.
https://gist.github.com/drscre/4b40668bb96081763f079085617e6056
You can allow swap in a similar way.
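A sketch of what a task using the patched option might look like - this assumes the gist above has been applied; memory_mb is not a stock Nomad option, and the image name here is a placeholder:

task "spark-worker" {
  driver = "docker"

  config {
    image = "myorg/spark-worker:latest"
    # patched option from the gist: a non-zero value overrides the
    # memory limit declared in the resources block
    memory_mb = 8192
  }

  resources {
    memory = 1024  # still used by the scheduler for bin-packing
  }
}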
Interesting, I'll wait for the official 0.5.1 and test it. Thanks for the info.