Nomad: [feature] Timeout for batch jobs

Created on 3 Oct 2016 · 21 comments · Source: hashicorp/nomad

I'm currently running a handful of periodic batch jobs. It's great that nomad doesn't schedule another one if a current one is still running. However, I think it would be helpful to stop a batch job if it runs beyond a set timeout. Maybe a script could be run on timeout, or the job's own script would just have to handle the signal.

stage/needs-discussion theme/jobspec type/enhancement

All 21 comments

Hey Sheldon,

You could accomplish this yourself by wrapping what you actually want to run in a little script that waits until either the task finishes or the timeout expires, and then returns exit 1.
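
A minimal sketch of such a wrapper, assuming a bash entrypoint; the `/usr/local/bin/real-task` path and the `TIMEOUT_SECS` default are illustrative placeholders, not anything Nomad provides:

```bash
#!/usr/bin/env bash
# Sketch of a timeout wrapper for a batch task. Paths and the default timeout
# are placeholders; adapt them to the task you actually run.
set -euo pipefail

TIMEOUT_SECS="${TIMEOUT_SECS:-3600}"   # hard limit for the wrapped task

/usr/local/bin/real-task "$@" &        # start the real batch work in the background
task_pid=$!

# Watchdog: send SIGTERM to the task if it outlives the timeout.
( sleep "$TIMEOUT_SECS" && kill -TERM "$task_pid" 2>/dev/null ) &
watchdog_pid=$!

if wait "$task_pid"; then
  status=0
else
  status=$?                            # non-zero if the task failed or was killed on timeout
fi

kill "$watchdog_pid" 2>/dev/null || true   # cancel the watchdog if it is still sleeping
exit "$status"
```

The coreutils `timeout` binary mentioned further down the thread achieves the same effect with less code.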

That's how I'm handling it right now but I was thinking it would be cool if Nomad could do it. I understand if it seems like bloat though :)

+1 - important feature for batch runs. Not so clean to handle this ourselves

I think this function should be available not only for batch jobs but also for regular services; that would help us implement a "chaos monkey" function right inside nomad. It would increase system stability, because every service would have to be ready for downtime.

As mentioned in Gitter chat, the timeout binary in coreutils can do this inside the container if you need a fix right now.

`timeout 5 /path/to/slow/command with options`

I think it would be better to add a "max_lifetime" option, with the ability to specify it as either a range or a concrete value. For example, 10h-20h would mean the daemon might be killed after 11h or after 19h, but never later than 20h. Implementing a chaos monkey this way would be a great feature in my opinion, and you wouldn't need any 3rd-party apps =)
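
Nomad has no such option today; purely to illustrate the idea, a wrapper could pick a random lifetime inside the range itself. The bounds and the `/path/to/service` below are hypothetical:

```bash
#!/usr/bin/env bash
# Illustration only: terminate the wrapped service at a random point inside a lifetime range.
set -euo pipefail

MIN_LIFETIME=$((10 * 3600))   # lower bound of the range: 10h
MAX_LIFETIME=$((20 * 3600))   # hard upper bound: 20h

# Pick a random lifetime, in seconds, from [MIN_LIFETIME, MAX_LIFETIME].
lifetime=$(shuf -i "${MIN_LIFETIME}-${MAX_LIFETIME}" -n 1)

# coreutils timeout sends SIGTERM once the chosen lifetime elapses.
exec timeout "$lifetime" /path/to/service "$@"
```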

If a timeout function is implemented, it can be used to mimic the classic HPC schedulers like PBS, TORQUE, SGE, etc.

Having it as a first-class feature would indeed be useful for many folks, including me!
Hope this _does_ get implemented.

Thanks and Regards,
Shantanu

Just adding a use case here. Let's say I have an app that implements a timeout inside itself. Let's assume there's a bug in this app which causes it to hang occasionally under certain conditions, so it never reaches its hard-coded "timeout" because it has essentially stopped responding. We should have a way in the infrastructure to enforce a relatively simple, drop-dead deadline after which the scheduler will kill a task that's unresponsive.

Nomad is better equipped to provide that failsafe protection at the infrastructure level than timeout code rewritten in every app, simply because it doesn't rely on the app's runtime to perform the kill.

I agree, but as a temporary workaround, how about timeout with the kill timeout parameter?

http://man7.org/linux/man-pages/man1/timeout.1.html
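
For example, a sketch where the one-hour limit and the 30-second kill grace period are illustrative values:

```bash
# Send SIGTERM after 1 hour; if the command is still alive 30 seconds later, send SIGKILL.
timeout --signal=TERM --kill-after=30s 1h /path/to/slow/command with options
```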

+1 for this, it's basic functionality for a job scheduler. Amazing this doesn't exist. @mlehner616 is obviously correct about why having the timeout checker inside the container itself is a boneheaded recommendation. We got bit by 3 hung jobs out of 100,000 that prevented our elastic infrastructure from scaling back down, costing a nice chunk of change.

@Miserlou as mentioned earlier in this thread, a workaround would be to wrap your app in a timeout script. There is an example of how you can do it above. That might save your bacon in the scenario you described.

A timeout for batch jobs is an important safeguard. We can't rely on jobs' good behaviour... A job without a time limit is effectively a _service_, hence a timeout is crucial to constrain buggy tasks that might run for too long...

I'd also very much like to see nomad implement this, for the use case where nomad's parameterized jobs are used as a bulk task processing system, similar to the workflow described here: https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch

There are major advantages for us in using this workflow, as it takes advantage of infrastructure already in place to handle autoscaling, rather than having to set up a new system using celery or similar task queues. The lack of a built-in timeout mechanism for batch jobs makes the infrastructure required for this fairly common (afaik) use case quite a bit more complex.

Handling the timeout in the tasks themselves is not a safe approach, for the reasons mentioned above, and would also increase the complexity of individual tasks, which is not ideal. Therefore the dispatcher must manage the timeout itself and kill batch jobs once the timeout is reached. This makes it inconvenient to manage jobs which need different timeouts using a single bulk task management system, as their timeout configuration needs to be stored centrally, separate from the actual job specification.

There are workarounds for this, but it would be very nice to see nomad itself handle timeouts, both for safety and to simplify using nomad.
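
As a rough sketch of that dispatcher-side workaround: the parameterized job name `bulk-task`, the deadline, and the parsing of the CLI's human-readable output below are all illustrative assumptions, not guaranteed interfaces:

```bash
#!/usr/bin/env bash
# Sketch: dispatch a parameterized batch job and enforce a deadline from the dispatcher side.
set -euo pipefail

PARENT_JOB="bulk-task"   # hypothetical parameterized job name
DEADLINE_SECS=1800       # illustrative per-dispatch deadline
POLL_SECS=30

# Dispatch a child job and capture its ID.
# NOTE: grabbing the "Dispatched Job ID" line assumes the CLI's human-readable output format.
dispatch_id=$(nomad job dispatch -detach "$PARENT_JOB" | awk '/Dispatched Job ID/ {print $NF}')
echo "dispatched: $dispatch_id"

elapsed=0
while (( elapsed < DEADLINE_SECS )); do
  # A "dead" status means the dispatched job has finished, successfully or not.
  if nomad job status -short "$dispatch_id" | grep -Eq '^Status[[:space:]]+= dead'; then
    echo "finished within deadline"
    exit 0
  fi
  sleep "$POLL_SECS"
  elapsed=$(( elapsed + POLL_SECS ))
done

# Deadline exceeded: stop the dispatched child so it cannot run forever.
echo "deadline exceeded; stopping $dispatch_id" >&2
nomad job stop "$dispatch_id"
exit 1
```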

@Magical-Chicken - I strongly, strongly recommend you avoid using Nomad for that purpose. There are claims made in that blog post which are simply untrue, and many people are being duped by it.

See more here:
https://github.com/hashicorp/nomad/issues/4323#issuecomment-426419394

@Miserlou Thanks for the heads up, that is a pretty serious bug in nomad, and is pretty concerning since we have a large amount of infrastructure managed by it. The volume of dispatches we are handling currently isn't too high, so I'm hoping nomad will be ok to use here in the short term, but long term I will definitely consider switching this system over to a dedicated task queue.

> Nomad will crash with out of memory

Hopefully HashiCorp intends to fix this. Maybe they could add a configuration option for servers to use a memory-mapped file to store state rather than risking an OOM kill, or even have servers start rejecting additional job registrations when they're running out of memory. There's really no case where it's acceptable for servers to crash completely, or for secondary servers to fail to elect a new leader after the leader is lost.

+1. Are there any plans to include this feature any time soon? Seems pretty important. Wrapping tasks in a timeout script is a bit hacky.

+1

+1

A job run limit is an essential feature of a batch scheduler. All major batch schedulers (PBS, Slurm, LSF, etc.) have this capability. I've seen a growing interest in a tool like nomad, something that combines many of the features of a traditional batch scheduler with K8s. But without a run time limit feature, integration into a traditional batch environment would be next to impossible. Is there any timeline on adding this feature to nomad?

+1

@karlem you should add a +1 to the first post rather than a separate message.
That's how they track demand for a feature.

If you know more folks who might be interested in this, you should encourage them to do so as well! 😉
