It would be very useful if Nomad could support "auto-start" jobs (particularly for system jobs). A good use case here is for something like a stats agent that should run on every host.
In Nomad's current state, the bootstrapping process involves bringing up a new cluster, waiting for convergence, then SSHing into the bastion host, writing a job file to disk, and submitting it with nomad run job.nomad (or via the API). This is not very operationally friendly, since it requires someone to sit and wait for the cluster to converge before submitting the job.
I think it would be a good feature if Nomad supported a nomad.d-style directory of job files that it would automatically nomad run at boot. These jobs could be dropped off by CM or a provisioner ahead of time, and Nomad would pick them up and run them on its own.
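To make the idea concrete, here is a rough sketch of what that could look like in the agent configuration. This is purely hypothetical proposed syntax; no such option exists in Nomad today, and the option name is made up:

```hcl
# HYPOTHETICAL sketch only -- this option does not exist in Nomad today.
# The idea: any *.nomad file dropped into this directory (by CM or a
# provisioner) would be submitted at boot, exactly as if an operator
# had run `nomad run /etc/nomad.d/jobs/<file>.nomad` by hand.
auto_run_job_dir = "/etc/nomad.d/jobs"
```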
My current use case involves fabio, the Consul load balancer, and Terraform. I want to use Terraform to spin up a bunch of hosts that already have fabio running under Nomad. This is currently a multi-step process:
1. terraform apply
2. nomad run job.nomad

If Nomad supported automatic jobs, this could be simplified to a provisioner that writes the job file to disk in a .d directory and just:
1. terraform apply

I think there are other valid use cases too, though.
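As a rough illustration of that provisioner step (the resource type, connection details, and paths here are assumptions, and the auto-run directory itself is still hypothetical):

```hcl
resource "aws_instance" "nomad_client" {
  # ... AMI, instance type, networking, and connection details omitted ...

  # Drop the fabio job file where the (hypothetical) auto-run
  # directory would pick it up when the Nomad agent boots.
  provisioner "file" {
    source      = "jobs/fabio.nomad"
    destination = "/etc/nomad.d/jobs/fabio.nomad"
  }
}
```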
What do y'all think?
I think if the job is not a system job, it is rather undefined what Nomad should be doing with it. What happens when the server is restarted? Does it resubmit the job? If the answer is only when it isn't already submitted, well, it could have been submitted so long ago that it has since been GC'd.
I think the better solution would be to have a Terraform module that submits the job to Nomad.
Hi @dadgar
I think this would only apply to system jobs. All my examples referred to system jobs; sorry for not making that clearer.
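For context, a system job is just a regular job spec with type = "system", which Nomad schedules on every eligible client. A minimal fabio-style sketch (the driver, command, and resources here are assumptions):

```hcl
job "fabio" {
  datacenters = ["dc1"]

  # "system" jobs run one instance on every eligible client node.
  type = "system"

  group "fabio" {
    task "fabio" {
      driver = "exec"

      config {
        command = "fabio"
      }

      resources {
        cpu    = 200
        memory = 128
      }
    }
  }
}
```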
The problem with having a Terraform module submit the jobs is that Terraform now needs keys to SSH into production, it has to be "nomad-aware" enough to wait for the cluster to converge, etc.
Sounds good!
This seems like something that could be done via an external program that monitors a directory and runs nomad run on a user's behalf each time a file is created or modified, retrying if it fails so that it can deal with coming up before the nomad cluster is ready. Such a program could alternatively monitor a prefix in Consul, and then you'd have #1038.
In a typical production / HA deployment, there are multiple servers. This leads me to a few questions:
- Would nomad run jobs written to one server, but not another?
- Let's suppose you write out job files to all of your servers, but then later one of those files is updated, how would nomad handle that inconsistency?
- If you have those files, and nomad has run those jobs, how do you tell nomad to stop running the job? Do you rm the file, or do you use nomad stop ...?

Overall, I love the idea of addressing the underlying operational issue here (eg, I have automatic deployment for all sorts of stuff in my stack, but then an operator has to go run a few jobs manually to kickstart them on nomad). The solution I thought of to address the questions I noted above is to have nomad look to consul kv for those jobs - #1038.
Re: the chicken-and-egg problem with consul, my plan to address that is:
I find this workflow a bit easier to manage than writing files to a .d directory for nomad (I generally write out files based on what's in consul), though I think there is a lot of utility in having nomad run jobs from a .d path too.
- Would nomad run jobs written to one server, but not another?
It doesn't actually matter. The jobs are submitted to the servers and evaluated in order. Only the leader would schedule them, and if it received the same job 10x, it would submit it once and then effectively say "I already did this" for the other 9.
- Let's suppose you write out job files to all of your servers, but then later one of those files is updated, how would nomad handle that inconsistency?
I would say the job files are only read during boot/config reload. In that case, changing the file would do nothing until you bounce that server, at which point the result would be the same as running nomad run on that file.
- If you have those files, and nomad has run those jobs, how do you tell nomad to stop running the job? Do you rm the file, or do you use nomad stop ...?
Good question. Remember, this is restricted to _system jobs_, so it's highly unlikely you would ever stop such a job. However, if you did want to stop it, you would nomad stop it. Again, the files are only loaded at boot, so changes only take effect when you reload. If you wanted to make sure the job did not start again on reload, you would stop it and rm the file.
Just bumping this thread. I have gone through the same process as Seth setting up a Nomad cluster with Fabio, and it feels wrong to have to write this into the startup config:
```sh
# Start the fabio system job
echo "Submitting fabio job..."
sudo tee /tmp/fabio.hcl > /dev/null <<"EOF"
${fabio_job}
EOF
until nomad run /tmp/fabio.hcl; do
  echo "Job failed to submit..."
  sleep 2
done
```
Would be fantastic to be able to start a job at initialization time by specifying the job in the nomad config.
That would be a really useful feature: having the cluster configure itself into the desired state and run some jobs. I have the exact same use case, except my load balancer is different. I automated it with Terraform bootstrapping the cluster and exposing the cluster's IPs via a load balancer or registering them in Route53.
Then I run another terraform plan that submits "system" jobs and various "helpers" into the fresh Nomad cluster. It would be nice if Terraform could define which jobs Nomad should run after it has been bootstrapped.
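For what it's worth, one way to express that second step is with the Terraform Nomad provider's nomad_job resource, which talks to Nomad's HTTP API rather than SSHing into a host. A sketch, with the address and file path as assumptions:

```hcl
provider "nomad" {
  address = "http://nomad.example.com:4646"
}

# Submit the fabio system job once the cluster is reachable.
resource "nomad_job" "fabio" {
  jobspec = file("${path.module}/jobs/fabio.nomad")
}
```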
Closing due to inactivity. I'm trying to get the list of issues I've submitted under control, and it doesn't look like there's interest in building this functionality at this time.