Nomad: Metrics for Placement Failures

Created on 22 Nov 2019 · 6 comments · Source: hashicorp/nomad

Hi everyone,
I'm trying to configure my AWS autoscaling group to add or remove Nomad clients automatically based on usage.
The problem is that I can't find a metric for placement failures when resources are exhausted.
When I have a placement failure, the metrics "nomad_client_allocations_pending" and "nomad_client_allocations_blocked" are always at 0.
Is there any metric that returns the number of placement failures?

Nomad Client logs (if appropriate)

```
nomad run -verbose nomad.job
==> Monitoring evaluation "7c3c156b-9fe8-4851-622a-8004ede7faf9"
    Evaluation triggered by job "My-huge-job"
    Evaluation within deployment: "825fd638-8c18-f676-0ffe-e75ac79ab269"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7c3c156b-9fe8-4851-622a-8004ede7faf9" finished with status "complete" but failed to place all allocations:
    Task Group "application" (failed to place 1 allocation):
      * Resources exhausted on 3 nodes
      * Dimension "cpu" exhausted on 3 nodes
    Evaluation "6f5b244d-60e5-d4ac-2487-2a9ad0b4b8e7" waiting for additional capacity to place remainder
```


All 6 comments

Hi @anthonymq! There isn't a metric directly for placement failures, but there are a few other metrics you might want to dig into to get the results you want:

nomad_nomad_blocked_evals_total_blocked
nomad_nomad_job_status_pending
nomad_nomad_job_summary_queued

can all give you insight into whether allocations aren't being placed.
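
For example, you can sanity-check that these are exposed by hitting a server's metrics endpoint directly (a rough sketch; it assumes the API is listening on localhost:4646 and that prometheus_metrics is enabled in the telemetry stanza):

```sh
# Query Nomad's Prometheus-format metrics endpoint on a server and filter
# for the metrics listed above.
curl -s "http://localhost:4646/v1/metrics?format=prometheus" \
  | grep -E 'blocked_evals_total_blocked|job_status_pending|job_summary_queued'
```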

Something I've done in the past when running Nomad on AWS is to push jobspecs from my CI/CD pipeline into SQS, then have a little queue consumer pulling jobs off the queue and running them with nomad run. When it got errors, I'd have it re-queue the job in SQS, and I'd use the SQS queue depth as the metric that AWS autoscaling was watching.
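
A stripped-down sketch of what that consumer loop could look like (the queue URL, the assumption that the message body is a full jobspec, and retrying via the visibility timeout are all illustrative, not what I actually ran):

```sh
#!/usr/bin/env bash
# Pull a jobspec off SQS, submit it to Nomad, and only delete the message on
# success so failed jobs reappear after the queue's visibility timeout.
QUEUE_URL="https://sqs.eu-west-1.amazonaws.com/123456789012/nomad-jobspecs"

while true; do
  msg=$(aws sqs receive-message --queue-url "$QUEUE_URL" --wait-time-seconds 20)
  if [ -z "$msg" ]; then
    continue
  fi

  body=$(echo "$msg" | jq -r '.Messages[0].Body')
  receipt=$(echo "$msg" | jq -r '.Messages[0].ReceiptHandle')

  # "nomad run -" reads the jobspec from stdin.
  if echo "$body" | nomad run -; then
    aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$receipt"
  fi
done
```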

I'm sorry, but I can't find the metrics you are talking about in my Prometheus.
Otherwise, your solution is pretty interesting!

Is your Prometheus scraping both the clients and the servers?
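
If the server agents don't have telemetry enabled yet, something along these lines in their config should expose those metrics (a minimal sketch; the file path, collection interval, and restart command are assumptions about your setup):

```sh
# Append a telemetry stanza to the Nomad server agent config and reload it.
cat > /etc/nomad.d/telemetry.hcl <<'EOF'
telemetry {
  collection_interval        = "1s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
EOF
systemctl restart nomad
```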

Alright, I forgot to enable telemetry on the servers!
Now I see those metrics.
I can easily see when to add new clients if a job is blocked, but do you have any idea for the downscaling query?
Should I only look for clients with 0 running jobs to terminate them?
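
For example, I was thinking of something like this to see how many running allocations each client has (a rough sketch against the HTTP API; the server address and jq are assumptions):

```sh
# List every client node and count its running allocations; nodes reporting 0
# would be candidates for termination. Assumes the Nomad API on localhost:4646.
for node in $(curl -s http://localhost:4646/v1/nodes | jq -r '.[].ID'); do
  running=$(curl -s "http://localhost:4646/v1/node/$node/allocations" \
    | jq '[.[] | select(.ClientStatus == "running")] | length')
  echo "$node $running"
done
```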

> I can easily see when to add new clients if a job is blocked, but do you have any idea for the downscaling query?
> Should I only look for clients with 0 running jobs to terminate them?

Right! And you'll want to run a node drain on them first so that the servers don't schedule work onto them while you're in the process of terminating them. You _could_ do this by adding a script that runs nomad node drain -self to the instance shutdown process, but I wouldn't recommend that because it'll be hard to guarantee it completes before the host shuts down. There are also AWS autoscaling lifecycle hooks, which you could wire up to run the nomad node drain and hold off the termination until the drain has succeeded.
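
A rough sketch of what that lifecycle-hook handler could run on the instance (the hook name, ASG name, and metadata lookup are placeholders, not a tested implementation):

```sh
#!/usr/bin/env bash
# Drain this client, then tell the autoscaling group it's safe to terminate.
# Assumes a lifecycle hook named "nomad-drain" on the ASG "nomad-clients".
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# Blocks until the drain completes and allocations have been moved elsewhere.
nomad node drain -self -enable -yes

aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name nomad-drain \
  --auto-scaling-group-name nomad-clients \
  --lifecycle-action-result CONTINUE \
  --instance-id "$INSTANCE_ID"
```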

I'm going to close this issue. But I'd love to see what you come up with here!

Thanks for the help!
Have a good day!
