Reference: https://groups.google.com/forum/#!topic/nomad-tool/t3bFTwSVgdQ
Nomad v0.5.4
Quoted from the mailing list:
We have a system job that runs on an auto scaling group (on AWS).
The instances of this group have a Nomad class "foo", so the job definition is like:

job "test" {
  datacenters = ["dc1"]
  type        = "system"

  constraint {
    attribute = "${node.class}"
    value     = "foo"
  }

  [...]
}
So the job will be deployed on all servers in the autoscaling group, and if we scale up the group,
Nomad automatically deploys the job on the newly instantiated server. It's really cool, but at job submission we get strange output.
Here are our (simplified) cluster nodes:
- A: Instance with class="bar"
- B: Instance with class="bar"
- C: Instance with class="baz"

Autoscaling group:
- D1: Instance with class="foo"
When we run the job above we have the following output:
==> Monitoring evaluation "d1e000cd"
Evaluation triggered by job "test"
Allocation "51b3d960" modified: node "a45700d3", group "test"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "d1e000cd" finished with status "complete" but failed to place all allocations:
Task Group "test" (failed to place 3 allocations):
* Class "bar" filtered 1 nodes
* Constraint "${node.class} = foo" filtered 1 nodes
I think it's because a system job has only one evaluation, but these numbers are weird:
- Class "bar" really filtered 2 nodes
- Constraint node.class filtered 3 nodes (or indeed 1 if we subtract the previous line)
The output contains a specific line for class "bar" but not for class "baz", which is pretty weird.
And, our main problem is that the status code of the "nomad run" command is 2.
+1 to this. The non-zero exit status is the real issue for us.
I'm running into this problem on our Nomad clusters at Density with Nomad 0.7. Our CI/CD pipeline attempts to plan and run jobs via the Nomad API and reports failure with system jobs. The Nomad CLI's exit code 2 appears to reflect the failed allocations coming back from the API.
I'd be happy to contribute a fix for this, but it's not totally clear what the correct behavior should be. Should there simply be more exit codes to reflect different kinds of warnings?
@dadgar any updates regarding this issue? Encountering the same issue with Nomad 0.7.1
We are experiencing this as well.
Server: Nomad v0.7.0-rc3
Client: Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)
a "quick" work-around is to submit it over the HTTP API rather than CLI and inspect the evaluation your self
i would expect any placement due to lack of resources for a system job to fail like it does today though
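In case it helps anyone, here is a rough sketch of that approach in shell. The /v1/jobs and /v1/evaluation/<id> endpoints are Nomad's standard HTTP API; everything else (the job.json file, producing it with `nomad run -output`, and treating a "complete" evaluation as success) is my own assumption about how you might wire it up, not an official recipe.

# Sketch: register the job over the HTTP API and judge the evaluation yourself.
# Assumes job.json is the JSON registration payload for the job (e.g. rendered
# from the HCL file with `nomad run -output job.nomad > job.json`) and that jq
# is installed.
NOMAD_ADDR=${NOMAD_ADDR:-http://localhost:4646}

# Register (or update) the job; the response includes the evaluation ID.
EVAL_ID=$(curl -s -X POST -d @job.json "$NOMAD_ADDR/v1/jobs" | jq -r .EvalID)

# Fetch the evaluation and apply whatever success criteria you want, e.g.
# only fail when the evaluation itself did not complete.
STATUS=$(curl -s "$NOMAD_ADDR/v1/evaluation/$EVAL_ID" | jq -r .Status)
[ "$STATUS" = "complete" ] || exit 1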
I ran into the same issue today as well; it looks like this is more than an exit-code issue. The scheduler reports failed allocations over the HTTP API as well (so you get the same behaviour submitting over HTTP). The allocations do get scheduled properly, but it reports the filtered nodes as failed allocations.
curl -s localhost:4646/v1/evaluation/5d16340b-1ac6-625f-db46-b59d5f8534d6 | jq -r .
{
"ID": "5d16340b-1ac6-625f-db46-b59d5f8534d6",
"Type": "system",
"TriggeredBy": "job-register",
"JobID": "foo",
...
"FailedTGAllocs": {
"tg-foo": {
"NodesEvaluated": 1,
"NodesFiltered": 1,
"NodesAvailable": {
"zone2": 14,
"zone3": 14,
"zone1": 14
},
"ClassFiltered": {
"class-a": 1
},
"ConstraintFiltered": {
"${node.class} = class-b": 1
},
"NodesExhausted": 0,
"ClassExhausted": null,
"DimensionExhausted": null,
"QuotaExhausted": null,
"Scores": null,
"AllocationTime": 30605,
"CoalescedFailures": 38
}
},
...
}
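Building on the payload above: a possible (purely homegrown) heuristic is to ignore failures that are explained entirely by class/constraint filtering and only fail when nodes were actually exhausted. The jq expression and the choice of fields below are my own assumptions, not a check that Nomad documents anywhere.

EVAL_ID=5d16340b-1ac6-625f-db46-b59d5f8534d6   # the evaluation ID from above

# Count task groups whose failures are not purely filtering, i.e. where nodes
# were evaluated but ran out of resources (NodesExhausted > 0).
REAL_FAILURES=$(curl -s "localhost:4646/v1/evaluation/$EVAL_ID" \
  | jq '[.FailedTGAllocs // {} | .[] | select(.NodesExhausted > 0)] | length')

if [ "$REAL_FAILURES" -gt 0 ]; then
  echo "placement genuinely failed" >&2
  exit 1
fi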
Same thing here... Running Nomad 0.7.1, whenever I use constraints with a system job in the same workflow as described in this issue, I get placement errors even though the allocations are successful. It's like Nomad is treating a constrained node as a placement failure on system jobs when actually it is not!
For the record, this error still appears on 0.8.1.
Example code: https://pastebin.com/raw/f7yH5Q4U
Follow-up: after doing a fresh install of the Nomad server and running the same job above, no errors appear in the UI. Errors still persist when running the job via the CLI.
I've just run into this as well. I launch my jobs from Ansible, and now I have to tell Ansible that exit code 2 is OK, which is suboptimal.
Same with 0.8.4, but exit code 1...
So constraints don't work in system jobs through CI for me.
Come on guys, this is really a bug and should be dealt with. Many, if not most, people that run a service are going to constrain it to a subset of nodes. Having it throw an error for such a common use case isn't good. Here's my workaround/backflip to take care of this in Ansible. At least it'll let some errors get trapped.
failed_when: 'jobresult.rc != 0 and not jobresult.stdout.find("finished with status \"complete\" but failed to place all allocations:") > -1'
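For anyone not using Ansible, the same band-aid can be expressed as a small shell wrapper around the CLI. This is only a sketch; matching on the warning text is exactly as brittle here as it is in the failed_when expression above, and job.nomad is just a placeholder filename.

# Run the job and capture both output and exit code.
OUTPUT=$(nomad run job.nomad 2>&1); RC=$?
echo "$OUTPUT"

# Treat the known "complete but failed to place all allocations" warning as
# success; propagate every other non-zero exit code.
if [ $RC -ne 0 ]; then
  case "$OUTPUT" in
    *'finished with status "complete" but failed to place all allocations'*) exit 0 ;;
    *) exit $RC ;;
  esac
fi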
Any news on this?
@SomKen @jsaintro we will tackle this in the next minor release (0.9.1) after 0.9 is out. Sorry for the delay but our highest priority now is to finish the large 0.9 release which brings in GPU support, runtime plugins and more advanced scheduling improvements.
We are getting hit by this. Hope you fix this fast