0.12.4 (both servers/clients)
I have a "service" type job running with count=1.
When I stop an alloc, Nomad starts 2 new allocs instead.
It was ok on 0.12.1.
Reverting back to 0.12.1 (servers/clients) restored the normal behaviour: you stop an alloc and Nomad starts 1 alloc.
Hi @roman-vynar! Thanks for reaching out! I'm investigating this but am sadly unable to reproduce it - can you please provide more detailed instructions along with sample output and logs?
Here is my attempt at a reproduction - note that it only has a single running alloc at the end.
Script
mars-2:aa notnoop$ nomad job init --short
Example job file written to example.nomad
mars-2:aa notnoop$ nomad job run ./example.nomad
==> Monitoring evaluation "dd48655b"
Evaluation triggered by job "example"
Allocation "1c3f8de9" created: node "ce7fbaff", group "cache"
Evaluation within deployment: "5237c8eb"
Allocation "1c3f8de9" status changed: "pending" -> "running" (Tasks are running)
Evaluation status changed: "pending" -> "complete"
==> Evaluation "dd48655b" finished with status "complete"
mars-2:aa notnoop$ nomad job status
ID Type Priority Status Submit Date
example service 50 running 2020-09-10T09:18:13-04:00
mars-2:aa notnoop$ nomad job status example
ID = example
Name = example
Submit Date = 2020-09-10T09:18:13-04:00
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
cache 0 0 1 0 0 0
Latest Deployment
ID = 5237c8eb
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
cache 1 1 1 0 2020-09-10T09:28:23-04:00
Allocations
ID Node ID Task Group Version Desired Status Created Modified
1c3f8de9 ce7fbaff cache 0 run running 12s ago 1s ago
mars-2:aa notnoop$ nomad alloc stop 1c3f8de9
==> Monitoring evaluation "141f5057"
Evaluation triggered by job "example"
Allocation "c08b66d5" created: node "ce7fbaff", group "cache"
Evaluation within deployment: "5237c8eb"
Allocation "c08b66d5" status changed: "pending" -> "running" (Tasks are running)
Evaluation status changed: "pending" -> "complete"
==> Evaluation "141f5057" finished with status "complete"
mars-2:aa notnoop$ nomad job status example
ID = example
Name = example
Submit Date = 2020-09-10T09:18:13-04:00
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
cache 0 0 1 0 1 0
Latest Deployment
ID = 5237c8eb
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
cache 1 1 1 0 2020-09-10T09:28:23-04:00
Allocations
ID Node ID Task Group Version Desired Status Created Modified
c08b66d5 ce7fbaff cache 0 run running 9s ago 8s ago
1c3f8de9 ce7fbaff cache 0 stop complete 34s ago 8s ago
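For reference, the example.nomad used above is just the stock output of nomad job init --short, roughly the following (exact contents vary by Nomad version, so treat this as an approximation rather than the literal file):

job "example" {
  datacenters = ["dc1"]

  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}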
Back to 0.12.4. It is consistently reproducible:
$ nomad node status -verbose
ID DC Name Class Address Version Drain Eligibility Status
d3e83667-d812-69f8-b279-4352e115aafd roman nomad-10-0-7-219 <none> 127.0.0.1 0.12.4 false eligible ready
866b33a1-6826-3ac4-19a9-24ac4413b9c5 roman nomad-10-0-7-165 <none> 127.0.0.1 0.12.4 false eligible ready
c0b8877e-6925-ae7b-bb19-103594acce97 roman nomad-10-0-7-233 <none> 127.0.0.1 0.12.4 false eligible ready
991c40ba-40d9-73bc-f642-37562934002d roman nomad-10-0-7-171 <none> 127.0.0.1 0.12.4 false eligible ready
c4db3ccb-e9c3-f451-2812-f844935cbf98 roman nomad-10-0-7-87 <none> 127.0.0.1 0.12.4 false eligible ready
$ nomad job status ax-man
ID = ax-man
Name = ax-man
Submit Date = 2020-09-10T16:07:59+03:00
Type = service
Priority = 50
Datacenters = roman
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
ax-man 0 0 1 0 16 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
9bf9521e c0b8877e ax-man 59 run running 33m28s ago 33m13s ago
$ nomad alloc stop 9bf9521e
==> Monitoring evaluation "bc8db7f9"
Evaluation triggered by job "ax-man"
Allocation "1769c470" created: node "c0b8877e", group "ax-man"
Allocation "d6c7aecc" created: node "c4db3ccb", group "ax-man"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "bc8db7f9" finished with status "complete"
$ nomad job status ax-man
ID = ax-man
Name = ax-man
Submit Date = 2020-09-10T16:07:59+03:00
Type = service
Priority = 50
Datacenters = roman
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
ax-man 0 0 2 0 17 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
1769c470 c0b8877e ax-man 59 run running 13s ago 13s ago
d6c7aecc c4db3ccb ax-man 59 run running 13s ago 13s ago
9bf9521e c0b8877e ax-man 59 stop complete 34m24s ago 13s ago
Notice the following messages from the stop command:
Allocation "1769c470" created: node "c0b8877e", group "ax-man"
Allocation "d6c7aecc" created: node "c4db3ccb", group "ax-man"
Job definition specs, where applicable (a consolidated sketch follows below):
update {
  max_parallel      = 1
  health_check      = "checks"
  min_healthy_time  = "10s"
  healthy_deadline  = "5m"
  progress_deadline = "10m"
  auto_revert       = false
  auto_promote      = true
  canary            = 1
}

count = 1

restart {
  attempts = 0
  interval = "30m"
  delay    = "30s"
  mode     = "fail"
}

reschedule {
  unlimited      = true
  delay          = "15s"
  delay_function = "exponential"
  max_delay      = "5m"
}

driver = "docker"
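For context, here is a consolidated sketch of a job that matches the fragments above. The job/group/task names and datacenter are taken from the status output; the Docker image is a placeholder, and the service/check registration that health_check = "checks" depends on is omitted, so treat this as an illustration rather than the actual job file:

job "ax-man" {
  datacenters = ["roman"]
  type        = "service"

  update {
    max_parallel      = 1
    health_check      = "checks"
    min_healthy_time  = "10s"
    healthy_deadline  = "5m"
    progress_deadline = "10m"
    auto_revert       = false
    auto_promote      = true
    canary            = 1
  }

  group "ax-man" {
    count = 1

    restart {
      attempts = 0
      interval = "30m"
      delay    = "30s"
      mode     = "fail"
    }

    reschedule {
      unlimited      = true
      delay          = "15s"
      delay_function = "exponential"
      max_delay      = "5m"
    }

    task "ax-man" {
      driver = "docker"

      config {
        image = "redis:3.2"   # placeholder image, not from the report
      }
    }
  }
}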
5 minutes later the extra alloc is still running:
Summary
Task Group Queued Starting Running Failed Complete Lost
ax-man 0 0 2 0 17 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
1769c470 c0b8877e ax-man 59 run running 5m14s ago 5m1s ago
d6c7aecc c4db3ccb ax-man 59 run running 5m14s ago 5m2s ago
9bf9521e c0b8877e ax-man 59 stop complete 39m25s ago 5m14s ago
Please let me know if you need anything else.
Thanks for the quick response!
Thank you very much for the report. I have confirmed the problem and pushed a fix that rolls back to the old canary behavior to avoid this regression. It seems that it disproportionately affects single-alloc service jobs with canary deployments.