Nomad 0.12.4 stop allocation issue

Created on 10 Sep 2020 · 3 comments · Source: hashicorp/nomad

Nomad version

0.12.4 (both servers/clients)

Issue

I have a "service" type job running with count=1.
When I stop an alloc, Nomad starts 2 new allocs instead of 1.

This worked fine on 0.12.1.
After reverting the servers/clients back to 0.12.1, the behaviour returned to normal: you stop an alloc and Nomad starts 1 new alloc.

Labels: stage/accepted, theme/scheduling, type/bug

Most helpful comment

Thank you very much for the report. I have confirmed the problem and pushed a fix that rolls back to the old canary behavior to avoid this regression. It seems to disproportionately affect single-alloc service jobs with canary deployments.

All 3 comments

Hi @roman-vynar! Thanks for reaching out! I'm investigating this but am sadly unable to reproduce it - can you please provide more detailed instructions along with sample output and logs?

Here is my attempt at reproduction - note that it only has a single running alloc at the end.

Script

mars-2:aa notnoop$ nomad job init --short
Example job file written to example.nomad
mars-2:aa notnoop$ nomad job run ./example.nomad
==> Monitoring evaluation "dd48655b"
    Evaluation triggered by job "example"
    Allocation "1c3f8de9" created: node "ce7fbaff", group "cache"
    Evaluation within deployment: "5237c8eb"
    Allocation "1c3f8de9" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "dd48655b" finished with status "complete"
mars-2:aa notnoop$ nomad job status
ID       Type     Priority  Status   Submit Date
example  service  50        running  2020-09-10T09:18:13-04:00
mars-2:aa notnoop$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2020-09-10T09:18:13-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       0         0

Latest Deployment
ID          = 5237c8eb
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        1       1        0          2020-09-10T09:28:23-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
1c3f8de9  ce7fbaff  cache       0        run      running  12s ago  1s ago
mars-2:aa notnoop$ nomad alloc stop 1c3f8de9
==> Monitoring evaluation "141f5057"
    Evaluation triggered by job "example"
    Allocation "c08b66d5" created: node "ce7fbaff", group "cache"
    Evaluation within deployment: "5237c8eb"
    Allocation "c08b66d5" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "141f5057" finished with status "complete"
mars-2:aa notnoop$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2020-09-10T09:18:13-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       1         0

Latest Deployment
ID          = 5237c8eb
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        1       1        0          2020-09-10T09:28:23-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
c08b66d5  ce7fbaff  cache       0        run      running   9s ago   8s ago
1c3f8de9  ce7fbaff  cache       0        stop     complete  34s ago  8s ago

Back on 0.12.4, it is consistently reproducible:

$ nomad node status -verbose
ID                                    DC     Name              Class   Address    Version  Drain  Eligibility  Status
d3e83667-d812-69f8-b279-4352e115aafd  roman  nomad-10-0-7-219  <none>  127.0.0.1  0.12.4   false  eligible     ready
866b33a1-6826-3ac4-19a9-24ac4413b9c5  roman  nomad-10-0-7-165  <none>  127.0.0.1  0.12.4   false  eligible     ready
c0b8877e-6925-ae7b-bb19-103594acce97  roman  nomad-10-0-7-233  <none>  127.0.0.1  0.12.4   false  eligible     ready
991c40ba-40d9-73bc-f642-37562934002d  roman  nomad-10-0-7-171  <none>  127.0.0.1  0.12.4   false  eligible     ready
c4db3ccb-e9c3-f451-2812-f844935cbf98  roman  nomad-10-0-7-87   <none>  127.0.0.1  0.12.4   false  eligible     ready
$ nomad job status ax-man
ID            = ax-man
Name          = ax-man
Submit Date   = 2020-09-10T16:07:59+03:00
Type          = service
Priority      = 50
Datacenters   = roman
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
ax-man      0       0         1        0       16        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
9bf9521e  c0b8877e  ax-man      59       run      running  33m28s ago  33m13s ago
$ nomad alloc stop 9bf9521e
==> Monitoring evaluation "bc8db7f9"
    Evaluation triggered by job "ax-man"
    Allocation "1769c470" created: node "c0b8877e", group "ax-man"
    Allocation "d6c7aecc" created: node "c4db3ccb", group "ax-man"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "bc8db7f9" finished with status "complete"
$ nomad job status ax-man
ID            = ax-man
Name          = ax-man
Submit Date   = 2020-09-10T16:07:59+03:00
Type          = service
Priority      = 50
Datacenters   = roman
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
ax-man      0       0         2        0       17        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
1769c470  c0b8877e  ax-man      59       run      running   13s ago     13s ago
d6c7aecc  c4db3ccb  ax-man      59       run      running   13s ago     13s ago
9bf9521e  c0b8877e  ax-man      59       stop     complete  34m24s ago  13s ago

Notice the following messages from the stop command:

    Allocation "1769c470" created: node "c0b8877e", group "ax-man"
    Allocation "d6c7aecc" created: node "c4db3ccb", group "ax-man"

The applicable parts of the job definition:

  update {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
    progress_deadline = "10m"
    auto_revert = false
    auto_promote = true
    canary = 1
  }

    count = 1

    restart {
      attempts = 0
      interval = "30m"
      delay = "30s"
      mode = "fail"
    }

    reschedule {
      unlimited      = true
      delay          = "15s"
      delay_function = "exponential"
      max_delay      = "5m"
    }

      driver = "docker"

5 minutes later, both new allocations are still running:

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
ax-man      0       0         2        0       17        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
1769c470  c0b8877e  ax-man      59       run      running   5m14s ago   5m1s ago
d6c7aecc  c4db3ccb  ax-man      59       run      running   5m14s ago   5m2s ago
9bf9521e  c0b8877e  ax-man      59       stop     complete  39m25s ago  5m14s ago

Please let me know if you need anything else.
Thanks for the quick response!

Thank you very much for the report. I have confirmed the problem and pushed a fix that rolls back to the old canary behavior to avoid this regression. It seems to disproportionately affect single-alloc service jobs with canary deployments.
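
Until a release containing that fix is available, one possible interim workaround - my own assumption, not something confirmed by the maintainers - is to drop the canary settings from the update stanza of the affected count = 1 groups, since the regression appears tied to canary deployments:

  # Hypothetical interim workaround (assumption, not confirmed upstream):
  # remove canary/auto_promote so the group uses a plain rolling update.
  update {
    max_parallel      = 1
    health_check      = "checks"
    min_healthy_time  = "10s"
    healthy_deadline  = "5m"
    progress_deadline = "10m"
    auto_revert       = false
    # auto_promote = true   # removed together with canary
    # canary       = 1      # canary + count = 1 appears to trigger the extra alloc
  }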

