Nomad: Jobs and Allocations APIs show different results for jobs

Created on 18 Nov 2020 · 6 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.12.8 (b8501642d24a23c6788c267acfa2b38e50869cdf)

Seen also in 0.10.

Operating system and Environment details

Ubuntu 18.08

Issue

Allocations and Jobs APIs return different states for the same job.

Reproduction steps

  • Run a job.
  • Wait for it to be running.
  • Check its status using the allocation API.
  • Check its status using the job API.

The allocation and job APIs return different results for the same job:

$ curl -s 127.0.0.1:4646/v1/allocation/859e12ae-559d-72f4-8db9-0f0c8d2d088c | jq .Job > /tmp/alloc-job.json
$ curl -s 127.0.0.1:4646/v1/job/consul | jq . > /tmp/job.json
$ diff /tmp/job.json /tmp/alloc-job.json 
146c146
<   "Status": "running",
---
>   "Status": "pending",
148c148
<   "Stable": true,
---
>   "Stable": false,
152c152
<   "ModifyIndex": 17,
---
>   "ModifyIndex": 10,

Using the alloc subcommand shows everything as running:

ID                  = 859e12ae-559d-72f4-8db9-0f0c8d2d088c
Eval ID             = aab59136
Name                = consul.server[0]
Node ID             = c541f14e
Node Name           = voyager
Job ID              = consul
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 36m48s ago
Modified            = 36m34s ago
Deployment ID       = 36ec8bce
Deployment Health   = healthy

Task "consul-dev" is "running"
Task Resources
CPU          Memory          Disk     Addresses
315/100 MHz  99 MiB/300 MiB  300 MiB  

Task Events:
Started At     = 2020-11-18T11:53:05Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type                   Description
2020-11-18T12:53:05+01:00  Started                Task started by client
2020-11-18T12:53:01+01:00  Downloading Artifacts  Client is downloading artifacts
2020-11-18T12:53:01+01:00  Task Setup             Building Task Directory
2020-11-18T12:53:01+01:00  Received               Task received by client

Job file (if appropriate)

Reproduced with this job file:

job "consul" {
  datacenters = ["dc1"]

  group "server" {
    task "consul-dev" {
      driver = "raw_exec"

      config {
        command = "consul"
        args = ["agent", "-dev"]
      }

      artifact {
        source = "https://releases.hashicorp.com/consul/1.7.1/consul_1.7.1_linux_amd64.zip"
      }
    }
  }
}

Nomad Server logs (if appropriate)

    2020-11-18T13:32:00.502+0100 [DEBUG] http: request complete: method=GET path=/v1/job/consul duration=181.042µs
    2020-11-18T13:32:01.416+0100 [DEBUG] http: request complete: method=GET path=/v1/allocation/859e12ae-559d-72f4-8db9-0f0c8d2d088c duration=607.667µs

Labels: stage/accepted, stage/not-a-bug, theme/docs

All 6 comments

thanks, @jsoriano, i have reproduced this ~and will push a fix.~ see below

the problem appears to be that updateJobStabilityImpl upserts a modified copy of the job, but the allocation still has a pointer to the previous version of the job:
https://github.com/hashicorp/nomad/blob/v0.12.8/nomad/state/state_store.go#L3805-L3807

okay, @jsoriano, the Job field on the allocation is a copy of the job, created at allocation time, intended for use by the Nomad clients. it is only modified when there are changes to the job that allow for an in-place update of the allocation. this behavior is intentional, although it is not documented.

and even if it were documented on the API, it would be buried pretty deep in here
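
Since the allocation's Job field is a snapshot, one way to see which fields have drifted from the live job is to compare the two JSON documents directly. The following is a minimal sketch, assuming both documents have already been fetched and parsed (for example with the curl commands in the issue). The helper name `stale_job_fields` is hypothetical and not part of Nomad; the sample values are the ones from the diff above.

```python
def stale_job_fields(live_job, alloc_job):
    """Return {field: (live_value, snapshot_value)} for top-level fields that differ.

    live_job:  parsed JSON from /v1/job/<job_id>
    alloc_job: the "Job" object from /v1/allocation/<alloc_id>
    (stale_job_fields is a hypothetical helper, not a Nomad API.)
    """
    keys = set(live_job) | set(alloc_job)
    return {
        key: (live_job.get(key), alloc_job.get(key))
        for key in sorted(keys)
        if live_job.get(key) != alloc_job.get(key)
    }


# Values taken from the diff in the issue description.
live = {"ID": "consul", "Status": "running", "Stable": True, "ModifyIndex": 17}
snapshot = {"ID": "consul", "Status": "pending", "Stable": False, "ModifyIndex": 10}

for field, (current, copied) in stale_job_fields(live, snapshot).items():
    print(f"{field}: job API={current!r}, allocation copy={copied!r}")
```

Only top-level fields are compared here, which is enough to surface Status, Stable, and ModifyIndex.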

@cgbaker thanks for the clarifications! I think that a description of the Job field in the Allocation API would help, even if buried pretty deep :slightly_smiling_face:

So, if we want to check the status of a job we should rely on the Jobs API and not on the Allocations one, right?

By the way, which client does the ClientStatus of an allocation refer to? Could it also be used to check the status of a job?

Yes, the job endpoint will have the best information (especially if there are allocations from overlapping versions of the job). The deployment endpoint may be useful as well.
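
As a rough illustration of reading the status from the job endpoint rather than from an allocation, here is a minimal sketch. It assumes a local agent at 127.0.0.1:4646 as in the issue, and that `/v1/job/<job_id>/deployment` returns the job's most recent deployment (or null when there is none). The function names and the injectable `fetch` parameter are hypothetical conveniences, not part of any Nomad client library.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """GET a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def job_status(job_id, addr="http://127.0.0.1:4646", fetch=fetch_json):
    """Return the job's Status as reported by the job endpoint."""
    return fetch(f"{addr}/v1/job/{job_id}")["Status"]


def latest_deployment_status(job_id, addr="http://127.0.0.1:4646", fetch=fetch_json):
    """Return the Status of the job's most recent deployment.

    Assumption: the endpoint yields null (None) when the job has no deployment.
    """
    deployment = fetch(f"{addr}/v1/job/{job_id}/deployment")
    return deployment["Status"] if deployment else None
```

Calling `job_status("consul")` against the agent from the issue should report the live status, not the value frozen in the allocation's copy. The `fetch` parameter exists only so the functions can be exercised without a running agent.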

ClientStatus of the allocation refers to the actual status of the allocation on the client, as opposed to the DesiredStatus. ClientStatus will be one of the following:
https://github.com/hashicorp/nomad/blob/v0.12.8/nomad/structs/structs.go#L8499-L8503

You are correct; the Status of the job can be computed from the status of the allocations. In fact, that's how the job status is computed:
https://github.com/hashicorp/nomad/blob/v0.12.8/nomad/structs/structs.go#L3702-L3704
The only other thing to consider for job status is Stopped... a job will be "dead" if all of the allocations are terminal, but Stopped = true means that the operator set the desired state of the job to stopped, using the nomad job stop command.
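
The rule described above can be sketched roughly as follows. This is a simplified illustration, not Nomad's actual implementation. It assumes that "complete", "failed", and "lost" are the terminal client statuses, and that an allocation whose DesiredStatus is "stop" or "evict" also counts as terminal. It ignores evaluations and special job types, and it treats a job with no allocations as "pending". The function names are hypothetical.

```python
# Assumption: client statuses after which an allocation will not run again.
TERMINAL_CLIENT_STATUSES = {"complete", "failed", "lost"}
# Assumption: desired statuses meaning the scheduler wants the allocation gone.
TERMINAL_DESIRED_STATUSES = {"stop", "evict"}


def is_terminal(alloc):
    """True if the allocation is finished or is being asked to stop."""
    return (
        alloc.get("DesiredStatus") in TERMINAL_DESIRED_STATUSES
        or alloc.get("ClientStatus") in TERMINAL_CLIENT_STATUSES
    )


def derive_job_status(allocs):
    """Approximate a job status from its allocations (simplified sketch)."""
    if not allocs:
        return "pending"  # assumption: nothing has been placed yet
    if all(is_terminal(alloc) for alloc in allocs):
        return "dead"  # every allocation is terminal
    return "running"  # at least one allocation is still live


allocs = [
    {"ClientStatus": "running", "DesiredStatus": "run"},
    {"ClientStatus": "complete", "DesiredStatus": "stop"},
]
print(derive_job_status(allocs))
```

A derived "dead" does not say whether the operator stopped the job; for that, the job's own stop flag from the job endpoint still has to be checked.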

Thanks @cgbaker!
