Nomad: Cannot access jobs after stopping a periodic job

Created on 13 Nov 2018 · 6 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.8.6

Operating system and Environment details

Ubuntu 16.04.5 LTS

Issue

When a Nomad periodic job is submitted and then stopped via the Nomad UI, after some time I am no longer able to access any job details in the UI. The UI itself still loads, but clicking on a job does not open it.
When I checked the browser console, I saw the following error:

GET http://test.nomadserver:4646/v1/job/3559326d29233f205921eeacb6c29bb6 404 (Not Found)

Reproduction steps

Submit a periodic job and stop it. The issue can take some time to appear.

Nomad Server logs (if appropriate)

NA

Nomad Client logs (if appropriate)

NA

Job file (if appropriate)

Test job similar to what we use.

job "job" {
  datacenters = ["test-dc"]
  type        = "batch"

  periodic {
    cron             = "*/1 * * * * *"
    prohibit_overlap = true
  }

  group "monitor" {
    count = 1

    task "monitor" {
      driver = "docker"

      config {
        image = "test/image:latest"
      }

      resources {
        cpu    = 200
        memory = 30
      }
    }
  }
}

A workaround I found is to submit the job again and then stop it. After that I can access the jobs again.

theme/ui type/bug

All 6 comments

Hi,
Job details could be getting cleaned up by the periodic GC.
Can you try tweaking the GC timeouts (on the agents) and see if that helps?

BTW */1 would be the same as just *, right?

Hi! I can confirm that we are experiencing the same issues.
After some investigation, I have found the flow to reproduce:

  1. Submit a new batch job (see example here)
  2. Wait for a new child job to launch (almost immediately in the example above)
  3. Stop the parent job
  4. The child job is now running parentless (it has to stay running / pending / etc, but it cannot exit yet, otherwise the issue won't reproduce)
  5. Wait for the GC to run (or execute it manually by PUT request to /v1/system/gc)

After step 5, where garbage collection has purged the parent job, the issue seems to reproduce.
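Steps 3 and 5 can also be driven over the HTTP API instead of the UI. The sketch below is a minimal illustration, not something taken from the thread: it assumes the server address from the original report, the parent job ID `job` from the example job file, and a cluster without ACLs (a token header would be needed otherwise). The helper names `build_repro_requests` and `run_repro` are made up for this example.

```python
import urllib.request

# Assumptions: address from the original report, job ID from the example job file.
NOMAD_ADDR = "http://test.nomadserver:4646"
PARENT_JOB_ID = "job"


def build_repro_requests(addr: str = NOMAD_ADDR, job_id: str = PARENT_JOB_ID):
    """Build (but do not send) the requests for step 3 and step 5."""
    # Step 3: stop the periodic parent job.
    stop_parent = urllib.request.Request(f"{addr}/v1/job/{job_id}", method="DELETE")
    # Step 5: force a garbage collection instead of waiting for the periodic one.
    force_gc = urllib.request.Request(f"{addr}/v1/system/gc", method="PUT")
    return [stop_parent, force_gc]


def run_repro(addr: str = NOMAD_ADDR, job_id: str = PARENT_JOB_ID) -> None:
    """Send the requests in order. Only run this against a test cluster."""
    for req in build_repro_requests(addr, job_id):
        with urllib.request.urlopen(req) as resp:
            print(req.get_method(), req.full_url, resp.status)
```

For the issue to reproduce, a child job must already be running before `run_repro()` is called and must still be running when the GC request completes (step 4).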

This happens because the child is still running when Nomad UI queries for jobs (/v1/jobs).

The client then tries to get the parent job and, since it has been purged, receives a 404. For some reason it still saves the parent object in memory, without any data (null).

The JavaScript then fails when it tries to access that data in memory.
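The failure mode described here can be modelled in a few lines. This is only an illustrative sketch in Python, not the Nomad UI's actual code (which is JavaScript); every name in it (`fetch_job`, `load_parent_buggy`, `load_parent_safe`, `parent_name_unguarded`) is hypothetical. It shows a store that keeps an empty record after a 404, a reader that then fails on it, and one possible way to avoid caching the empty record.

```python
class NotFound(Exception):
    """Stands in for an HTTP 404 response."""


def fetch_job(job_id, server_jobs):
    # Hypothetical stand-in for GET /v1/job/<id>.
    if job_id not in server_jobs:
        raise NotFound(job_id)
    return server_jobs[job_id]


def load_parent_buggy(cache, job_id, server_jobs):
    # Mirrors the reported behaviour: the 404 is swallowed,
    # but an empty (None) record is still stored for the parent.
    try:
        cache[job_id] = fetch_job(job_id, server_jobs)
    except NotFound:
        cache[job_id] = None
    return cache[job_id]


def load_parent_safe(cache, job_id, server_jobs):
    # One possible alternative: drop the record when the server says it is gone.
    try:
        cache[job_id] = fetch_job(job_id, server_jobs)
    except NotFound:
        cache.pop(job_id, None)
        return None
    return cache[job_id]


def parent_name_unguarded(cache, job_id):
    # Fails with TypeError when the cached record is None.
    return cache[job_id]["Name"]
```

With `load_parent_buggy`, a later call to `parent_name_unguarded` raises on the empty record, which corresponds to the script error that breaks navigation in the UI. With `load_parent_safe`, the caller can check for a missing parent before reading any fields.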

Thanks for the detailed steps, @losnir, it's the first time I'm actually able to reproduce, now hopefully I can figure out how to fix it.

Are you able to try this out with Nomad 0.9.2 or later? I believe it has been fixed since that version.

@backspace I can confirm that after upgrading to v0.9.5 the issue resolved.

However it looks like the upgrade introduced new issues with batch / periodic jobs (they disappear abruptly).
Will open another issue on the matter if needed.

Thanks!

Thanks for letting us know that it's fixed. Please do open a new issue with reproduction steps for this new bug and we can look into it. I appreciate your diligence!
