Nomad: [0.8.4] Nomad UI hanging on job detail viewing

Created on 10 Jul 2019 · 15 comments · Source: hashicorp/nomad

Re-opening previously closed issue. This issue is not resolved and definitely seems like a Nomad bug.

Nomad version

0.8.4

Operating system and Environment details

Ubuntu 16.04.4 LTS

Issue

In the Nomad UI, when you click on a job or client to view its allocations, the UI hangs on the loading state instead of showing the actual allocations.

The CLI still worked: I was able to run nomad status to view the jobs and their underlying allocations.
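(For context, a minimal sketch of the CLI checks that still worked while the UI hung; the job name below is a placeholder.)

nomad status                # list all jobs, including periodic children
nomad status JOB_NAME       # show a specific job and its allocations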

Reproduction steps

This happens when a periodic job launch is still in the allocations list but the parent job has already aged out (in our case the periodic child job is just stuck in pending).
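(An illustrative way to spot leftover periodic children from the CLI, assuming they follow the usual parent/periodic-<timestamp> naming convention:)

nomad status | grep periodic-    # periodic child launches whose parent may already be gone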

Other logs

A 404 appears in the web browser console; details are in the screenshot at https://user-images.githubusercontent.com/6162849/46262074-e7078f80-c52e-11e8-9102-44d112cf3e9e.png, as per https://github.com/hashicorp/nomad/issues/4464. Note that this screenshot is from another user, but it looks like the same problem: the console records a single job returning a 404 while the /jobs splash page loads.

Note that browsing directly to a valid job still works (i.e. changing the URL to /jobs/<JOBNAME> manually), but the job that Nomad is confused about, the one returning the 404 on the main /jobs page, seems to prevent clicking into any job from loading.
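(To narrow down which job the UI trips over, one can query the same HTTP API endpoints the UI calls; a sketch assuming a local agent on the default port, with JOB_NAME as a placeholder.)

curl -s http://127.0.0.1:4646/v1/jobs              # the list the /jobs splash page loads
curl -si http://127.0.0.1:4646/v1/job/JOB_NAME     # returns a 404 for the stale job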

Labels: theme/ui, type/enhancement


All 15 comments

Hello, thanks for the report. Are you able to try this out with Nomad 0.9.2 or later? It looks to me like this problem was fixed by UI updates in that version.

Will try it out - I'm waiting on 0.9.4 to start upgrading our fleet.

I have reproduced this issue using Nomad 0.11.0.

(Attachments: NOMAD-UI-STUCK-WHEN-PERIODIC-JOB-EXPIRED, NOMAD-UI-STUCK-WHEN-PERIODIC-JOB-EXPIRED-JOBS)

2020-04-20T12:26:35.596Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\ProgramData\Kryon\nomad\server\plugins
2020-04-20T12:26:36.050Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2020-04-20T12:26:36.050Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2020-04-20T12:26:36.050Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2020-04-20T12:26:36.050Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2020-04-20T12:26:36.051Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2020-04-20T12:26:39.164Z [INFO] nomad.raft: initial configuration: index=9 servers="[{Suffrage:Voter ID:192.168.15.123:4647 Address:192.168.15.123:4647} {Suffrage:Voter ID:192.168.12.250:4647 Address:192.168.12.250:4647}]"
2020-04-20T12:26:39.165Z [INFO] nomad.raft: entering follower state: follower="Node at 192.168.15.123:4647 [Follower]" leader=
2020-04-20T12:26:39.309Z [INFO] nomad: serf: EventMemberJoin: Kryon15-123.global 192.168.15.123
2020-04-20T12:26:39.309Z [INFO] nomad: serf: Attempting re-join to previously known node: Kryon12-250.global: 192.168.12.250:4648
2020-04-20T12:26:39.314Z [INFO] nomad: starting scheduling worker(s): num_workers=4 schedulers=[batch, system, service, _core]
2020-04-20T12:26:39.462Z [INFO] nomad: adding server: server="Kryon15-123.global (Addr: 192.168.15.123:4647) (DC: kryon-dev)"
2020-04-20T12:26:39.678Z [INFO] nomad: serf: EventMemberJoin: Kryon12-250.global 192.168.12.250
2020-04-20T12:26:39.678Z [WARN] nomad: memberlist: Refuting a suspect message (from: Kryon15-123.global)
2020-04-20T12:26:39.679Z [INFO] nomad: adding server: server="Kryon12-250.global (Addr: 192.168.12.250:4647) (DC: kryon-dev)"
2020-04-20T12:26:39.679Z [INFO] nomad: serf: Re-joined to previously known node: Kryon12-250.global: 192.168.12.250:4648
2020-04-20T12:26:39.819Z [WARN] nomad.raft: failed to get previous log: previous-index=3016 last-index=3015 error="log not found"
2020-04-20T12:50:25.630Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL support disabled" code=400
2020-04-20T12:50:25.658Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpoint" code=501
2020-04-20T12:50:26.177Z [ERROR] http: request failed: method=GET path=/v1/job/redis-config-updater error="job not found" code=404
2020-04-20T12:51:24.844Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL support disabled" code=400
2020-04-20T12:51:24.874Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpoint" code=501
2020-04-20T12:51:25.293Z [ERROR] http: request failed: method=GET path=/v1/job/redis-config-updater error="job not found" code=404
2020-04-20T12:54:48.182Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL support disabled" code=400
2020-04-20T12:54:48.219Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpoint" code=501
2020-04-20T12:54:48.656Z [ERROR] http: request failed: method=GET path=/v1/job/redis-config-updater error="job not found" code=404
(Attachment: NOMAD-PERIODIC-JOBS-ISSUE-VERSION)

I managed to overcome the problem: I stopped and purged the periodic job by name (redis-config-updater) using nomad stop -purge.
It turned out the child jobs were still there: dead, but with a periodic suffix in their names, i.e. all the periodic batch launches that were visible in nomad job status.

Below is a script I ran to remove all dead periodic jobs; afterwards the UI was responsive again.

I am not sure whether the UI was stuck because of the sheer number of jobs, but it certainly looks like it. In any case, after the purge the UI became responsive again.

@echo off
setlocal EnableExtensions EnableDelayedExpansion
cls

rem Walk the columns of "nomad status": token 1 = job ID, token 2 = type, token 4 = status
for /f "tokens=1,2,4" %%a in ('..\nomad.exe status --address http://192.168.15.123:4646') do (
  rem Only act on jobs whose ID contains the periodic job name
  echo %%a|find "redis-config-updater" >nul
  rem Purge the job if it is a batch job and already dead
  if errorlevel 1 (echo notfound) else (if "%%b"=="batch" if "%%c"=="dead" ..\nomad.exe stop -purge --address http://192.168.15.123:4646 %%a)

)

I can confirm we are seeing this issue on 0.10.4. Also seems related to stopped periodic jobs once they are garbage collected.

I can confirm we just hit this issue when we upgraded Nomad to 0.11.3.

The UI was hanging, and hitting the /gc endpoint didn't fix it.

Purging pending batch jobs fixed it for us, by running the following (⚠️ be careful and make sure you understand the impact beforehand ⚠️):

nomad job status | grep -i pending | grep -i batch | awk '{print $1}' | xargs -I% -P2 sh -c '{ nomad stop -purge %; }'

UI works again.
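(A minimal preview sketch, not from the original comment: swapping the destructive step for an echo shows which jobs would be purged before actually running the one-liner above.)

nomad job status | grep -i pending | grep -i batch | awk '{print $1}' | xargs -I% echo nomad stop -purge %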

I looked into this a while back and ran into deep issues within Ember itself that have since been fixed in newer versions.

I'll be revisiting this after we finish our UI tech debt work that includes an Ember upgrade: #7834

We've recently run into the same issue. It happened on one of our clusters that's still running an ancient version of Nomad: v0.8.3. That cluster has been running for some years now, and this is the first time we've encountered the issue (it happened after a significant network outage). We're currently migrating this cluster to the latest Nomad release (actually we're rebuilding it), but apparently that doesn't matter, since the problem is also present in current releases.

Luckily the workaround posted by @scalp42 works perfectly. So for anyone else bumping into this issue: run the bash one-liner posted by @scalp42 and the Nomad UI should work again.

Hey folks, I'm seeing this issue several times a day. The one-liner from @scalp42 works sometimes, but the issue also happens with dead batch jobs. Forcing a nomad system gc solved the issue, though.
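(For reference, a sketch of triggering that garbage collection from the CLI or via the HTTP API, assuming a local agent on the default port:)

nomad system gc
curl -s -X PUT http://127.0.0.1:4646/v1/system/gc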

I also ran into this today; seems to be related to some pending parameterized jobs. @scalp42 's fix worked for me.

Yup, I have lately been having the same issue with parameterized jobs. Opening multiple tabs to jobs reproduces it: the UI hangs and then the web server becomes completely clogged (Chrome reports ERR_EMPTY_RESPONSE) until, after some time, it comes back on its own.

Hi everyone!

Thank you for your patience with this bug, and especially thank you @scalp42 for the one-liner workaround. I believe this is now fixed in v0.12.1. See the explanation of the solution here.

Given the number of reports this bug has gotten, I don't want to close this issue until there has been some community confirmation. Please try out 0.12.1 and see if this fixes the issue for you!

I'm going to deploy Nomad 0.12.1 across the infrastructure soon and I'll report back.

Thanks @DingoEatingFuzz 😅

I guess this issue can be closed now (I haven't seen it anymore).

Confirming, can be closed.

