Nomad: Can't manually fail deployments left over from buggy versions of Nomad

Created on 11 May 2018 · 2 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.8.3 (c85483da3471f4bd3a7c3de112e95f551071769f)

Issue

We have some deployments left over from the days of Nomad v0.6.0 development and its bugs. We now want to fail these deployments, because we periodically see the following in our server logs:

2018/05/11 13:12:20.665488 [ERR] nomad.deployments_watcher: failed to track deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453": deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453" references unknown job "S3apiCache"
2018/05/11 13:12:20.665512 [ERR] nomad.deployments_watcher: failed to track deployment "358f9dda-9feb-0f66-05e6-647f9e157747": deployment "358f9dda-9feb-0f66-05e6-647f9e157747" references unknown job "tdagent-local"
2018/05/11 13:12:20.665536 [ERR] nomad.deployments_watcher: failed to track deployment "503ffcb2-ca8e-5978-4316-6ef8d36c38a3": deployment "503ffcb2-ca8e-5978-4316-6ef8d36c38a3" references unknown job "ceph-zabbix"
2018/05/11 13:12:20.665559 [ERR] nomad.deployments_watcher: failed to track deployment "64c03451-f546-18b3-429d-f236b66478cc": deployment "64c03451-f546-18b3-429d-f236b66478cc" references unknown job "tdagent-local"
2018/05/11 13:12:20.665578 [ERR] nomad.deployments_watcher: failed to track deployment "73a0e737-47a2-df97-9899-6754a4697456": deployment "73a0e737-47a2-df97-9899-6754a4697456" references unknown job "webphp"
2018/05/11 13:12:20.665599 [ERR] nomad.deployments_watcher: failed to track deployment "785947d4-045b-0827-8180-eec01f0e0de2": deployment "785947d4-045b-0827-8180-eec01f0e0de2" references unknown job "S3apiCache"
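
To see which deployments are affected and which missing job each one points at, the error lines above can be grouped by job name. The following is a minimal sketch, assuming the log format is exactly the one shown above; the regular expression and the function name `orphaned_deployments` are illustrative and not part of Nomad.

```python
import re
from collections import defaultdict

# Matches the deployments_watcher error shown above and captures the
# full deployment ID and the name of the job that no longer exists.
ORPHAN_RE = re.compile(
    r'failed to track deployment "(?P<deployment>[0-9a-f-]{36})": '
    r'deployment "[0-9a-f-]{36}" references unknown job "(?P<job>[^"]+)"'
)


def orphaned_deployments(log_lines):
    """Return {job name: sorted list of deployment IDs} found in the log lines."""
    by_job = defaultdict(set)
    for line in log_lines:
        match = ORPHAN_RE.search(line)
        if match:
            by_job[match.group("job")].add(match.group("deployment"))
    return {job: sorted(ids) for job, ids in by_job.items()}
```

Passing the lines of a server log file to `orphaned_deployments` yields one entry per missing job, which is a convenient checklist for the cleanup.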

All of these deployments show as running. For example, deployment 354218d0-1f40-aa7d-6f9a-841a01e4d453 (short ID 354218d0):

$ nomad deployment list | grep '354218d0'
354218d0  S3apiCache                         53           running     Deployment is running

Since the S3apiCache job doesn't actually exist, we tried to fail this deployment manually and got the same error that we see in the Nomad server logs:

$ nomad deployment fail 354218d0
Error failing deployment: Unexpected response code: 500 (rpc error: deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453" references unknown job "S3apiCache")

Because these deployments were left behind by buggy versions of Nomad, I don't think this is a bug, but it seems strange that Nomad doesn't clean up deployments of nonexistent jobs and doesn't allow manual cleanup.

All 2 comments

After some investigation we found a solution for this. We created fake jobs with the same names as in the buggy deployments; then we could fail the deployments and clear them with GC.
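
The workaround described in this comment can be sketched as an ordered list of commands per deployment. This is only a sketch under assumptions: the placeholder job file (`S3apiCache.nomad` here) is hypothetical and must define a job with the same name as the missing one, the agent address is assumed to be the default `http://127.0.0.1:4646`, and forcing GC through the `/v1/system/gc` HTTP endpoint is one possible way to trigger the collection mentioned above. The comment does not say what the fake jobs contained or what happened to them afterwards.

```python
import shlex


def cleanup_commands(deployment_id, placeholder_job_file,
                     nomad_addr="http://127.0.0.1:4646"):
    """Return the commands for cleaning up one orphaned deployment, in order."""
    return [
        # 1. Register a placeholder job named like the missing job
        #    (the job file is hypothetical and has to be written by hand).
        ["nomad", "job", "run", placeholder_job_file],
        # 2. The deployment's job now resolves, so failing it should succeed.
        ["nomad", "deployment", "fail", deployment_id],
        # 3. Ask the servers to run garbage collection.
        ["curl", "-X", "PUT", f"{nomad_addr}/v1/system/gc"],
    ]


# Dry run: print the commands instead of executing them.
for cmd in cleanup_commands("354218d0", "S3apiCache.nomad"):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands rather than running them keeps the sketch safe to try; each line can be reviewed and executed by hand against the cluster.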

@tantra35 The PR I just put up should clean them up when upgrading to newer versions of Nomad. I don't want to add an endpoint, since this case should never happen: it arose from a bug that has since been fixed.
