Nomad v0.8.3 (c85483da3471f4bd3a7c3de112e95f551071769f)
We have some deployments left over from the old, buggy Nomad v0.6.0 development builds. We have now decided to fail these deployments, because we periodically see the following in our server logs:
2018/05/11 13:12:20.665488 [ERR] nomad.deployments_watcher: failed to track deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453": deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453" references unknown job "S3apiCache"
2018/05/11 13:12:20.665512 [ERR] nomad.deployments_watcher: failed to track deployment "358f9dda-9feb-0f66-05e6-647f9e157747": deployment "358f9dda-9feb-0f66-05e6-647f9e157747" references unknown job "tdagent-local"
2018/05/11 13:12:20.665536 [ERR] nomad.deployments_watcher: failed to track deployment "503ffcb2-ca8e-5978-4316-6ef8d36c38a3": deployment "503ffcb2-ca8e-5978-4316-6ef8d36c38a3" references unknown job "ceph-zabbix"
2018/05/11 13:12:20.665559 [ERR] nomad.deployments_watcher: failed to track deployment "64c03451-f546-18b3-429d-f236b66478cc": deployment "64c03451-f546-18b3-429d-f236b66478cc" references unknown job "tdagent-local"
2018/05/11 13:12:20.665578 [ERR] nomad.deployments_watcher: failed to track deployment "73a0e737-47a2-df97-9899-6754a4697456": deployment "73a0e737-47a2-df97-9899-6754a4697456" references unknown job "webphp"
2018/05/11 13:12:20.665599 [ERR] nomad.deployments_watcher: failed to track deployment "785947d4-045b-0827-8180-eec01f0e0de2": deployment "785947d4-045b-0827-8180-eec01f0e0de2" references unknown job "S3apiCache"
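A small shell function can pull the orphaned deployment/job pairs out of log lines like these. This is a sketch that assumes the message format shown above; the function name orphaned_deployments is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read Nomad server log lines on stdin and print each
# unique "<deployment ID> <job ID>" pair found in
# 'deployment "..." references unknown job "..."' error messages.
orphaned_deployments() {
  # The greedy ".*" skips past the first 'deployment "ID":' occurrence, so the
  # capture groups land on the ID right before "references unknown job".
  sed -n 's/.*deployment "\([0-9a-f-]*\)" references unknown job "\([^"]*\)".*/\1 \2/p' |
    sort -u
}
```

Piping the server log into orphaned_deployments prints one line per broken deployment, with repeated log entries collapsed. The same pattern also matches the rpc error returned by `nomad deployment fail`.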
All of these deployments are still shown as running. For example, for deployment 354218d0-1f40-aa7d-6f9a-841a01e4d453 (short ID 354218d0):
$ nomad deployment list | grep '354218d0'
354218d0 S3apiCache 53 running Deployment is running
Since the S3apiCache job doesn't actually exist, we tried to fail this deployment manually, and got the same error that we see in the Nomad server logs:
$ nomad deployment fail 354218d0
Error failing deployment: Unexpected response code: 500 (rpc error: deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453" references unknown job "S3apiCache")
Because these deployments are left over from buggy versions of Nomad, I don't think this is a bug. Still, it looks strange that Nomad neither cleans up deployments of nonexistent jobs nor allows them to be cleaned up manually.
After some investigation we found a workaround: we create fake jobs with the same names as the jobs referenced by the broken deployments; then we can fail those deployments and clear them with GC.
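The workaround above can be sketched as a shell script. This is a sketch under stated assumptions, not the exact procedure used here. The helper names placeholder_job and cleanup_deployment are hypothetical. The placeholder spec (datacenter dc1, a batch job with count = 0 and a raw_exec no-op task) is an assumption chosen so that nothing gets scheduled. Stopping and purging the placeholder before GC is an extra step not described above. The nomad subcommands themselves are standard CLI commands, but check their flags against your Nomad version.

```shell
#!/bin/sh
# Dry run by default: commands are only printed. Set RUN to an empty string
# (e.g. RUN= ./cleanup.sh) to actually execute them.
: "${RUN=echo}"

# Hypothetical helper: print a minimal placeholder job spec whose job ID ($1)
# matches the unknown job an orphaned deployment references.
placeholder_job() {
  cat <<EOF
job "$1" {
  datacenters = ["dc1"] # assumption: use one of your datacenters
  type        = "batch"

  group "placeholder" {
    count = 0 # assumption: nothing is ever placed

    task "noop" {
      driver = "raw_exec"

      config {
        command = "/bin/true"
      }
    }
  }
}
EOF
}

# Hypothetical helper: apply the workaround to one orphaned deployment.
# $1 = deployment ID, $2 = job ID the deployment references.
cleanup_deployment() {
  dep_id=$1
  job_id=$2
  spec=$(mktemp) || return 1
  placeholder_job "$job_id" > "$spec"

  $RUN nomad job run -detach "$spec"   # register a job with the missing ID
  $RUN nomad deployment fail "$dep_id" # should no longer fail with "unknown job"
  $RUN nomad job stop -purge "$job_id" # assumption: remove the placeholder again
  $RUN nomad system gc                 # collect the now-terminal deployment

  rm -f "$spec"
}
```

With RUN left unset, cleanup_deployment 354218d0-1f40-aa7d-6f9a-841a01e4d453 S3apiCache only prints the four nomad commands it would run, so they can be reviewed before anything touches the cluster.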
@tantra35 The PR I just put up should clean them up when upgrading to newer versions of Nomad. I don't want to add an endpoint, since this case should never happen: it arose from a bug that has since been fixed.