After moving our masters from 0.7.0-rc1 to 0.7.0, I found this issue:
It happens with a job with count > 1, of type service, using the Docker driver.
In order to finish the deployment, I have to _promote_ the deployment via Jippi-ui. The API returns an error, but the deployment then continues and finishes successfully.
We had approximately 15 jobs stuck waiting for their deployments to finish with the above issue. After I rolled the cluster back to the 0.7.0-rc1 binaries, all the jobs immediately continued deploying (and finished) once the leader was on rc1.
When trying to replicate this on a local -dev environment with one local instance, 0.7.0 didn't have any issues, so it must be something related to our cluster.
If it helps at all, our agents were running _0.7.0-beta1_ at the time. Could it be related?
Hi, same issue here.
Master upgraded to 0.7.0 from 0.7.0-rc1; agents upgraded from 0.7.0-beta1 to 0.7.0.
With count = 4, I ran a nomad run; only 1 alloc was marked complete and 1 new alloc started.
I had to run nomad run 4 times to get all 4 new allocs started.
I confirm that the nomad deployment promote workaround works despite the error (Error promoting deployment: Unexpected response code: 500 (rpc error: no canaries to promote)).
We'll definitely investigate and get this fixed ASAP. Could you include any relevant details you can? Job files, or at least the update stanzas; logs, or at least any deployment-related lines or lines relating to the allocations involved.
Thanks for the reports! We're eager to get to the bottom of this!
@discobean @commarla I would also love to see what the alloc-status output or the allocation via the API looked like when it was stuck. We emit events that say why the alloc is marked as healthy or unhealthy.
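For reference, a rough sketch of how to pull that information, assuming a local agent on the default address and placeholder IDs:

# CLI: show the allocation, including task events and deployment health
nomad alloc-status -verbose <alloc-id>

# API: fetch the raw allocation JSON (the full allocation UUID is needed here);
# the TaskStates events and the DeploymentStatus block are the parts that say
# why the alloc was marked healthy or unhealthy
curl -s http://127.0.0.1:4646/v1/allocation/<full-alloc-uuid>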
Here is my update stanza:
update {
  stagger          = "30s"
  health_check     = "checks"
  max_parallel     = 1
  min_healthy_time = "10s"
  healthy_deadline = "5m"
  auto_revert      = true
  canary           = 0
}
Same issue with just:
update {
  stagger      = "30s"
  max_parallel = 1
}
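For context, a minimal sketch of the kind of job file involved (service type, Docker driver, count > 1, reduced update stanza); the datacenter, image, and resources here are placeholders rather than the real values:

job "fakeservice" {
  datacenters = ["dc1"] # placeholder
  type        = "service"

  update {
    stagger      = "30s"
    max_parallel = 1
  }

  group "fakeservice" {
    count = 5

    task "fakeservice" {
      driver = "docker"

      config {
        image = "nginx:1.13" # placeholder image
      }

      # the fake env var added below to force a create/destroy update
      env {
        TEST = "LOL"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}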
I have added a fake env var to one of my services with count = 5.
❯ nomad plan fakeservice.hcl
+/- Job: "service"
+/- Task Group: "fakeservice" (1 create/destroy update, 4 ignore)
  +/- Task: "fakeservice" (forces create/destroy update)
    + Env[TEST]: "LOL"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 29474
To submit the job with version verification run:

nomad run -check-index 29474 fakeservice.hcl
When I run nomad run fakeservice.hcl, only one alloc is started.
Allocations
ID        Node ID   Task Group   Version  Desired  Status    Created At
02a32350  1d42ea9e  fakeservice  7        run      running   11/08/17 21:43:43 CET
a181b442  c4431904  fakeservice  6        run      running   11/08/17 17:13:04 CET
079220ed  f86b1151  fakeservice  6        run      running   11/08/17 17:12:32 CET
85de9fd9  79428961  fakeservice  6        run      running   11/08/17 17:11:55 CET
074559ca  d7f7710e  fakeservice  6        run      running   11/08/17 17:11:23 CET
e6a22153  1d42ea9e  fakeservice  6        stop     complete  11/08/17 17:09:01 CET
A gist with the alloc status via the API: https://gist.github.com/commarla/57b1a7dae6cea5a4f6af665b742b1ab4
The deployment status:
❯ nomad deployment status 15e63088
ID          = 15e63088
Job ID      = fakeservice
Job Version = 7
Status      = running
Description = Deployment is running
Deployed
Task Group   Desired  Placed  Healthy  Unhealthy
fakeservice  5        1       1        0
Then I run:
nomad deployment promote 15e63088
Error promoting deployment: Unexpected response code: 500 (rpc error: no canaries to promote)
And the deployment continues:
nomad deployment status 15e63088
ID          = 15e63088
Job ID      = fakeservice
Job Version = 7
Status      = successful
Description = Deployment completed successfully
Deployed
Task Group   Desired  Placed  Healthy  Unhealthy
fakeservice  5        5       5        0
Thanks!
I have the same update stanza as @commarla; the only difference is that we use the HTTP API for deployments.
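Concretely, the HTTP calls look roughly like this - a sketch assuming the default agent address and a placeholder deployment UUID; the promote endpoint is the API equivalent of the nomad deployment promote workaround above:

# inspect the deployment
curl -s http://127.0.0.1:4646/v1/deployment/<deployment-uuid>

# promote all task groups (equivalent of: nomad deployment promote <id>)
curl -s -X POST \
  -d '{"DeploymentID": "<deployment-uuid>", "All": true}' \
  http://127.0.0.1:4646/v1/deployment/promote/<deployment-uuid>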
@discobean @commarla Does this happen on a fresh 0.7 cluster as well, or only via the rc1 upgrade path?
@dadgar Yes, it does. We started a new 3-node cluster on 0.7.
Our clients are on 0.7 now, but most of our jobs were started when our clients were on the beta. We drained the old nodes to restart the jobs on the new ones.
@dadgar It was an upgraded cluster.
I also tried stopping the job, garbage-collecting, and removing it, then re-adding it, but the result was the same.
@discobean @commarla Would it be possible to see server logs from the misbehaving 0.7.0 cluster?
Hi @dadgar
I have no logs during the nomad run, only these during our promote workaround:
Nov 13 09:43:54 admin-10-32-151-102 nomad[30298]: 2017/11/13 09:43:54 [ERR] nomad.fsm: UpsertDeploymentPromotion failed: no canaries to promote
Nov 13 09:43:54 admin-10-32-151-102 nomad[30298]: 2017/11/13 09:43:54.848116 [ERR] http: Request /v1/deployment/promote/233039b3-f6ba-130c-c1bb-135b46b1a318, error: no canaries to promote
Today we tried upgrading again, this time going one release at a time from rc1 -> rc2 -> rc3 -> 0.7.0, and this time we couldn't replicate the issue.
We do shut down the agents each night, so maybe the problem will appear again tomorrow - I'll be sure to update.
@discobean @commarla I would love some reproduction steps - to be clear, a setup that reproduces this starting from scratch, with clear steps.
I spent quite a bit of time trying to reproduce this today and I couldn't. To reproduce I tried:
Further, I tried the above with the whole cluster on 0.7.0-rc1 and also on beta1. I also tried causing various leader elections during the process. Nothing I did could reproduce what you all are seeing.
I also played around with it in a fresh 0.7.0 cluster (no upgrades) and it worked properly as well.
@dadgar, I can no longer reproduce it either. Even right after the cluster came up this morning, it's all working as expected - I have no leads at all.
Okay, I am going to close this. Please bump the issue if it comes up again. If it does, please collect all the server logs as well as the allocation information for the misbehaving job, both before and after using the promote workaround.
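For anyone who does hit this again, a rough collection sketch, assuming the servers run under systemd and using placeholder IDs:

# allocation and deployment state, captured before and after the promote workaround
nomad alloc-status -verbose <alloc-id>
nomad deployment status <deployment-id>

# server logs from each server (systemd assumed; adjust for your init system)
journalctl -u nomad --since "1 hour ago" > nomad-server-$(hostname).log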