0.9.5, 0.10.0
Amazon Linux 2
When doing canary blue/green deployments, the promote API incorrectly reports the failed count rather than the final healthy count, preventing promotions from occurring.
The screenshots below do not show the original deployment with the failed allocations, since I created a new deployment to try to mitigate the issue. However, even in this job version, the healthy count does not match what the promote API reports.
This issue happens 100% of the time that allocations fail on a canary deploy, even if they are rescheduled and end up healthy.
Also, an important detail: IT ALWAYS WORKS when there are no failed allocations. This only happens when allocations fail.


EDIT: Added Nomad 0.10.0 as version affected
I created another deployment which ended in failed allocations:


A few minutes later, I took a screenshot of the deployments view, which I figured might have more useful information. It looks like more allocations of the new version were added and more of them became healthy as a result, more than it needed, but the promotion still fails with 4/8 unhealthy according to the RPC.

@endocrimes, I built a quick little service to test this, and I'm finding interesting behaviors with deployments. It's a bit of a tangent but might be related to this issue: ultimately, on a deployment, if an allocation goes unhealthy, I expect Nomad to restart it, and it doesn't seem to be doing that. Is that expected? (Never mind, it looks like you need to define a check_restart.)
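For anyone following along, the restart-on-failed-check behavior comes from a check_restart block nested under the service check. A minimal sketch (the values here are illustrative, not the exact ones from my job):

check {
  type     = "http"
  path     = "/"
  interval = "10s"
  timeout  = "2s"

  check_restart {
    limit = 3     # restart the task after 3 consecutive failed checks
    grace = "30s" # grace period after task start before failures count
  }
}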
Here is randfail, which includes a Nomad jobfile to reproduce the behavior. This is running on Nomad 0.9.5 (server + agent).
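For context, the blue/green behavior in randfail.canary.hcl is the standard setup of a canary count equal to the group count with manual promotion; the update stanza looks roughly like this (exact values in the repo may differ):

update {
  canary       = 10    # one canary per allocation = full blue/green
  auto_revert  = false
  auto_promote = false # promote manually via the UI or the promote API
}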
To replicate the canary issue with Nomad 0.9.5:
1. nomad job run randfail.hcl and watch the job start, with some allocations failing since FAIL_PERCENTAGE is set at 50%. The allocations will be rescheduled and eventually go healthy, which results in a successful deploy.
2. nomad job run randfail.canary.hcl to perform the blue/green deploy. Allocations here will fail as well, but eventually the desired healthy count is reached via the rescheduled allocations. However, if you press Promote or use the promote API, it fails, telling you there aren't enough healthy allocations to succeed.
Hi guys, any confirmation whether this is a known issue? Maybe I'm doing something wrong? Would love to get blue/green deployments going for our org; devs feel rolling deploys are a bit painful in terms of speed :wink:.
Tested this on 0.10.0 and verified that the issue still exists. Given that blue/green deploy is a fundamental feature marketed for Nomad, and that this issue makes the feature unusable, when can we expect traction on it?
Nomad 0.10.0 canary promotion failure:


@dadgar are you able to comment?
Interesting. It has been 50 minutes past the progress deadline, and the deployment is still running, waiting to be promoted, but it cannot be promoted because of the bug in this issue. So this deployment is completely stuck; it does not fail. It looks like the only way to get it unstuck is to run another deployment.
@eveld, I forgot to ask about this in the webinar this morning 😅. I would really appreciate an update on this issue as I'd love to sell blue/green deploys to my org but can't promote it currently because of this issue... could this issue get some ❤️ ?
Hi @djenriquez, sorry it took so long to get back to you on this. I've been trying to replicate the behavior you're showing here but unfortunately I haven't been able to do so.
I've been using the randfail jobs you provided (thanks so much for providing test cases!). I checked through the spec pretty carefully along with the docs to make sure that this isn't just a misconfiguration or documentation confusion.
It looks like the initial randfail.hcl job can take a very long time to converge to a successful state if the random number generation is unkind. In my most recent run here I had to place 25 containers before I got 10 that were ok. But once it was running, I ran randfail.canary.hcl, waited for 10 healthy canaries to appear, and was able to promote it with no problems.
▶ nomad deployment status b69f6388
ID = b69f6388
Job ID = randfail
Job Version = 1
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Promoted Desired Canaries Placed Healthy Unhealthy Progress Deadline
randfail true 10 10 15 10 5 2019-11-08T21:02:55Z
I've assigned this issue to myself and I want to see if I can create a faster-converging test case, but first I need to be able to repro.
I did make one modification to the jobs which was to reduce the amount of memory used so that I could fit the deployment on my test rig. And I did notice on one run that one of the previous deploy's tasks failed _after_ the deployment had been marked healthy. Given the size of the jobs we're running here, is there any chance that some of the tasks are becoming unhealthy in your environment because of resource restrictions, and that's fouling the deployment?
Hi Tim,
Thank you for getting this on your list! Hmm, if you are unable to replicate it, then maybe something is wrong in my environment. I can pretty much replicate this behavior every time.
We are using Nomad 0.10.0 with Consul 1.6.1. The jobs definitely have enough resources, and it is not an issue after the deployment; it always happens during the deployment, as shown in the screenshots in my earlier posts.
We use a three-server Nomad deployment and reach the UI via a load balancer that round-robins requests between the Nomad servers. I wonder whether I would see these issues with a single-server deployment.
@tgross did you try to replicate the issue with a multi-server environment?
@tgross I verified. In a single server deployment, everything works well. I reduced the server count to 1 for our development environment and ran through the randfail tests:

Promoting this deployment results in a successful deployment:

However, when I increased the server count to _just_ 2 servers, the issue returned.
With two servers (notice the previous deployment, which is the one shown above, was successful):


So this issue only exists in multi-server environments, which I would think covers _every_ production environment that uses Nomad.
It appears there are state synchronization issues. A bug in Raft?
Thanks for that test! That's really interesting and probably narrows down the set of places where this bug could be lurking. I'll continue my investigation and circle back here soon.
Ok, I've been able to reproduce and found some interesting logs! I went through the process and waited for the deployment status to report 10 healthy, and then promoted it:
▶ nomad deployment status 6fad6d49
ID = 6fad6d49
Job ID = randfail
Job Version = 4
Status = running
Description = Deployment is running but requires manual promotion
Deployed
Task Group Promoted Desired Canaries Placed Healthy Unhealthy Progress Deadline
randfail false 10 10 16 10 6 2019-11-11T16:25:49Z
▶ nomad deployment promote 6fad6d49
Error promoting deployment: Unexpected response code: 500 (rpc error: 1 error(s) occurred:
* Task group "randfail" has 5/10 healthy allocations)
The job status is showing something strange, however. We're showing 10 healthy canaries for the deployment but only 15 allocations in the running state. There are 5 allocations showing complete, which is throwing off our calculation here. For service jobs, allocations shouldn't be marked as complete, and there might be some buggy logic in the canary code that doesn't account for that. @djenriquez, can you see if that's happening in your environment as well?
▶ nomad job status randfail
ID = randfail
Name = randfail
Submit Date = 2019-11-11T11:14:00-05:00
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
randfail 0 0 15 34 34 0
Latest Deployment
ID = 6fad6d49
Status = running
Description = Deployment is running but requires manual promotion
Deployed
Task Group Promoted Desired Canaries Placed Healthy Unhealthy Progress Deadline
randfail false 10 10 16 10 6 2019-11-11T16:25:49Z
Allocations
ID Node ID Task Group Version Desired Status Created Modified
1dc3f436 9a323737 randfail 4 run running 12m ago 11m47s ago
5b6f4233 5ecd049f randfail 4 run running 12m52s ago 12m17s ago
4874a6b8 9a323737 randfail 4 stop failed 12m52s ago 12m ago
016e1a2f 5ecd049f randfail 4 stop complete 12m52s ago 12m21s ago
6563ee3c 5ecd049f randfail 4 run running 12m52s ago 12m23s ago
1afd8c03 9a323737 randfail 4 run running 12m52s ago 12m14s ago
bcd7ed18 5ecd049f randfail 4 run running 13m35s ago 13m16s ago
cd1b3113 5ecd049f randfail 4 run running 13m35s ago 13m15s ago
15be4779 5ecd049f randfail 4 stop failed 13m35s ago 12m52s ago
9cee1388 9a323737 randfail 4 stop failed 13m35s ago 12m52s ago
969bef4a 5ecd049f randfail 4 run running 13m35s ago 13m14s ago
44c5cb3c 9a323737 randfail 4 run running 13m35s ago 13m13s ago
20eadfa0 5ecd049f randfail 4 stop failed 13m35s ago 12m52s ago
4329ba2c 9a323737 randfail 4 stop failed 13m35s ago 12m52s ago
9bfc5bd5 9a323737 randfail 4 run running 13m35s ago 13m23s ago
afcf0d9a 9a323737 randfail 4 stop failed 13m35s ago 12m52s ago
32b781ec 9a323737 randfail 3 stop complete 14m50s ago 12m21s ago
1a0220ab 9a323737 randfail 3 stop failed 16m22s ago 14m50s ago
cfa51ca5 9a323737 randfail 3 stop failed 17m53s ago 16m22s ago
349d9268 9a323737 randfail 3 stop failed 19m25s ago 17m53s ago
9b86619a 9a323737 randfail 3 run running 20m57s ago 20m43s ago
751d8b50 9a323737 randfail 3 stop failed 20m57s ago 19m25s ago
0b574c09 9a323737 randfail 3 stop failed 22m9s ago 20m57s ago
c4e5be40 5ecd049f randfail 3 stop complete 22m9s ago 12m21s ago
05ee3ff6 9a323737 randfail 3 stop failed 22m9s ago 20m57s ago
6516abdb 9a323737 randfail 3 run running 22m9s ago 21m33s ago
7ad3f3ed 9a323737 randfail 3 stop failed 23m ago 22m9s ago
a5a09865 5ecd049f randfail 3 run running 23m ago 22m21s ago
65b4c14b 9a323737 randfail 3 stop failed 23m ago 22m9s ago
1a88ac04 5ecd049f randfail 3 stop failed 23m ago 22m9s ago
31f1eb2a 9a323737 randfail 3 stop failed 23m ago 22m9s ago
7744af58 5ecd049f randfail 3 run running 23m ago 22m48s ago
482e2463 5ecd049f randfail 3 stop failed 23m42s ago 23m ago
7322b31f 5ecd049f randfail 3 stop failed 23m42s ago 23m ago
dc470860 5ecd049f randfail 3 run running 23m42s ago 23m11s ago
c77d1258 5ecd049f randfail 3 run running 23m42s ago 23m22s ago
eb48de13 9a323737 randfail 3 stop failed 23m42s ago 23m ago
da0336cf 9a323737 randfail 3 stop complete 23m42s ago 12m21s ago
ee13d690 5ecd049f randfail 3 stop failed 23m42s ago 23m ago
4639a023 9a323737 randfail 3 stop failed 23m42s ago 23m ago
099e0283 9a323737 randfail 3 stop failed 23m42s ago 23m ago
c8819235 9a323737 randfail 3 stop complete 23m42s ago 12m21s ago
...
Here's the alloc status of one of these jobs (it has no logs, however):
▶ nomad alloc status 016e1a2f
ID = 016e1a2f-f2c1-ea23-633f-4465f587fa10
Eval ID = a834cd25
Name = randfail.randfail[7]
Node ID = 5ecd049f
Node Name = ip-172-31-33-81
Job ID = randfail
Job Version = 4
Client Status = complete
Client Description = All tasks have completed
Desired Status = stop
Desired Description = alloc not needed due to job update
Created = 18m58s ago
Modified = 18m27s ago
Deployment ID = 6fad6d49
Deployment Health = healthy
Canary = true
Task "server" is "dead"
Task Resources
CPU Memory Disk Addresses
0/100 MHz 4.2 MiB/64 MiB 300 MiB http: 172.31.33.81:24409
Task Events:
Started At = 2019-11-11T16:14:44Z
Finished At = 2019-11-11T16:15:14Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2019-11-11T11:15:14-05:00 Killed Task successfully killed
2019-11-11T11:15:14-05:00 Terminated Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2"
2019-11-11T11:15:14-05:00 Killing Sent interrupt. Waiting 5s before force killing
2019-11-11T11:14:44-05:00 Started Task started by client
2019-11-11T11:14:43-05:00 Driver Downloading image
2019-11-11T11:14:43-05:00 Task Setup Building Task Directory
2019-11-11T11:14:43-05:00 Received Task received by client
The server logs also show that each server has a different view of the task group health at the moment we try to do the promotion.
Server 1:
Nov 11 16:16:38 ip-172-31-39-230 nomad[2173]: 2019-11-11T16:16:38.104Z [ERROR] nomad.fsm: UpsertDeploymentPromotion failed: error="1 error(s) occurred:
Nov 11 16:16:38 ip-172-31-39-230 nomad[2173]: * Task group "randfail" has 6/10 healthy allocations"
Server 2:
Nov 11 16:16:38 ip-172-31-37-118 nomad[2161]: 2019-11-11T16:16:38.079Z [ERROR] nomad.fsm: UpsertDeploymentPromotion failed: error="1 error(s) occurred:
Nov 11 16:16:38 ip-172-31-37-118 nomad[2161]: * Task group "randfail" has 8/10 healthy allocations"
Server 3:
Nov 11 16:16:38 ip-172-31-45-45 nomad[2168]: 2019-11-11T16:16:38.020Z [ERROR] nomad.fsm: UpsertDeploymentPromotion failed: error="1 error(s) occurred:
Nov 11 16:16:38 ip-172-31-45-45 nomad[2168]: * Task group "randfail" has 5/10 healthy allocations"
But the weird thing is that this is persistent. Even after I'd waited long enough to write up this whole report, I retried the deployment and they all show the same error message; they haven't converged on an opinion about task group health. I may need to dig into how that gets determined.
It looks like it isn't the allocations exiting on their own that's causing them to be marked complete, but Nomad stopping them. We can see the following in the alloc status:
Desired Status = stop
Desired Description = alloc not needed due to job update
The eval for that allocation was:
▶ nomad eval status -verbose a834cd25
ID = a834cd25-2d9d-1d99-7400-a024d148358f
Create Time = 2019-11-11T11:14:35-05:00
Modify Time = 2019-11-11T11:14:43-05:00
Status = complete
Status Description = complete
Type = service
TriggeredBy = alloc-failure
Priority = 50
Placement Failures = false
Wait Until = 2019-11-11T16:14:43Z
Previous Eval = 2b663f9d-1c2a-8b8b-f86b-c9fd8c042bdd
Next Eval = <none>
Blocked Eval = <none>
Which has the following associated log entries:
2019-11-11T16:14:43.746Z [DEBUG] worker: dequeued evaluation: eval_id=a834cd25-2d9d-1d99-7400-a024d148358f
2019-11-11T16:14:43.746Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=a834cd25-2d9d-1d99-7400-a024d148358f job_id=randfail namespace=default results="Total changes: (place 5) (destructive 0) (inplace 0) (stop 5)
Desired Changes for "randfail": (place 5) (inplace 0) (destructive 0) (stop 5) (migrate 0) (ignore 15) (canary 0)"
2019-11-11T16:14:43.760Z [DEBUG] worker: submitted plan for evaluation: eval_id=a834cd25-2d9d-1d99-7400-a024d148358f
2019-11-11T16:14:43.760Z [DEBUG] worker.service_sched: setting eval status: eval_id=a834cd25-2d9d-1d99-7400-a024d148358f job_id=randfail namespace=default status=complete
2019-11-11T16:14:43.768Z [DEBUG] worker: updated evaluation: eval="<Eval "a834cd25-2d9d-1d99-7400-a024d148358f" JobID: "randfail" Namespace: "default">"
2019-11-11T16:14:43.769Z [DEBUG] worker: ack evaluation: eval_id=a834cd25-2d9d-1d99-7400-a024d148358f
So now we need to figure out why Nomad decided that we were supposed to stop these allocations.
I've been leaving this job running as I've been debugging, and a random failed health check on one of the tasks kicked off a new round of evaluations and a few more completed tasks. This brought the count of allegedly healthy allocations up to 8/10.
And here's the allocations for our job, which is showing _more_ than 10 of version 4 now:
Allocations
ID Node ID Task Group Version Desired Status Created Modified
583cbfad 9a323737 randfail 4 stop complete 1h17m ago 1h17m ago
f10ea6cb 5ecd049f randfail 4 stop complete 1h17m ago 1h17m ago
aebcfb34 5ecd049f randfail 4 run running 1h17m ago 1h16m ago
04fbda89 9a323737 randfail 4 run running 1h17m ago 1h16m ago
605fe0f4 9a323737 randfail 4 stop complete 1h17m ago 1h17m ago
fe72e828 5ecd049f randfail 4 run running 1h17m ago 1h16m ago
1d02f273 5ecd049f randfail 4 stop complete 1h17m ago 1h17m ago
401019b0 9a323737 randfail 4 stop complete 1h17m ago 1h17m ago
9e690906 9a323737 randfail 4 run running 1h18m ago 1h17m ago
24772c81 5ecd049f randfail 4 run running 1h18m ago 1h17m ago
8a42783e 9a323737 randfail 4 stop complete 1h18m ago 1h17m ago
cf890b71 9a323737 randfail 4 stop failed 1h18m ago 1h17m ago
1dc3f436 9a323737 randfail 4 run running 3h39m ago 3h39m ago
6563ee3c 5ecd049f randfail 4 run running 3h40m ago 3h39m ago
1afd8c03 9a323737 randfail 4 run running 3h40m ago 3h39m ago
5b6f4233 5ecd049f randfail 4 stop complete 3h40m ago 1h17m ago
969bef4a 5ecd049f randfail 4 run running 3h41m ago 3h40m ago
cd1b3113 5ecd049f randfail 4 run running 3h41m ago 3h40m ago
bcd7ed18 5ecd049f randfail 4 run running 3h41m ago 3h40m ago
9bfc5bd5 9a323737 randfail 4 run running 3h41m ago 3h40m ago
44c5cb3c 9a323737 randfail 4 run running 3h41m ago 3h40m ago
9b86619a 9a323737 randfail 3 stop complete 3h48m ago 1h17m ago
6516abdb 9a323737 randfail 3 run running 3h49m ago 3h49m ago
7744af58 5ecd049f randfail 3 run running 3h50m ago 3h50m ago
a5a09865 5ecd049f randfail 3 run running 3h50m ago 3h49m ago
c77d1258 5ecd049f randfail 3 stop complete 3h51m ago 1h17m ago
dc470860 5ecd049f randfail 3 run running 3h51m ago 3h50m ago
Which the deployment status agrees with:
▶ nomad deployment status 6fad6d49
ID = 6fad6d49
Job ID = randfail
Job Version = 4
Status = running
Description = Deployment is running but requires manual promotion
Deployed
Task Group Promoted Desired Canaries Placed Healthy Unhealthy Progress Deadline
randfail false 10 10 28 17 11 2019-11-11T18:48:21Z
@tgross working through some of your questions right now. The jobs are rescheduled via the reschedule stanza timeouts, so setting those (and the health check timings) to lower values could speed up the test case quite a bit.
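For example, something like this in the task group should make the failed allocs come back much faster (the values are just a suggestion, not what's currently in randfail.hcl):

reschedule {
  delay          = "5s"
  delay_function = "constant"
  unlimited      = true
}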
I'm running on a 2-server Nomad deployment, and each server shows the same status (might just be my luck; I'll try to increase the fail percentage after this). The status is the same on both:
[ec2-user@ip-10-179-163-250 ~]$ nomad deployment status d6e28944
ID = d6e28944
Job ID = randfail
Job Version = 6
Status = running
Description = Deployment is running but requires manual promotion
Deployed
Task Group Promoted Desired Canaries Placed Healthy Unhealthy Progress Deadline
randfail false 10 10 15 10 5 2019-11-11T20:53:05Z
[ec2-user@ip-10-179-163-250 ~]$ nomad deployment promote d6e28944
Error promoting deployment: Unexpected response code: 500 (rpc error: 1 error(s) occurred:
* Task group "randfail" has 9/10 healthy allocations)
[ec2-user@ip-10-179-158-186 ~]$ nomad deployment status d6e28944
ID = d6e28944
Job ID = randfail
Job Version = 6
Status = running
Description = Deployment is running but requires manual promotion
Deployed
Task Group Promoted Desired Canaries Placed Healthy Unhealthy Progress Deadline
randfail false 10 10 15 10 5 2019-11-11T20:53:05Z
[ec2-user@ip-10-179-158-186 ~]$ nomad deployment promote d6e28944
Error promoting deployment: Unexpected response code: 500 (1 error(s) occurred:
* Task group "randfail" has 9/10 healthy allocations)
ip-10-179-158-186 is reporting as the leader.
When I list the allocations, I do see an alloc of the previous (blue) job version that shows as complete:
[ec2-user@ip-10-179-163-250 ~]$ nomad alloc status f34d7544
ID = f34d7544
Eval ID = 0b3801cd
Name = randfail.randfail[9]
Node ID = 9590c3a5
Node Name = i-053006cf1b10c0c05
Job ID = randfail
Job Version = 5
Client Status = complete
Client Description = All tasks have completed
Desired Status = stop
Desired Description = alloc not needed due to job update
Created = 12m47s ago
Modified = 7m39s ago
Deployment ID = f978b097
Deployment Health = healthy
Task "server" is "dead"
Task Resources
CPU Memory Disk Addresses
0/100 MHz 728 KiB/300 MiB 300 MiB http: 10.179.173.53:22187
Task Events:
Started At = 2019-11-11T20:35:22Z
Finished At = 2019-11-11T20:40:28Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2019-11-11T20:40:28Z Killed Task successfully killed
2019-11-11T20:40:28Z Terminated Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2"
2019-11-11T20:40:28Z Killing Sent interrupt. Waiting 5s before force killing
2019-11-11T20:35:22Z Started Task started by client
2019-11-11T20:35:20Z Driver Downloading image
2019-11-11T20:35:20Z Task Setup Building Task Directory
2019-11-11T20:35:20Z Received Task received by client
Its evaluation:
[ec2-user@ip-10-179-163-250 ~]$ nomad eval status -verbose 0b3801cd
ID = 0b3801cd-b1a9-2beb-5eac-04b52f3ee7ec
Create Time = 2019-11-11T20:36:21Z
Modify Time = 2019-11-11T20:36:21Z
Status = complete
Status Description = complete
Type = service
TriggeredBy = job-register
Job ID = randfail
Priority = 50
Placement Failures = false
Previous Eval = <none>
Next Eval = <none>
Blocked Eval = <none>
The slow rotation of tasks onto version 4 made me suspicious. So I caused one of the version 3 tasks to lose its network connectivity and fail its health check, and in response Nomad created multiple version 4 tasks:
Allocations
ID Node ID Task Group Version Desired Status Created Modified
11d0d6dd 5ecd049f randfail 4 run running 9s ago 8s ago
e8405340 9a323737 randfail 4 run running 9s ago 8s ago
ac38de6c 9a323737 randfail 4 run running 13s ago 12s ago
8b0c6453 9a323737 randfail 4 stop complete 22s ago 8s ago
1c04ea9a 9a323737 randfail 4 run running 22s ago 10s ago
418bcfa6 9a323737 randfail 4 stop complete 22s ago 8s ago
aebcfb34 5ecd049f randfail 4 run running 2h10m ago 2h9m ago
04fbda89 9a323737 randfail 4 run running 2h10m ago 2h9m ago
fe72e828 5ecd049f randfail 4 run running 2h10m ago 2h9m ago
9e690906 9a323737 randfail 4 run running 2h11m ago 2h10m ago
24772c81 5ecd049f randfail 4 run running 2h11m ago 2h10m ago
1dc3f436 9a323737 randfail 4 run running 4h32m ago 4h32m ago
6563ee3c 5ecd049f randfail 4 run running 4h33m ago 4h32m ago
1afd8c03 9a323737 randfail 4 run running 4h33m ago 4h32m ago
969bef4a 5ecd049f randfail 4 run running 4h34m ago 4h33m ago
9bfc5bd5 9a323737 randfail 4 run running 4h34m ago 4h33m ago
cd1b3113 5ecd049f randfail 4 run running 4h34m ago 4h33m ago
44c5cb3c 9a323737 randfail 4 run running 4h34m ago 4h33m ago
bcd7ed18 5ecd049f randfail 4 run running 4h34m ago 4h33m ago
6516abdb 9a323737 randfail 3 run running 4h42m ago 4h42m ago
7744af58 5ecd049f randfail 3 run running 4h43m ago 4h43m ago
a5a09865 5ecd049f randfail 3 run running 4h43m ago 4h42m ago
dc470860 5ecd049f randfail 3 stop failed 4h44m ago 13s ago
At this point, Nomad thinks we have 10/10 and a request to promote succeeds:
▶ nomad deployment promote 6fad6d49
==> Monitoring evaluation "bc955bbc"
Evaluation triggered by job "randfail"
Evaluation within deployment: "6fad6d49"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "bc955bbc" finished with status "complete"
The version 3 allocations are then stopped, as are any version 4 allocations beyond the 10 we want.
@djenriquez at this point I've got a working reproduction and at least some theories as to where the problems could be occurring. I'm going to try to create a more minimal test case and start digging into the areas of code we're touching here next.
@djenriquez I've run a number of tests, changing some of the parameters to see if I can narrow the search space a bit. With a smaller job that fails a bit less often, or when using a smaller number of canaries, I've found it's easier to get a successful or failed deployment than this mysterious stuck deployment. So as a worst case for you in the meantime, you might want to consider using a smaller set of canaries than full blue/green until we have this figured out.
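As a rough sketch of that workaround (values are illustrative, adjust to your job), an update stanza with a small canary set instead of one canary per allocation:

update {
  max_parallel = 2
  canary       = 2     # small canary set instead of full blue/green
  auto_promote = false
}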
Here's where we are so far. I've done some scripting to parse through the evaluations API, and unfortunately it looks like there may be two different buggy behaviors to unpack here. The obvious symptom described as (1) below isn't necessarily the direct cause of the deploys getting stuck (2), because we can get stuck without it happening. But because I can't reproduce either of these behaviors without canaries, that at least narrows it down to the canary code.
When deploying job version N+1, there are allocations of job version N that are being prematurely replaced. Random version N allocations are being marked for stopping _before_ we try to promote the job, and even before all the canaries have been deployed. The allocation status includes the following:
Desired Status = stop
Desired Description = alloc not needed due to job update
That particular message only shows up in a few places in the code base, mostly around the handleGroupCanaries() method in scheduler/reconcile.go#L563-L613. So that'll be the obvious next place to look.
This graph shows an example of an in-flight deployment. The blue allocations on the right-hand side (1a6eea8 and 5d81eb5) were v0 allocations that Nomad marked for stopping as above, even though the v1 deploy was still ongoing. This deployment goes on for quite a while and results in many more evaluations before getting stuck.

▶ nomad deployment promote 1358f244
Error promoting deployment: Unexpected response code: 500 (rpc error: 1 error(s) occurred:
* Task group "randfail" has 4/5 healthy allocations)
▶ nomad job status randfail
ID = randfail
Name = randfail
Submit Date = 2019-11-13T11:49:34-05:00
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
randfail 0 0 10 1 5 0
Latest Deployment
ID = 1358f244
Status = running
Description = Deployment is running but requires manual promotion
Deployed
Task Group Promoted Desired Canaries Placed Healthy Unhealthy Progress Deadline
randfail false 5 5 6 5 1 2019-11-13T17:00:04Z
Allocations
ID Node ID Task Group Version Desired Status Created Modified
628b73d1 54eca36f randfail 2 run running 2m20s ago 2m8s ago
4dd362ed 4f1ca2ae randfail 2 run running 2m37s ago 2m22s ago
205e4dff 54eca36f randfail 2 run running 2m37s ago 2m23s ago
d5674105 54eca36f randfail 2 run running 2m37s ago 2m24s ago
3afc7d56 4f1ca2ae randfail 2 stop failed 2m37s ago 2m20s ago
f1d59c8d 54eca36f randfail 2 run running 2m37s ago 2m24s ago
4b2996a1 54eca36f randfail 1 run running 3m3s ago 2m47s ago
6358e75c 54eca36f randfail 1 run running 3m3s ago 2m51s ago
95cca0e0 4f1ca2ae randfail 1 run running 3m3s ago 2m49s ago
ab75f24b 4f1ca2ae randfail 1 run running 3m3s ago 2m51s ago
d8dcf3e5 4f1ca2ae randfail 1 run running 3m38s ago 3m24s ago
5905a42b 54eca36f randfail 0 stop complete 4m34s ago 3m3s ago
c5587363 54eca36f randfail 0 stop complete 4m34s ago 3m3s ago
346a1a4f 4f1ca2ae randfail 0 stop complete 4m34s ago 3m3s ago
4732a9a2 4f1ca2ae randfail 0 stop complete 4m34s ago 3m3s ago
1d68b747 54eca36f randfail 0 stop complete 4m34s ago 3m3s ago
This is the stuck state graph, taken after the progress timer should have expired. The failed alloc is the red alloc 3afc7d56 on the left. The deployment 1358f244 (blue) hasn't been marked as failed, and we have 5 green canaries, but we can't promote it.

With a slightly improved version of the script I'm using to process the data, I think I've got a clue. The graph below is the same data as (2) above, but with the addition of the blue arrows that show the canaries that the deployments are tracking. Note that our stuck deployment deploy:1358f24 has no arrow to alloc:628b73d, which replaced the failed alloc:3afc7d5.
So it looks like the deployment is not tracking restarted canaries.

The odd thing here is that we'd expect this rescheduled allocation not to have been scheduled at all:
https://www.nomadproject.io/docs/job-specification/reschedule.html#rescheduling-during-deployments
Rescheduling during deployments
The update stanza controls rolling updates and canary deployments. A task group's reschedule stanza does not take effect during a deployment. For example, if a new version of the job is rolled out and the deployment failed due to a failing allocation, Nomad will not reschedule it.
But the raw data definitely says that eval:f3133b4e was triggered due to "alloc-failure"; the prior eval:caa77946 says it was triggered by "deployment-watcher".
Another workaround for you while we work on this @djenriquez is that if you have the restart stanza set, the rescheduling process won't need to kick in nearly as often and that should make for more reliable deployments. (And they're faster too because it's just restarting the task.)
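For example, a group-level restart stanza along these lines lets the client retry the task in place before rescheduling has to kick in (illustrative values, tune to taste):

restart {
  attempts = 2
  interval = "5m"
  delay    = "15s"
  mode     = "fail"
}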
This is incredibly impressive work on the investigation @tgross. I'm not sure where I can help you, but I'm more than happy to do so if there are things I'm able to do.
With the work-around, that actually doesn't work well for us, since the problem we experience that triggers this is related to https://github.com/hashicorp/nomad/issues/6567, which fails allocations immediately. It happens, unfortunately, more commonly than one would hope.
There is also https://github.com/hashicorp/nomad/issues/6350, which was pre-0.10.0, but back then we had to register Consul services at the task level, so we registered them on the sidecar proxy task since that represented the network for the service. With a restart stanza on the sidecar proxy's service, Nomad would only send a shutdown signal to the proxy, not the app itself, completely nullifying the intent of the restart stanza--the proxy would get restarted over and over while the app stayed in its bad state. It may be better with 0.10.0 since the service and check are registered at the task group level with network namespaces. Something for me to verify.
With the work-around, that actually doesn't work well for us, since the problem we experience that triggers this is related to #6567, which fails allocations immediately. It happens, unfortunately, more commonly than one would hope.
Ah, right... well, the good news is that my PR to the CNI plugins project was accepted, so hopefully they'll be cutting a release with that soon.
With a restart stanza on the sidecar proxy's service, Nomad would only send a shutdown signal to the proxy, not the app itself, completely nullifying the intent of the restart stanza--the proxy would get restarted over and over while the app that was in the bad state stayed in that bad state.
That issue is on my radar as well!
After some more testing I have working single-node tests and I've been breaking down the symptoms. There are two real problems and one false one:
1. Deployments with canaries but without restarts should fail when any canary fails, but instead they get stuck. The relevant code is in nomad/deploymentwatcher/deploymentwatcher.go, but it behaves correctly assuming that the alloc deployment status gets updated, so client/allocrunner is next.
2. When a deployment is running, a failed alloc for a previous job version should not be replaced with an alloc for the deployment's version but with the previous version. That logic lives in scheduler/reconcile.go, so that's where I'm looking next.
3. One of the symptoms we've seen in the large-scale tests is actually correct behavior, but it's emergent when the other bugs are happening. When a deployment job with canaries gets stuck, the next deployment will cause all its canaries to be stopped because they were never promoted. When this happens alongside (1), it looks like we're stopping previous-job-version allocs prematurely.
The issue I've identified as (2) above turns out to be a known tradeoff in the scheduler, which is in our internal design docs but not documented in public. I've opened #6723 to make sure this gets documented.
That leaves:
If an alloc passes health checks and is marked as healthy but later dies, the deployment status doesn't get updated and this is how we get stuck.
For each allocation that's part of a deployment, we start up a health_hook on the client. This health hook watches the allocation (see the relevant code here) until it is marked healthy (where "healthy" is defined by update.health_check), or until the update.healthy_deadline passes, or until the allocation stops. But once _one_ of these events happens, we report the status to the server's deployment watcher and then exit the client-side watcher.
Which means that if the allocation fails _after_ it initially reports healthy, we're in a state where the deployment thinks everything is healthy, but when we promote the deployment to make that a reality, the server's state machine rejects the change because there aren't enough canary allocations that are actually healthy.
The "obvious" fix to do is to _not_ exit the watcher loop, but allow subsequent health updates to change the deployment status. But it's not clear how this should interact with the progress deadline. If the progress deadline has passed with all canaries healthy, but one of the canaries fails, what should happen? Are there other implications to the deployment behavior I haven't thought of?
Alternately (as a somewhat incomplete fix), we could have the promotion process fail the deployment if it hits an error past the promotion deadline.
Minimal reproduction of the remaining issue:
# start a job
nomad job run ./test.hcl
# wait for the deployment to succeed
nomad deployment status $id
# bump the env.version and redeploy
nomad job run ./test.hcl
# wait for all 3 allocs to be marked healthy, so that the
# deployment is pending manual approval
nomad deployment status $id
# note that we have 6 running allocs
nomad job status test
# kill one of the new containers
docker kill $(docker ps | awk '/nginx/{print $1}' | head -1)
# wait for progress deadline to expire, deployment will not be failed
# and promotion will be stuck
nomad job status test
nomad deployment status $id
nomad deployment promote $id
jobspec:
job "test" {
datacenters = ["dc1"]
group "webservers" {
count = 3
task "nginx" {
driver = "docker"
config {
image = "nginx:latest"
port_map = {
http = 80
}
}
env {
version = "0"
}
service {
name = "nginx"
port = "http"
check {
type = "http"
port = "http"
path = "/"
interval = "5s"
timeout = "3s"
check_restart {
limit = 1
grace = "5s"
ignore_warnings = false
}
}
}
resources {
memory = 64
network {
mbits = 10
port "http" {}
}
}
}
restart {
attempts = 0
delay = "10s"
}
update {
max_parallel = 3
health_check = "checks"
min_healthy_time = "5s"
healthy_deadline = "30s"
progress_deadline = "2m"
auto_revert = false
auto_promote = false
canary = 3
}
}
}
@tgross, I think that, as a user of the system, I would expect that during a deployment, if a canary goes healthy and then fails, a new allocation needs to be added to make up for the failed one. As for the progress_deadline, I think it carries the same logic we would expect: failed allocations do not alter it, since it is just watching for new allocations to go healthy.
The edge case I see in this is the possibility of an endless deployment, where an allocation always goes unhealthy and is replaced before the final allocation can be marked as healthy. In this case, the progress deadline will continually be reset as new allocations come in and become healthy for a period of time. That said, I don't think this is incorrect, as that should be the responsibility of the healthy_deadline. If allocations are failing later, then the user has not represented the health lifecycle of their application well enough in the job config.
The "obvious" fix to do is to not exit the watcher loop, but allow subsequent health updates to change the deployment status. But it's not clear how this should interact with the progress deadline. If the progress deadline has passed with all canaries healthy, but one of the canaries fails, what should happen? Are there other implications to the deployment behavior I haven't thought of?
Also, taking a step back, I want to make sure the focus is on the correct issue. Yes, I do see the problematic scenario presented by this allocation watcher, but why is it not an issue when there is a single server? That seems to me like more of a shared-state issue than a logical issue with how the system handles failed allocations during a deploy. Remember, with a single server, it didn't matter how many times the allocations failed during the deploy, once the healthy number was hit, we were able to promote.
The edge case I see in this is the possibility of an endless deployment, where an allocation always goes unhealthy and is replaced before the final allocation can be marked as healthy. In this case, the progress deadline will continually be reset as new allocations come in and become healthy for a period of time.
Yeah, that's a good point.
Remember, with a single server, it didn't matter how many times the allocations failed during the deploy, once the healthy number was hit, we were able to promote.
I know I ended up spewing a whole bunch of debugging info here, so it would have been easy to miss 😀, but starting with https://github.com/hashicorp/nomad/issues/6407#issuecomment-554538572 it turned out that I was able to reproduce the behavior with a single node, even just in dev mode.
Ahhhh, I didn't know that. Interesting, so the problem goes deeper. Thanks for clarifying that for me.
I've opened https://github.com/hashicorp/nomad/issues/6740 to cover the remaining issue here. There are also a number of documentation improvements we should make, starting with https://github.com/hashicorp/nomad/issues/6723.