Nomad: Deployments never finish in 0.7.0, but work in 0.7.0-rc1

Created on 8 Nov 2017 · 16 comments · Source: hashicorp/nomad

After moving our masters from 0.7.0-rc1 to 0.7.0 I found this issue:

This happens with a job with count > 1, a registered service, and the Docker driver:

  1. Deploy job update with Update stanza
  2. only first task is allocated
  3. deployment does not finish

In order to finish the deployment, I have to _promote_ the deployment via Jippi-ui; the API returns an error, but the deployment then continues and finishes successfully.
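
Roughly the same workaround from the CLI or the raw HTTP API (the deployment ID below is a placeholder, and the JSON body is my assumption of the usual promote payload); both return the error, but the deployment then resumes:

❯ nomad deployment promote <deployment-id>
Error promoting deployment: Unexpected response code: 500 (rpc error: no canaries to promote)

❯ curl -X PUT \
    -d '{"DeploymentID": "<full-deployment-id>", "All": true}' \
    http://localhost:4646/v1/deployment/promote/<full-deployment-id>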


We had approximately 15 jobs stuck waiting for deployments to finish with the above issue. After I rolled the cluster back to the 0.7.0-rc1 binaries, all the jobs immediately continued deploying (and finished) once the leader was on rc1.
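
For reference, a minimal sketch of the shape of job that hits this (count > 1, a registered service, the docker driver, and an update stanza); the job name, image, and resources are illustrative, not our actual job file:

job "example" {
  datacenters = ["dc1"]
  type        = "service"

  update {
    # max_parallel = 1 should roll the group one allocation at a time
    stagger      = "30s"
    max_parallel = 1
  }

  group "example" {
    count = 2   # any count > 1 shows the problem

    task "example" {
      driver = "docker"

      config {
        image = "example/service:latest"

        port_map {
          http = 8080
        }
      }

      service {
        name = "example"
        port = "http"
      }

      resources {
        cpu    = 100
        memory = 128

        network {
          mbits = 1
          port "http" {}
        }
      }
    }
  }
}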

Labels: stage/needs-investigation, theme/deployments, type/bug

All 16 comments

When trying to replicate this on a local -dev environment with one local instance, 0.7.0 didn't have any issues, so it must be something related to our cluster.
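
For completeness, the local attempt was roughly this (single node in dev mode; the job file name is just a placeholder), and the deployment completed normally on 0.7.0:

❯ nomad agent -dev        # one local server+client instance
❯ nomad run example.hcl   # submit an update to trigger a deployment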

If it helps at all, our agents were running _0.7.0-beta1_ at the time. Could it be related?

Hi, same issue here.

master upgraded to 0.7.0 from 0.7.0-rc1. Agents upgraded from 0.7.0-beta1 to 0.7.0.

With count = 4, I ran a nomad run; only 1 alloc was marked complete and 1 new alloc started.
I had to run nomad run 4 times to get all 4 new allocs started.

I confirm that the nomad deployment promote workaround works despite the error (Error promoting deployment: Unexpected response code: 500 (rpc error: no canaries to promote)).

We'll definitely investigate and get this fixed ASAP. Could you include any relevant details you can? Job files, or at least the update stanzas; logs, or at least any deployment-related lines or lines relating to the allocations involved.

Thanks for the reports! We're eager to get to the bottom of this!

@discobean @commarla I would also love to see what the alloc-status or the allocation via the API looked like when it was stuck. We emit events that say why the alloc is marked as healthy or unhealthy.
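
Something along these lines would capture it (the alloc ID is a placeholder; the HTTP endpoint wants the full allocation UUID and assumes the default address):

❯ nomad alloc-status -verbose <alloc-id>
❯ curl -s http://localhost:4646/v1/allocation/<full-alloc-uuid>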

Here is my update stanza

update {
  stagger          = "30s"
  health_check     = "checks"
  max_parallel     = "1"
  min_healthy_time = "10s"
  healthy_deadline = "5m"
  auto_revert      = "true"
  canary           = "0"
}

Same issue with just

update {
  stagger      = "30s"
  max_parallel = "1"
}

I have added a fake env var to one of my services with a count = 5.

❯ nomad plan fakeservice.hcl
+/- Job: "service"
+/- Task Group: "fakeservice" (1 create/destroy update, 4 ignore)
  +/- Task: "fakeservice" (forces create/destroy update)
    + Env[TEST]: "LOL"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 29474
To submit the job with version verification run:

I run nomad run fakeservice.hcl and only one alloc is started.

Allocations
ID        Node ID   Task Group   Version  Desired  Status    Created At
02a32350  1d42ea9e  fakeservice  7        run      running   11/08/17 21:43:43 CET
a181b442  c4431904  fakeservice  6        run      running   11/08/17 17:13:04 CET
079220ed  f86b1151  fakeservice  6        run      running   11/08/17 17:12:32 CET
85de9fd9  79428961  fakeservice  6        run      running   11/08/17 17:11:55 CET
074559ca  d7f7710e  fakeservice  6        run      running   11/08/17 17:11:23 CET
e6a22153  1d42ea9e  fakeservice  6        stop     complete  11/08/17 17:09:01 CET

A gist with the alloc status via the API: https://gist.github.com/commarla/57b1a7dae6cea5a4f6af665b742b1ab4

The deployment status:

❯ nomad deployment status 15e63088
ID          = 15e63088
Job ID      = fakeservice
Job Version = 7
Status      = running
Description = Deployment is running

Deployed
Task Group   Desired  Placed  Healthy  Unhealthy
fakeservice  5        1       1        0

Then I run

nomad deployment promote 15e63088
Error promoting deployment: Unexpected response code: 500 (rpc error: no canaries to promote)

And the deployment continues:

❯ nomad deployment status 15e63088
ID          = 15e63088
Job ID      = fakeservice
Job Version = 7
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group   Desired  Placed  Healthy  Unhealthy
fakeservice  5        5       5        0

Thanks!

I have the same update stanza as @commarla; the only difference is we use the HTTP API for deployments.

@discobean @commarla Does this happen on a new 0.7 cluster as well, or only via the rc1 upgrade path?

@dadgar yes it does. We started a new 3-node cluster on 0.7.
Our clients are on 0.7 now, but most of our jobs were started when our clients were on beta. We drained the old nodes to restart the jobs on the new ones.

@dadgar It was an upgraded cluster.

I also tried stopping/GC'ing and removing the job, then re-adding it, but the result was the same.
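
Roughly the sequence I tried (job and file names are placeholders):

❯ nomad stop <job>        # stop the job
❯ nomad system gc         # force a garbage collection so the stopped job is removed
❯ nomad run <job>.hcl     # re-add the job; the deployment got stuck again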

@discobean @commarla Would it be possible to see server logs from the misbehaving 0.7.0 cluster?

Hi @dadgar

I have no logs during the nomad run, only these during our workaround promote:

Nov 13 09:43:54 admin-10-32-151-102 nomad[30298]: 2017/11/13 09:43:54 [ERR] nomad.fsm: UpsertDeploymentPromotion failed: no canaries to promote
Nov 13 09:43:54 admin-10-32-151-102 nomad[30298]: 2017/11/13 09:43:54.848116 [ERR] http: Request /v1/deployment/promote/233039b3-f6ba-130c-c1bb-135b46b1a318, error: no canaries to promote

Today we tried upgrading again, this time going one version at a time from rc1 -> rc2 -> rc3 -> 0.7.0, and couldn't replicate the issue this time.

We do shut down the agents each night, so maybe the problem will appear again tomorrow - I'll be sure to update.

@discobean @commarla I would love some reproduction steps. To be clear: a setup that reproduces this starting from scratch, with clear steps.

I spent quite a bit of time trying to reproduce this today and I couldn't. To reproduce I tried:

  1. Starting a cluster of Nomad 0.7.0-rc1 servers and beta1 clients since that is what you both had.
  2. Running jobs.
  3. Upgrading servers to 0.7.0
  4. Running more jobs to trigger deployments.

Further, I tried the above with everything on 0.7.0-rc1 and also on beta1. I also tried causing various leader elections during the process. Nothing I did could cause what you all are seeing.

I also played around with it in a fresh 0.7.0 cluster (no upgrades) and it worked properly as well.

@dadgar, I can no longer reproduce it either. Even after the cluster came up this morning, it's all working as expected right now - I have no leads at all.

Okay, I am going to close this. Please bump the issue if it comes up again. If it does, please collect all the server logs as well as the allocation information for the misbehaving job, both before and after using the promote workaround.
