Nomad: Deployments never finish in 0.7.0, but work in 0.7.0-rc1

Created on 8 Nov 2017 · 16 comments · Source: hashicorp/nomad

After moving our masters from 0.7.0-rc1 to 0.7.0 I found this issue:

This happens with a job with count > 1, a registered service, and the Docker driver:

  1. Deploy job update with Update stanza
  2. only first task is allocated
  3. deployment does not finish

In order to finish the deployment, I have to _promote_ the deployment via Jippi-ui; the API returns an error, but the deployment then continues and finishes successfully.
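
Roughly the same workaround from the CLI or the raw HTTP API (the deployment ID below is a placeholder, and the JSON body is my assumption of the usual promote payload); both return the error, but the deployment then resumes:

❯ nomad deployment promote <deployment-id>
Error promoting deployment: Unexpected response code: 500 (rpc error: no canaries to promote)

❯ curl -X PUT \
    -d '{"DeploymentID": "<full-deployment-id>", "All": true}' \
    http://localhost:4646/v1/deployment/promote/<full-deployment-id>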


We had approximately 15 jobs stuck waiting for deployments to finish with the above issue. After I rolled the cluster back to the 0.7.0-rc1 binaries, all the jobs immediately continued deploying (and finished) once the leader was on rc1.
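
For reference, a minimal sketch of the shape of job that hits this (count > 1, a registered service, the docker driver, and an update stanza); the job name, image, and resources are illustrative, not our actual job file:

job "example" {
  datacenters = ["dc1"]
  type        = "service"

  update {
    # max_parallel = 1 should roll the group one allocation at a time
    stagger      = "30s"
    max_parallel = 1
  }

  group "example" {
    count = 2   # any count > 1 shows the problem

    task "example" {
      driver = "docker"

      config {
        image = "example/service:latest"

        port_map {
          http = 8080
        }
      }

      service {
        name = "example"
        port = "http"
      }

      resources {
        cpu    = 100
        memory = 128

        network {
          mbits = 1
          port "http" {}
        }
      }
    }
  }
}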

Labels: stage/needs-investigation, theme/deployments, type/bug

All 16 comments

When trying to replicate this on a local -dev environment with one local instance, 0.7.0 didn't have any issues, so it must be something related to our cluster.
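
For completeness, the local attempt was roughly this (single node in dev mode; the job file name is just a placeholder), and the deployment completed normally on 0.7.0:

❯ nomad agent -dev        # one local server+client instance
❯ nomad run example.hcl   # submit an update to trigger a deployment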

If it helps at all, our agents were running _0.7.0-beta1_ at the time. Could it be related?

Hi, same issue here.

master upgraded to 0.7.0 from 0.7.0-rc1. Agents upgraded from 0.7.0-beta1 to 0.7.0.

With count = 4, I ran a nomad run; only 1 alloc was marked complete and 1 new alloc started.
I had to run nomad run 4 times to get all 4 new allocs started.

I confirm that the nomad deployment promote workaround works despite the error (Error promoting deployment: Unexpected response code: 500 (rpc error: no canaries to promote)).

We'll definitely investigate and get this fixed ASAP. Could you include any relevant details you can? Job files, or at least the update stanzas; logs, or at least any deployment-related lines or lines relating to the allocations involved.

Thanks for the reports! We're eager to get to the bottom of this!

@discobean @commarla I would also love to see what the alloc-status or the allocation via the API looked like when it was stuck. We emit events that say why the alloc is marked as healthy or unhealthy.
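
Something along these lines would capture it (the alloc ID is a placeholder; the HTTP endpoint wants the full allocation UUID and assumes the default address):

❯ nomad alloc-status -verbose <alloc-id>
❯ curl -s http://localhost:4646/v1/allocation/<full-alloc-uuid>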

Here is my update stanza

update {
  stagger          = "30s"
  health_check     = "checks"
  max_parallel     = "1"
  min_healthy_time = "10s"
  healthy_deadline = "5m"
  auto_revert      = "true"
  canary           = "0"
}

Same issue with just

update {
  stagger      = "30s"
  max_parallel = "1"
}

I have added a fake env var to one of my services with a count = 5.

❯ nomad plan fakeservice.hcl
+/- Job: "service"
+/- Task Group: "fakeservice" (1 create/destroy update, 4 ignore)
  +/- Task: "fakeservice" (forces create/destroy update)
    + Env[TEST]: "LOL"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 29474
To submit the job with version verification run:

I run nomad run fakeservice.hcl and only one alloc is started.

Allocations
ID        Node ID   Task Group   Version  Desired  Status    Created At
02a32350  1d42ea9e  fakeservice  7        run      running   11/08/17 21:43:43 CET
a181b442  c4431904  fakeservice  6        run      running   11/08/17 17:13:04 CET
079220ed  f86b1151  fakeservice  6        run      running   11/08/17 17:12:32 CET
85de9fd9  79428961  fakeservice  6        run      running   11/08/17 17:11:55 CET
074559ca  d7f7710e  fakeservice  6        run      running   11/08/17 17:11:23 CET
e6a22153  1d42ea9e  fakeservice  6        stop     complete  11/08/17 17:09:01 CET

A gist with the alloc status via the API: https://gist.github.com/commarla/57b1a7dae6cea5a4f6af665b742b1ab4

The deployment status:

❯ nomad deployment status 15e63088
ID          = 15e63088
Job ID      = fakeservice
Job Version = 7
Status      = running
Description = Deployment is running

Deployed
Task Group   Desired  Placed  Healthy  Unhealthy
fakeservice  5        1       1        0

Then I run

nomad deployment promote 15e63088
Error promoting deployment: Unexpected response code: 500 (rpc error: no canaries to promote)

And the deployment continues:

❯ nomad deployment status 15e63088
ID          = 15e63088
Job ID      = fakeservice
Job Version = 7
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group   Desired  Placed  Healthy  Unhealthy
fakeservice  5        5       5        0

Thanks!

I have the same update stanza as @commarla; the only difference is we use the HTTP API for deployments.

@discobean @commarla Does this happen on a new 0.7 cluster as well, or only via the rc1 upgrade path?

@dadgar yes it does. We started a new 3-node cluster on 0.7.
Our clients are on 0.7 now, but most of our jobs were started when our clients were on beta. We drained the old nodes to restart the jobs on the new ones.

@dadgar It was an upgraded cluster.

I also tried stopping/GC'ing and removing the job, then re-adding it, but the result was the same.
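
Roughly the sequence I tried (job and file names are placeholders):

❯ nomad stop <job>        # stop the job
❯ nomad system gc         # force a garbage collection so the stopped job is removed
❯ nomad run <job>.hcl     # re-add the job; the deployment got stuck again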

@discobean @commarla Would it be possible to see server logs from the misbehaving 0.7.0 cluster?

Hi @dadgar

I have no logs during the nomad run, only these during our workaround promote:

Nov 13 09:43:54 admin-10-32-151-102 nomad[30298]: 2017/11/13 09:43:54 [ERR] nomad.fsm: UpsertDeploymentPromotion failed: no canaries to promote
Nov 13 09:43:54 admin-10-32-151-102 nomad[30298]: 2017/11/13 09:43:54.848116 [ERR] http: Request /v1/deployment/promote/233039b3-f6ba-130c-c1bb-135b46b1a318, error: no canaries to promote

Today we tried upgrading again, this time going one version at a time from rc1 -> rc2 -> rc3 -> 0.7.0, and couldn't replicate the issue this time.

We do shut down the agents each night, so maybe the problem will appear again tomorrow - I'll be sure to update.

@discobean @commarla I would love some reproduction steps. To be clear: a setup that reproduces this starting from scratch, with clear steps.

I spent quite a bit of time trying to reproduce this today and I couldn't. To reproduce I tried:

  1. Starting a cluster of Nomad 0.7.0-rc1 servers and beta1 clients since that is what you both had.
  2. Running jobs.
  3. Upgrading servers to 0.7.0
  4. Running more jobs to trigger deployments.

Further, I tried the above with everything on 0.7.0-rc1 and also on beta1. I also tried causing various leader elections during the process. Nothing I did could cause what you all are seeing.

I also played around with it in a fresh 0.7.0 cluster (no upgrades) and it worked properly as well.

@dadgar, I can no longer reproduce it either. Even after the cluster came up this morning, it's all working as expected right now - I have no leads at all.

Okay, I am going to close this. Please bump the issue if it comes up again. If it does, please collect all the server logs as well as the allocation information for the misbehaving job, both before and after using the promote workaround.
