Nomad: CSI: Volume GC Evaluation Fails on Deregistered Volumes

Created on 3 Jun 2020 · 7 Comments · Source: hashicorp/nomad

Nomad version

Nomad servers and clients both running this version
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)

Operating system and Environment details

Amazon Linux 2 (kernel 4.14.173-137.229.amzn2.x86_64)

1 or 3 Nomad servers (tested with both cluster sizes)

Issue

After deregistering a volume, the CSIVolumeGC evaluation continues to run against that volume and fails with "volume not found". This happens consistently on the cluster I've been using for CSI testing, and it appears to be persisted somewhere in the Raft state: I've tried restarting the cluster, resizing the cluster, and even modifying the evaluation code to always pass on these volume failures, but on restarting with the stock 0.11.2 code these old volumes still fail to be GC'd.

We noticed this when it took down our development servers: we had deregistered quite a few volumes, and on startup the leader tries to process all of them and runs out of CPU.

Reproduction steps

  1. Follow the guide here

  2. Run `nomad volume deregister mysql`

  3. The Nomad server logs will then periodically show the errors below, with seemingly no way to stop them (a condensed command sketch follows these steps).
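For reference, here is the loop condensed into commands. This is a sketch based on the guide in step 1: the volume ID `mysql` comes from that guide, and the `volume.hcl` spec file name is a placeholder.

```
# Register a CSI volume as in the guide (volume.hcl is a
# placeholder for the volume spec from step 1).
nomad volume register volume.hcl

# Deregister it; from this point on, the CSIVolumeGC
# evaluation keeps referencing the volume and failing.
nomad volume deregister mysql

# The "volume not found: mysql" errors then recur in the
# server logs (see the excerpt below).
```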

Nomad Server logs (if appropriate)

```
nomad.fsm: CSIVolumeClaim failed: error="volume not found: mysql"
worker: error invoking scheduler: error="failed to process evaluation: volume not found: mysql"
```

Labels: theme/storage, type/bug

All 7 comments

Hey @kainoaseto, we just released 0.11.3, which has some improvements to the GC loop. Can you give that a try and see whether it cleans these up?

I'm currently running a mix of 0.11.3 and 0.12.0. The leader is on 0.12.0, and I see in its logs:

```
2020-07-18T20:26:52.021Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval "799729e4-90b9-9646-4fe2-6b2a39e26508" JobID: "csi-volume-claim-gc:data-test" Namespace: "default">"
2020-07-18T20:26:52.380Z [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume not found: data-test"
```

Wanted to give a quick status update. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release:

  • #8561 retries controller RPCs so that we take advantage of controllers deployed in an HA configuration.
  • #8572 and #8579 improve the plumbing so that volume claim reaping is synchronous in common cases, which reduces the number of places where things can go wrong (and makes the process easier to reason about).
  • #8580 uses that plumbing to drive the volume claim unpublish step from the client, so that in most cases (except when we lose touch with the client) we're running the volume unpublish synchronously as part of allocation shutdown.
  • #8605 improves our error handling so that the checkpointing we do will work correctly by ignoring "you already did that" errors.
  • #8607 fixes some missing ACLs and Region flags in the Nomad leader.

I believe these fixes combined should get us into pretty good shape, and #8584 will give you an escape hatch to manually detach the volume via `nomad volume detach` once that's merged.
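For anyone waiting on that escape hatch, usage would look roughly like the sketch below once #8584 is merged; `VOLUME_ID` and `NODE_ID` are placeholders for the values reported by `nomad volume status` and `nomad node status`.

```
# Manually release a stuck claim by detaching the volume from
# the client node that still holds it. Both IDs are placeholders.
nomad volume detach "$VOLUME_ID" "$NODE_ID"
```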

Thank you @tgross! This is a really exciting development, and we're looking forward to testing out CSI again when 0.12.2 drops. We appreciate the follow-up on these issues and all the work you've done to stabilize CSI; this is what keeps us coming back to Nomad time and again.

Thank you!

Just to note: the volume I see in the log message above doesn't appear in `nomad volume status`, so I don't know if `nomad volume detach` would help me, but hopefully one of the other changes will fix it (#8605 perhaps?)
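For example, the check that comes up empty looks like this (a sketch; `data-test` is the volume ID from the log lines above):

```
# The deregistered volume no longer appears in the volume list,
# so there is no ID left to pass to `nomad volume detach`.
nomad volume status

# Querying it directly fails too, since the volume no longer
# exists as far as the HTTP API is concerned.
nomad volume status data-test
```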

I've closed #8285, #8145, and #8057 as duplicates of this issue; I'll continue to collect status updates here as we wrap up testing for 0.12.2.

Testing for 0.12.2 looks good. Going to close this issue out, and 0.12.2 will be shipped shortly.
