Nomad servers and clients are both running this version:
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)
Amazon Linux 2, kernel 4.14.173-137.229.amzn2.x86_64
1 or 3 Nomad servers (tested with both cluster sizes)
After deregistering a volume, the CSIVolumeGC evaluation continues to run against that volume and fails with "volume not found". This happens consistently on the cluster I've been using for CSI testing, and the state seems to be persisted in Raft somewhere: I've tried restarting the cluster, resizing the cluster, and even modifying the evaluation code to always pass on these volume failures, but after restarting with the 0.11.2 code these old volumes still fail to be GC'd.
We noticed this when it took down our development servers: we had deregistered quite a few volumes, and on startup the leader tries to process all of them and runs out of CPU.
To reproduce, follow the guide here and then run `nomad volume deregister mysql`; in shell form the sequence looks roughly like the sketch below.
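The file and job names below are placeholders based on the guide's example specs; the only step that matters for this issue is the final deregister.
```
# Bring up the CSI plugin and register the volume, following the guide
# (file names are placeholders for the guide's job and volume specs).
nomad job run plugin-ebs-controller.nomad
nomad job run plugin-ebs-nodes.nomad
nomad volume register volume.hcl

# Run and then stop the example workload that claims the volume.
nomad job run mysql.nomad
nomad job stop -purge mysql-server

# Deregister the volume; after this the CSIVolumeGC evaluations start failing.
nomad volume deregister mysql
```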
The Nomad server logs periodically show the errors below, with seemingly no way to stop them.
```
nomad.fsm: CSIVolumeClaim failed: error="volume not found: mysql"
worker: error invoking scheduler: error="failed to process evaluation: volume not found: mysql"
```
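For reference, a quick way to confirm what the servers can see is to query the volume directly and force a garbage-collection pass (using the `mysql` volume ID from the guide); neither is shown here as a fix.
```
# The deregistered volume is no longer visible to the CLI, even though the
# GC evaluations keep referencing it.
nomad volume status mysql

# Force an immediate garbage-collection cycle; the stuck evaluations in this
# report survived restarts and cluster resizes, so this may not clear them.
nomad system gc
```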
Hey @kainoaseto, we just released 0.11.3, which has some improvements to the GC loop. Can you give that a try and see if it cleans these up?
I'm running a mix of 0.11.3 and 0.12.0 currently. The leader is 0.12.0 at the moment, and I see in its logs:
```
2020-07-18T20:26:52.021Z [WARN] nomad: eval reached delivery limit, marking as failed: eval="<Eval "799729e4-90b9-9646-4fe2-6b2a39e26508" JobID: "csi-volume-claim-gc:data-test" Namespace: "default">"
2020-07-18T20:26:52.380Z [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume not found: data-test"
```
Wanted to give a quick status update. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release:
I believe these fixes combined should get us into pretty good shape, and #8584 will give you an escape hatch to manually detach the volume via `nomad volume detach` once that's merged.
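For reference, the manual detach is expected to look something like this once it ships (volume and node IDs are placeholders):
```
# Manually detach a volume from a client node (IDs here are placeholders).
nomad volume detach mysql 5c619ef2
```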
Thank you @tgross! This is an exciting development, and we're really looking forward to testing CSI again when 0.12.2 drops. We appreciate the follow-up on these issues and all the work you've done to stabilize CSI; this is what keeps us coming back to Nomad time and again.
Thank you!
Just to note: the volume I see in the log message above doesn't appear in `nomad volume status`, so I don't know if `nomad volume detach` would help me, but hopefully one of the other changes will fix it (#8605 perhaps?)
I've closed #8285, #8145, and #8057 as duplicates of this issue; I'll continue to collect status updates here as we wrap up testing for 0.12.2.
Testing for 0.12.2 looks good. Going to close this issue out, and 0.12.2 will be shipped shortly.