Nomad servers and clients are both running this version:
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)
Amazon Linux 2:
4.14.173-137.229.amzn2.x86_64
EBS CSI Plugin:
ID = aws-ebs1
Provider = ebs.csi.aws.com
Version = v0.6.0-dirty
Controllers Healthy = 1
Controllers Expected = 1
Nodes Healthy = 4
Nodes Expected = 4
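For reference, the plugin details above come from checking the plugin status, along the lines of:
nomad plugin status aws-ebs1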
What I've observed is that when a Nomad client is drained along with its system jobs (either the Nomad client is shut down, or the node is marked ineligible and both system jobs and normal jobs are drained), the CSI volume gets into a bad state.
When the job that owns the CSI volume is being removed from the node, the ControllerUnpublishVolume RPC call is executed but gets cancelled via its context:
ebs-csi-controller-plugin-service E0528 00:06:23.578472 1 driver.go:109] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-0c17xxxxxxxxxxxxx" from node "i-0e58xxxxxxxxxxxxx": RequestCanceled: request context canceled
ebs-csi-controller-plugin-service I0528 00:06:06.271365 1 controller.go:275] ControllerUnpublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0e58xxxxxxxxxxxxx Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
Because this call never completes, the EBS volume remains attached to the old node/instance. When the job is scheduled onto another node, the controller receives a ControllerPublishVolume call, which fails with an AlreadyExists error:
ebs-csi-controller-plugin-service I0528 00:07:42.336709 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0c17xxxxxxxxxxxxx
ebs-csi-controller-plugin-service E0528 00:07:42.336732 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
ebs-csi-controller-plugin-service I0528 00:07:41.953571 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0394xxxxxxxxxxxxx VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER > Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
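One way to sanity-check that the EBS volume really is still attached to the old instance while it's in this state (just a suggestion using the AWS CLI, not anything Nomad-specific) is to look at the volume's attachments directly:
aws ec2 describe-volumes --volume-ids vol-0c17xxxxxxxxxxxxx --query 'Volumes[0].Attachments'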
I looked into this error a bit, and according to the CSI spec it should "Indicate that a volume corresponding to the specified volume_id has already been published at the node corresponding to the specified node_id but is incompatible with the specified volume_capability or readonly flag."
However, this appears to be implemented incorrectly in the EBS CSI plugin. It seems the plugin should return the FAILED_PRECONDITION error instead, since under the CSI spec that more accurately reflects that the volume is still attached to a different node.
I've filed this issue on the EBS CSI plugin to get their input, since that behavior might be out of spec. Here is also the line of code that shows why this error is being thrown (the volume is simply still attached to the other node).
Anyway, once that was cleared up, looking back at the CSI controller logs I could see that Nomad calls ControllerUnpublishVolume but also calls ControllerPublishVolume before the unpublish is able to complete and free the volume. This results in a sometimes endless loop of ControllerUnpublishVolume trying to complete before ControllerPublishVolume executes and cancels it; in some cases this never resolves, and in others it resolves after a few tries. Here is a small snippet of logs to show this:
ebs-csi-controller-plugin-service E0528 00:09:29.862416 1 driver.go:109] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-0c17xxxxxxxxxxxxx" from node "i-0e58xxxxxxxxxxxxx": RequestCanceled: request context canceled caused by: context canceled
ebs-csi-controller-plugin-service I0528 00:09:27.636101 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
ebs-csi-controller-plugin-service E0528 00:09:13.690648 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
ebs-csi-controller-plugin-service I0528 00:09:13.690622 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0c17xxxxxxxxxxxxx
ebs-csi-controller-plugin-service I0528 00:09:13.279797 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0394xxxxxxxxxxxxx VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER > Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
ebs-csi-controller-plugin-service I0528 00:09:12.449520 1 controller.go:275] ControllerUnpublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0e58xxxxxxxxxxxxx Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
Just to reiterate, this issue only happens when the volume is still attached to another node (EC2 instance) and the controller isn't able to detach it because the request is being cancelled. It will sometimes resolve on its own when the volume is detached successfully before the ControllerPublishVolume call goes through, but from first observations it seems fairly random whether that happens.
The easiest way I've found to reproduce this issue somewhat consistently is to schedule a job that uses a CSI volume and then do a non-forced, 3-minute drain (including system jobs) on the node it is running on. When the job that needs the CSI volume is rescheduled to a new node in the same AZ, the issue can be observed.
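For reference, the drain I run looks roughly like this (the node ID is a placeholder; by default the drain also stops system job allocations):
nomad node drain -enable -deadline 3m <node-id>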
job "plugin-aws-ebs-controller" {
datacenters = ["sandbox"]
group "controller" {
task "plugin" {
driver = "docker"
config {
image = "amazon/aws-ebs-csi-driver:latest"
args = [
"controller",
"--endpoint=unix://csi/csi.sock",
"--logtostderr",
"--v=5",
]
}
csi_plugin {
id = "aws-ebs1"
type = "controller"
mount_dir = "/csi"
}
resources {
cpu = 500
memory = 256
}
# ensuring the plugin has time to shut down gracefully
kill_timeout = "2m"
}
}
}
job "plugin-aws-ebs-nodes" {
datacenters = ["dc1"]
# you can run node plugins as service jobs as well, but this ensures
# that all nodes in the DC have a copy.
type = "system"
group "nodes" {
task "plugin" {
driver = "docker"
config {
image = "amazon/aws-ebs-csi-driver:latest"
args = [
"node",
"--endpoint=unix://csi/csi.sock",
"--logtostderr",
"--v=5",
]
# node plugins must run as privileged jobs because they
# mount disks to the host
privileged = true
}
csi_plugin {
id = "aws-ebs1"
type = "node"
mount_dir = "/csi"
}
resources {
cpu = 500
memory = 256
}
# ensuring the plugin has time to shut down gracefully
kill_timeout = "2m"
}
}
}
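Both plugin jobs are run with nomad job run; the file names here are just placeholders for wherever the specs above are saved:
nomad job run plugin-aws-ebs-controller.nomad
nomad job run plugin-aws-ebs-nodes.nomad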
Register an EBS volume in the same region/AZ. I followed the steps here, with us-west-2b-a as my volume ID/name:
https://learn.hashicorp.com/nomad/stateful-workloads/csi-volumes
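A volume spec along the lines of that guide is sketched below (the external ID is redacted, and the access/attachment modes are what the guide uses for EBS, so treat them as assumptions):
cat > volume.hcl <<'EOF'
# Volume registration for the EBS volume created above.
id              = "us-west-2b-a"
name            = "us-west-2b-a"
type            = "csi"
external_id     = "vol-0c17xxxxxxxxxxxxx"   # EBS volume ID (redacted)
plugin_id       = "aws-ebs1"
access_mode     = "single-node-writer"
attachment_mode = "file-system"
EOF
nomad volume register volume.hcl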
Register the mysql job:
job "mysql-server" {
datacenters = ["sandbox"]
type = "service"
group "mysql-server" {
constraint {
attribute = "${node.class}"
operator = "="
value = "experiment"
}
# filtering task group to the AZ our EBS volume is located in
constraint {
attribute = "${attr.platform.aws.placement.availability-zone}"
operator = "="
value = "us-west-2b"
}
count = 1
volume "mysql" {
type = "csi"
read_only = false
source = "us-west-2b-a"
}
restart {
attempts = 10
interval = "5m"
delay = "25s"
mode = "delay"
}
task "mysql-server" {
driver = "docker"
volume_mount {
volume = "mysql"
destination = "/srv"
read_only = false
}
env = {
"MYSQL_ROOT_PASSWORD" = "password"
}
config {
image = "hashicorp/mysql-portworx-demo:latest"
args = ["--datadir", "/srv/mysql"]
port_map {
db = 3306
}
}
resources {
cpu = 500
memory = 1024
network {
port "db" {
static = 3306
}
}
}
service {
name = "mysql-server"
port = "db"
check {
type = "tcp"
interval = "10s"
timeout = "2s"
}
}
}
}
}
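With the mysql-server job running and claiming the volume (which should be visible in the volume status output), draining the node as described above triggers the issue; the client events below show the sequence. Commands along these lines, with the job file name as a placeholder:
nomad job run mysql-server.nomad
nomad volume status us-west-2b-a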
May 25, '20 19:28:06 -0700 CSI failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:27:51 -0700 Storage Unmount volume
May 25, '20 19:27:46 -0700 CSI failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:27:46 -0700 Drain Node drain complete
May 25, '20 19:24:31 -0700 CSI failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:24:29 -0700 Drain Node drain strategy set
May 25, '20 19:23:58 -0700 Cluster Node marked as ineligible for scheduling
May 25, '20 19:22:45 -0700 Storage Mount volume
ControllerPublishVolume and ControllerUnpublishVolume are being called and step on each other. It might take a few tries to reproduce, since the results can differ each time from what I've found; sometimes it self-resolves if the unpublish is called early enough or if the volume is successfully detached while the node is draining. The allocation that gets rescheduled then shows events like:
Setup Failure failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: controller publish: attach volume: rpc error: controller attach volume: rpc error: code = AlreadyExists desc = Resource already exists
Received Task received by client
Hi @kainoaseto! I'm looking into this!
Possibly related: https://github.com/hashicorp/nomad/issues/8232
After doing a little bit more testing, it looks like the other failure mode we can see here is that the volumewatcher runs asynchronously, so it can try to NodeUnpublish _after_ the node plugin has been removed from the node, which results in an error like:
2020-06-23T13:08:37.919Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=ebs-vol0 error="1 error occurred:
* could not detach from node: rpc error: node detach volume: plugin aws-ebs0 for type csi-node not found
"
@kainoaseto I'm investigating how to fix this, but in the meantime I do want to give you what I think will work as a workaround for this class of problem.
Use the -ignore-system flag while node draining to ensure that the node plugin stays on the host even after the drain is done, so as to avoid the race with the volumewatcher.
Hi @tgross, thank you for looking into this! Ah, that's interesting; I believe I've run into that one at some point, with mixed results reproducing it consistently. I'm interested in how that one will be resolved; I'm guessing some type of synchronous call, or waiting for all the volumes to be unpublished before completing the drain, would need to happen.
Thank you for the workaround! I will test that as a possible solution in the meantime. One problem I might still run into is that the Nomad clients in this cluster are spot instances (provisioned through spot.io/Spotinst). We get about a 5-minute drain window before the instance is terminated, but I believe that forces system jobs to drain off as well, which is another way I've been able to reproduce this issue. I'll see if it's possible to modify that drain to ignore system jobs for spot to help facilitate this fix, but I'm not sure whether that's possible.
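For reference, the workaround drain I plan to test looks roughly like this (node ID is a placeholder):
nomad node drain -enable -deadline 3m -ignore-system <node-id>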
I've also tried running multiple controllers, but if one of them gets shut down (for example, the node it was on drains and it starts up on a new node), I've run into this bug: https://github.com/hashicorp/nomad/issues/8034. It looks like there is a workaround of restarting the controllers if they get into a bad state, though.
There's a status update on this issue here: https://github.com/hashicorp/nomad/issues/8232#issuecomment-670679576
For the sake of our planning, I'm going to close this issue and we'll continue to track progress in https://github.com/hashicorp/nomad/issues/8232.