Hey,
We're running Nomad 0.11.3 with agents that have an AWS EBS volume attached. While draining a node, quite often the volume cannot be released from the EC2 instance, which causes the following error during rescheduling:
failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: volume "vol-0e5f3d9ecd9a" is already published at node "i-0cf3069676ehgy5i2" but with capabilities or a read_only setting incompatible with this request: rpc error: code = AlreadyExists desc = Resource already exists
Amazon Linux 2 AMI 2.0.20190313 x86_64 HVM gp2
Best regards,
Mishel Liberman
Possibly related: https://github.com/hashicorp/nomad/issues/8080
Yes, it seems to be similar, although I haven't seen evidence of a race condition between attach and detach.
Attaching the controller's logs:
I0622 19:00:58.428362 1 driver.go:62] Driver: ebs.csi.aws.com Version: v0.6.0-dirty
I0622 19:01:01.540337 1 driver.go:132] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0622 19:01:03.453981 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:01:27.901009 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0e5f3d9ecd9abf8ec NodeId:i-01861d5a3a944f4ca VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER > Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:01:32.091877 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0e5f3d9ecd9abf8ec
E0622 19:01:32.091900 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0622 19:01:33.455003 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:03.455835 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:32.119677 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0e5f3d9ecd9abf8ec NodeId:i-05faceeb37923be3a VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER > Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:33.205197 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0e5f3d9ecd9abf8ec
E0622 19:02:33.205225 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0622 19:02:33.456683 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:03:03.457589 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:03:33.458441 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:04:03.459431 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
Hi @mishel170! A couple questions so that I can pin down the behavior:
- Are the node plugins being run as system jobs?
- Is this happening in cases where the controller plugin was on the drained node?
They run as system jobs, and it happens when the controller runs on the same node.
Thanks for helping
@mishel170 I'm still investigating how to fix this, but in the meantime I do want to give you what I think will work as a workaround for this class of problem.
Use the -ignore-system flag while draining the node, so that the node plugin stays on the host even after it's done draining and you avoid the race with the volumewatcher.
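For example, the drain command would look roughly like this (the node ID is a placeholder):

nomad node drain -enable -ignore-system <node-id>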
Thanks @tgross. Unfortunately, running the controller as a system job didn't help.
As a workaround, I'm manually detaching the volume on each node failover until the problem is fixed.
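The manual detach is roughly this AWS CLI call (volume ID is a placeholder):

aws ec2 detach-volume --volume-id <volume-id>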
Looking forward to an update,
thanks.
It looks like there are a couple of different problems in the node drain scenario. Among them: even if we have the node plugin running as a system job, the node plugin is stopped as soon as the volume-claiming allocation stops (racing the volumewatcher), so we won't be able to send it detach RPCs.

What I'm attempting to do to fix this:
- Changes to the csi_hook.go Postrun step.
- Extending the CSIVolumeClaim struct to include the external node ID, so that we don't have to rely on the node plugin to be running to execute ControllerUnpublish.
- Updating the volumewatcher to change how we handle claim changes as well.

I wanted to give a little status update on where this is. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release.
I've tested this out and the results are... better? Not as good as I'd like yet. As explained in https://github.com/hashicorp/nomad/issues/8232#issuecomment-665099966, the first replacement allocation that gets placed during a node drain fails and we can't do much about that at the moment. That error looks like this:
2020-08-07T14:55:35-04:00 Setup Failure failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: controller publish: attach volume: rpc error: controller attach volume: volume "vol-07b2ea432ef9a8dce" is already published at node "i-0685ed47dc21535db" but with capabilities or a read_only setting incompatible with this request: rpc error: code = AlreadyExists desc = Resource already exists
When this happens, we're still in the process of detaching the volume from the other node. The _next_ allocation should then place successfully, but instead it errors at first and only recovers after several attempts. From the logs I suspect we're seeing an error because the kernel hasn't picked up that the storage provider attached the device yet (or, in this case, the EC2 API hasn't made the device available yet). Because there's only a single claim on the volume, we're going through the whole controller detach cycle with each attempt. I'll be investigating that on Monday to see if we can improve the situation at all.
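While that's happening, one way to check what the EC2 API itself reports for the attachment state is something like:

aws ec2 describe-volumes --volume-ids vol-07b2ea432ef9a8dce --query 'Volumes[0].Attachments'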
In the example below, the drained node is 6f643c28:
Allocations
ID Node ID Task Group Version Desired Status Created Modified
3f8119ed 03e77a2f group 0 run running 1m7s ago 52s ago
078de3a2 03e77a2f group 0 stop failed 3m7s ago 1m7s ago
b5edab42 03e77a2f group 0 stop failed 4m7s ago 3m7s ago
3938e036 03e77a2f group 0 stop failed 4m43s ago 4m7s ago
8745a66d 6f643c28 group 0 stop complete 5m39s ago 4m24s ago
Allocations b5edab42 and 078de3a2 get the following error at the client from the NodeStageVolume RPC to the plugin:
2020-08-07T14:56:05-04:00 Setup Failure failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided
Logs from the node plugin for the first failed allocation 3938e036:
I0807 18:57:05.467988 1 node.go:94] NodeStageVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER > Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:57:05.468081 1 node.go:143] NodeStageVolume: volume="vol-07b2ea432ef9a8dce" operation finished
I0807 18:57:05.468105 1 node.go:145] donedone
E0807 18:57:05.468113 1 driver.go:109] GRPC error: rpc error: code = InvalidArgument desc = Device path not provided
vs the successful run for allocation 3f8119ed:
I0807 18:59:09.316753 1 node.go:94] NodeStageVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[devicePath:/dev/xvdba] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER > Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:09.316876 1 node.go:158] NodeStageVolume: find device path /dev/xvdba -> /dev/xvdba
I0807 18:59:09.317095 1 node.go:193] NodeStageVolume: formatting /dev/xvdba and mounting at /csi/staging/ebs-vol0/rw-file-system-single-node-writer with fstype ext4
I0807 18:59:09.317109 1 mount_linux.go:274] Checking for issues with fsck on disk: /dev/xvdba
I0807 18:59:09.340800 1 mount_linux.go:293] Attempting to mount disk: ext4 /dev/xvdba /csi/staging/ebs-vol0/rw-file-system-single-node-writer
I0807 18:59:09.340836 1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o defaults /dev/xvdba /csi/staging/ebs-vol0/rw-file-system-single-node-writer])
I0807 18:59:09.349363 1 node.go:143] NodeStageVolume: volume="vol-07b2ea432ef9a8dce" operation finished
I0807 18:59:09.349465 1 node.go:145] donedone
I0807 18:59:09.351076 1 node.go:279] NodePublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[devicePath:/dev/xvdba] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer TargetPath:/csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER > Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:09.351129 1 node.go:440] NodePublishVolume: creating dir /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer
I0807 18:59:09.351282 1 node.go:450] NodePublishVolume: mounting /csi/staging/ebs-vol0/rw-file-system-single-node-writer at /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer with option [bind] as fstype ext4
I0807 18:59:09.351299 1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o bind /csi/staging/ebs-vol0/rw-file-system-single-node-writer /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer])
I0807 18:59:09.353033 1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o bind,remount /csi/staging/ebs-vol0/rw-file-system-single-node-writer /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer])
There are no other errors at the client. The controller logs all seem fine.
Controller logs (a634b3c5):
initial failed attempt:
I0807 18:55:35.405415 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
E0807 18:55:35.405445 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0807 18:55:47.297624 1 controller.go:292] ControllerUnpublishVolume: volume vol-07b2ea432ef9a8dce detached from node i-01c7526848c0ba28a
I0807 18:55:47.304175 1 controller.go:292] ControllerUnpublishVolume: volume vol-07b2ea432ef9a8dce detached from node i-01c7526848c0ba28a
successful attempt from controller:
I0807 18:59:05.493856 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce NodeId:i-0685ed47dc21535db VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER > Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:06.179736 1 cloud.go:364] AttachVolume volume="vol-07b2ea432ef9a8dce" instance="i-0685ed47dc21535db" request returned {
AttachTime: 2020-08-07 18:59:06.142 +0000 UTC,
Device: "/dev/xvdba",
InstanceId: "i-0685ed47dc21535db",
State: "attaching",
VolumeId: "vol-07b2ea432ef9a8dce"
}
I0807 18:59:09.310673 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
I0807 18:59:09.310710 1 controller.go:268] ControllerPublishVolume: volume vol-07b2ea432ef9a8dce attached to node i-0685ed47dc21535db through device /dev/xvdba
Controller logs (7c4c4267):
successful attempt:
I0807 18:54:33.408268 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce NodeId:i-01c7526848c0ba28a VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER > Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:54:33.942156 1 cloud.go:364] AttachVolume volume="vol-07b2ea432ef9a8dce" instance="i-01c7526848c0ba28a" request returned {
AttachTime: 2020-08-07 18:54:33.904 +0000 UTC,
Device: "/dev/xvdba",
InstanceId: "i-01c7526848c0ba28a",
State: "attaching",
VolumeId: "vol-07b2ea432ef9a8dce"
}
I0807 18:54:40.425670 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
I0807 18:54:40.425698 1 controller.go:268] ControllerPublishVolume: volume vol-07b2ea432ef9a8dce attached to node i-01c7526848c0ba28a through device /dev/xvdba
I've also opened https://github.com/hashicorp/nomad/issues/8609 to cover the issue that we're not going to be able to fix in the 0.12.x timeframe.
Ok, one last item that isn't going to make it into the 0.12.2 release but can be worked around: if you have HA controllers, you need to make sure they don't land on the same client during node drains. See https://github.com/hashicorp/nomad/issues/8628.
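One way to keep them apart today is a constraint on the controller plugin job so its allocations never share a client. A minimal sketch, assuming a Docker-driven controller job (the job name, image tag, and plugin ID here are illustrative, not taken from this thread):

job "ebs-csi-controller" {
  datacenters = ["dc1"]

  group "controller" {
    count = 2

    # keep the HA controller allocations on different clients
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:v0.6.0"
        args  = ["controller", "--endpoint=unix://csi/csi.sock", "--logtostderr", "--v=5"]
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "controller"
        mount_dir = "/csi"
      }
    }
  }
}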
But otherwise testing is showing that the work that's landed for 0.12.2 should close this out. That will be released shortly.