Nomad: Volume is not released during node drain, causing job to fail forever

Created on 20 Jun 2020 · 9 comments · Source: hashicorp/nomad

Hey,
We're using Nomad 0.11.3 with agents that have an AWS EBS volume attached. When draining a node, quite often the volume cannot be released from the EC2 instance, which causes the following error during rescheduling:

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: volume "vol-0e5f3d9ecd9a" is already published at node "i-0cf3069676ehgy5i2" but with capabilities or a read_only setting incompatible with this request: rpc error: code = AlreadyExists desc = Resource already exists

Operating system and Environment details

Amazon Linux 2 AMI 2.0.20190313 x86_64 HVM gp2

Best regards,
Mishel Liberman

Labels: theme/drain, theme/storage, type/bug

All 9 comments

Hi @mishel170! A couple questions so that I can pin down the behavior:

  • Are the node plugins being run as system jobs?
  • Is this happening in cases where the controller plugin was on the drained node?

Possibly related: #8080

Yes, it seems to be similar, although I haven't observed a race condition between attach and detach myself.
Attaching the controller logs:


I0622 19:00:58.428362       1 driver.go:62] Driver: ebs.csi.aws.com Version: v0.6.0-dirty
I0622 19:01:01.540337       1 driver.go:132] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0622 19:01:03.453981       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:01:27.901009       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0e5f3d9ecd9abf8ec NodeId:i-01861d5a3a944f4ca VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:01:32.091877       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0e5f3d9ecd9abf8ec
E0622 19:01:32.091900       1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0622 19:01:33.455003       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:03.455835       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:32.119677       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0e5f3d9ecd9abf8ec NodeId:i-05faceeb37923be3a VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:33.205197       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0e5f3d9ecd9abf8ec
E0622 19:02:33.205225       1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0622 19:02:33.456683       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:03:03.457589       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:03:33.458441       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:04:03.459431       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Hi @mishel170! A couple questions so that I can pin down the behavior:

  • Are the node plugins being run as system jobs?
  • Is this happening in cases where the controller plugin was on the drained node?

Yes, the node plugins run as system jobs, and it happens when the controller runs on the node being drained.

Thanks for helping

@mishel170 I'm still investigating how to fix this, but in the meantime I do want to give you what I think will work as a workaround for this class of problem.

  • Run multiple controller plugins, spread across the cluster such that a node that drains always has a controller to talk to.
  • Run the node plugins as system jobs, as you're doing
  • Use the -ignore-system flag when draining the node so that the node plugin stays on the host even after the drain completes, avoiding the race with the volumewatcher. (A sketch of this setup follows below.)
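
For reference, here's a minimal sketch of that layout, based on the aws-ebs-csi-driver used in this issue; the job name, plugin ID, image tag, and args are illustrative assumptions, so adapt them to your existing controller job:

job "plugin-aws-ebs-controller" {
  datacenters = ["dc1"]

  group "controller" {
    # run more than one controller so a draining node always leaves one reachable
    count = 2

    # keep the controller instances on different clients
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "plugin" {
      driver = "docker"

      config {
        # image and args shown are illustrative; reuse your existing plugin task's
        image = "amazon/aws-ebs-csi-driver:v0.6.0"
        args  = ["controller", "--endpoint=unix://csi/csi.sock", "--logtostderr", "--v=5"]
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "controller"
        mount_dir = "/csi"
      }
    }
  }
}

The drain itself would then be run with nomad node drain -enable -ignore-system <node-id> so that the node plugin's system allocation isn't stopped before the volume is unpublished.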

thanks @tgross

Unfortunately, running the controller as a system job didn't help.
As a workaround, I'm manually detaching the volume on each node failover until the problem is fixed.
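
For reference, the manual detach is the standard AWS CLI call, with the volume and instance IDs taken from the error message (placeholders below); --force should only be needed if the volume gets stuck in the detaching state:

aws ec2 detach-volume --volume-id <volume-id> --instance-id <instance-id>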

Looking forward to an update,
thanks.

It looks like there's a couple different problems in the node drain scenario:

  • The entire NodeUnpublish, NodeUnstage, ControllerUnpublish workflow runs asynchronously to the draining (or termination) of a client node (in the volumewatcher). Even if the node plugin runs as a system job, it will be stopped as soon as the volume-claiming allocation stops, and at that point we can no longer send it detach RPCs.
  • Also, the controller detach RPC needs the external node ID, so we look that up from the node plugin's fingerprint. Unfortunately, this is both racy with the previous step and prevents us from performing a cleanup controller detachment after the fact.
  • Migrating allocations during a node drain waits for the allocation to be placed on the new node before shutting down the drained allocation. This will always result in an error and a restart of the new allocation, even if we fix everything else. Note that in the typical case where a volume can only be claimed by a single allocation, migrating the application to a new node will always create an outage for the application, no matter what we could come up with in Nomad. So avoiding the error is a nice-to-have in that circumstance, but it doesn't fundamentally change anything.

What I'm attempting to do to fix this:

  • in the client's csi_hook.go Postrun step:

    • ~trigger the NodeUnpublish and NodeUnstage steps locally,~ (these need to happen on the server as well)

    • then send an RPC back to the server that triggers a ControllerUnpublish. This will be a _synchronous_ call so that we can block the node drain until it's complete

  • update our CSIVolumeClaim struct to include the external node ID, so that we don't have to rely on the node plugin to be running to execute ControllerUnpublish.
  • This will need to include a few modifications to volumewatcher to change how we handle claim changes as well.
  • We'll still need the volume GC job to clean up in cases where the node is just gone and never coming back. In that scenario, we'll need to be able to gracefully handle errors from the missing node plugins and the controller if the underlying host is gone as well.

I wanted to give a little status update on where this is. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release:

  • #8561 retries controller RPCs so that we take advantage of controllers deployed in an HA configuration.
  • #8572 #8579 are some improved plumbing that makes volume claim reaping synchronous in common cases, which reduces the number of places where things can go wrong (and makes it easier to reason about).
  • #8580 uses that plumbing to drive the volume claim unpublish step from the client, so that in most cases (except when we lose touch with the client) we're running the volume unpublish synchronously as part of allocation shutdown.
  • #8605 improves our error handling so that the checkpointing we do will work correctly by ignoring "you already did that" errors.
  • #8607 fixes some missing ACLs and Region flags in the Nomad leader

I've tested this out and the results are... better? Not as good as I'd like yet. As explained in https://github.com/hashicorp/nomad/issues/8232#issuecomment-665099966, the first replacement allocation that gets placed during a node drain fails and we can't do much about that at the moment. That error looks like this:

2020-08-07T14:55:35-04:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: controller publish: attach volume: rpc error: controller attach volume: volume "vol-07b2ea432ef9a8dce" is already published at node "i-0685ed47dc21535db" but with capabilities or a read_only setting incompatible with this request: rpc error: code = AlreadyExists desc = Resource already exists

When this happens, we're still in the process of detaching the volume from the other node. The _next_ allocation should place successfully, but instead it hits an error at first and only recovers after several attempts. From the logs, I suspect the error happens because the kernel hasn't picked up that the storage provider attached the device yet (or, in this case, the EC2 API hasn't made the device available yet). Because there's only a single claim on the volume, we're going through the whole controller detach cycle with each attempt. I'll be investigating that on Monday to see if we can improve the situation at all.
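
To watch the claim being released between attempts, the volume's current allocations can be listed from the CLI (the volume ID here is taken from the staging paths in the logs below; output details vary by Nomad version):

nomad volume status ebs-vol0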

In the example below, the drained node is 6f643c28:

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
3f8119ed  03e77a2f  group       0        run      running   1m7s ago   52s ago
078de3a2  03e77a2f  group       0        stop     failed    3m7s ago   1m7s ago
b5edab42  03e77a2f  group       0        stop     failed    4m7s ago   3m7s ago
3938e036  03e77a2f  group       0        stop     failed    4m43s ago  4m7s ago
8745a66d  6f643c28  group       0        stop     complete  5m39s ago  4m24s ago

Allocations b5edab42 and 078de3a2 get the following error at the client from the NodeStageVolume RPC to the plugin:

2020-08-07T14:56:05-04:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided

Logs from the node plugin for the first failed allocation 3938e036:

I0807 18:57:05.467988       1 node.go:94] NodeStageVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:57:05.468081       1 node.go:143] NodeStageVolume: volume="vol-07b2ea432ef9a8dce" operation finished
I0807 18:57:05.468105       1 node.go:145] donedone
E0807 18:57:05.468113       1 driver.go:109] GRPC error: rpc error: code = InvalidArgument desc = Device path not provided

vs the successful run for allocation 3f8119ed:

I0807 18:59:09.316753       1 node.go:94] NodeStageVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[devicePath:/dev/xvdba] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:09.316876       1 node.go:158] NodeStageVolume: find device path /dev/xvdba -> /dev/xvdba
I0807 18:59:09.317095       1 node.go:193] NodeStageVolume: formatting /dev/xvdba and mounting at /csi/staging/ebs-vol0/rw-file-system-single-node-writer with fstype ext4
I0807 18:59:09.317109       1 mount_linux.go:274] Checking for issues with fsck on disk: /dev/xvdba
I0807 18:59:09.340800       1 mount_linux.go:293] Attempting to mount disk: ext4 /dev/xvdba /csi/staging/ebs-vol0/rw-file-system-single-node-writer
I0807 18:59:09.340836       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o defaults /dev/xvdba /csi/staging/ebs-vol0/rw-file-system-single-node-writer])
I0807 18:59:09.349363       1 node.go:143] NodeStageVolume: volume="vol-07b2ea432ef9a8dce" operation finished
I0807 18:59:09.349465       1 node.go:145] donedone
I0807 18:59:09.351076       1 node.go:279] NodePublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[devicePath:/dev/xvdba] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer TargetPath:/csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:09.351129       1 node.go:440] NodePublishVolume: creating dir /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer
I0807 18:59:09.351282       1 node.go:450] NodePublishVolume: mounting /csi/staging/ebs-vol0/rw-file-system-single-node-writer at /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer with option [bind] as fstype ext4
I0807 18:59:09.351299       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o bind /csi/staging/ebs-vol0/rw-file-system-single-node-writer /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer])
I0807 18:59:09.353033       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o bind,remount /csi/staging/ebs-vol0/rw-file-system-single-node-writer /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer])

There are no other errors at the client. The controller logs all seem fine.

controller logs a634b3c5

initial failed attempt:

I0807 18:55:35.405415       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
E0807 18:55:35.405445       1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0807 18:55:47.297624       1 controller.go:292] ControllerUnpublishVolume: volume vol-07b2ea432ef9a8dce detached from node i-01c7526848c0ba28a
I0807 18:55:47.304175       1 controller.go:292] ControllerUnpublishVolume: volume vol-07b2ea432ef9a8dce detached from node i-01c7526848c0ba28a

successful attempt from controller:

I0807 18:59:05.493856       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce NodeId:i-0685ed47dc21535db VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:06.179736       1 cloud.go:364] AttachVolume volume="vol-07b2ea432ef9a8dce" instance="i-0685ed47dc21535db" request returned {
  AttachTime: 2020-08-07 18:59:06.142 +0000 UTC,
  Device: "/dev/xvdba",
  InstanceId: "i-0685ed47dc21535db",
  State: "attaching",
  VolumeId: "vol-07b2ea432ef9a8dce"
}
I0807 18:59:09.310673       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
I0807 18:59:09.310710       1 controller.go:268] ControllerPublishVolume: volume vol-07b2ea432ef9a8dce attached to node i-0685ed47dc21535db through device /dev/xvdba

controller logs 7c4c4267

successful attempt:

I0807 18:54:33.408268       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce NodeId:i-01c7526848c0ba28a VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:54:33.942156       1 cloud.go:364] AttachVolume volume="vol-07b2ea432ef9a8dce" instance="i-01c7526848c0ba28a" request returned {
  AttachTime: 2020-08-07 18:54:33.904 +0000 UTC,
  Device: "/dev/xvdba",
  InstanceId: "i-01c7526848c0ba28a",
  State: "attaching",
  VolumeId: "vol-07b2ea432ef9a8dce"
}
I0807 18:54:40.425670       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
I0807 18:54:40.425698       1 controller.go:268] ControllerPublishVolume: volume vol-07b2ea432ef9a8dce attached to node i-01c7526848c0ba28a through device /dev/xvdba

I've also opened https://github.com/hashicorp/nomad/issues/8609 to cover the issue that we're not going to be able to fix in the 0.12.x timeframe.

Ok, one last item that isn't going to make it into the 0.12.2 release but can be worked around: if you have HA controllers, you need to make sure they don't land on the same client during node drains: https://github.com/hashicorp/nomad/issues/8628
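
Until that's addressed, the distinct_hosts constraint shown earlier (or a spread stanza, sketched below with an assumed group name) keeps the HA controllers on separate clients during drains:

group "controller" {
  count = 2

  # spread controller instances across distinct client nodes
  spread {
    attribute = "${node.unique.id}"
  }
}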

But otherwise testing is showing that the work that's landed for 0.12.2 should close this out. That will be released shortly.
