Nomad: CSI: Migrating Allocation w/ CSI Volume to New Node Fails

Created on 30 May 2020 · 6 comments · Source: hashicorp/nomad

Nomad version

Nomad servers and clients are both running this version:
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)

Operating system and Environment details

Amazon Linux 2:
4.14.173-137.229.amzn2.x86_64

EBS CSI Plugin:

ID                   = aws-ebs1
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 4
Nodes Expected       = 4

Issue

What I've observed is that when a Nomad client is drained along with its system jobs (either because the client is being shut down, or because the node is marked ineligible and both system jobs and normal jobs are drained), the CSI volume gets into a bad state.

When the job that owns the CSI volume is removed from the node, the ControllerUnpublishVolume RPC is issued, but the call is canceled via its context:

ebs-csi-controller-plugin-service E0528 00:06:23.578472 1 driver.go:109] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-0c17xxxxxxxxxxxxx" from node "i-0e58xxxxxxxxxxxxx": RequestCanceled: request context canceled
ebs-csi-controller-plugin-service I0528 00:06:06.271365 1 controller.go:275] ControllerUnpublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0e58xxxxxxxxxxxxx Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Because this call never completes, the EBS volume remains attached to the old node/instance. When the job is scheduled onto another node, the controller receives a ControllerPublishVolume call, which fails with an AlreadyExists error:

ebs-csi-controller-plugin-service I0528 00:07:42.336709 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0c17xxxxxxxxxxxxx
ebs-csi-controller-plugin-service E0528 00:07:42.336732 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
ebs-csi-controller-plugin-service I0528 00:07:41.953571 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0394xxxxxxxxxxxxx VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

I looked into this error a bit. According to the CSI spec, AlreadyExists should "Indicate that a volume corresponding to the specified volume_id has already been published at the node corresponding to the specified node_id but is incompatible with the specified volume_capability or readonly flag."
However, that doesn't appear to match how the EBS CSI plugin is using it. It seems the plugin should be returning FAILED_PRECONDITION instead, since under the CSI spec that more accurately reflects the case where the volume is still attached to a different node.

I've filed an issue against the EBS CSI plugin to get their input, since that behavior might be out of spec. There is also a line of code in the plugin that shows why this error is thrown (the volume is simply still attached to the other node).

Anyway, with that cleared up, looking back at the CSI controller logs I could see that Nomad calls ControllerUnpublishVolume but then issues ControllerPublishVolume before the unpublish has completed and freed the volume. This results in a sometimes endless loop in which ControllerUnpublishVolume tries to complete before the next ControllerPublishVolume executes and appears to cancel it; in some cases it never succeeds, and in others it resolves after a few tries. Here is a small snippet of logs showing this:

ebs-csi-controller-plugin-service E0528 00:09:29.862416 1 driver.go:109] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-0c17xxxxxxxxxxxxx" from node "i-0e58xxxxxxxxxxxxx": RequestCanceled: request context canceled caused by: context canceled
ebs-csi-controller-plugin-service I0528 00:09:27.636101 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
ebs-csi-controller-plugin-service E0528 00:09:13.690648 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
ebs-csi-controller-plugin-service I0528 00:09:13.690622 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0c17xxxxxxxxxxxxx
ebs-csi-controller-plugin-service I0528 00:09:13.279797 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0394xxxxxxxxxxxxx VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
ebs-csi-controller-plugin-service I0528 00:09:12.449520 1 controller.go:275] ControllerUnpublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0e58xxxxxxxxxxxxx Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Just to reiterate: this issue happens only when the volume is still attached to another node (EC2 instance) and the controller can't detach it because the request is being cancelled. The attach will sometimes succeed when the volume happens to be detached before the ControllerPublishVolume call goes through, but from first observations it seems fairly random whether that happens.

Reproduction steps

The easiest way I've found to reproduce this issue somewhat consistently is to schedule a job that uses a CSI volume and then perform a non-forced, 3-minute drain (including system jobs) on the node it is running on. When the job that needs the CSI volume is rescheduled to a new node in the same AZ, the issue can be observed.

  1. Start the CSI controller plugin
job "plugin-aws-ebs-controller" {
  datacenters = ["sandbox"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs1"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully 
      kill_timeout = "2m"
    }
  }
}
  2. Start the CSI node job:
job "plugin-aws-ebs-nodes" {
  datacenters = ["dc1"]

  # you can run node plugins as service jobs as well, but this ensures
  # that all nodes in the DC have a copy.
  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "node",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "aws-ebs1"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully 
      kill_timeout = "2m"
    }
  }
}
  3. Register an EBS volume in the same region/AZ. I followed the steps here, with us-west-2b-a as my volume id/name (a registration sketch follows below):
    https://learn.hashicorp.com/nomad/stateful-workloads/csi-volumes
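For anyone reproducing this, a volume registration following that guide looks roughly like the sketch below and is registered with nomad volume register. The external_id here reuses the redacted EBS volume ID from the controller logs above; the exact field values are illustrative rather than my literal file.

# volume.hcl -- registration sketch for the EBS volume used by the mysql job below
id              = "us-west-2b-a"
name            = "us-west-2b-a"
type            = "csi"
external_id     = "vol-0c17xxxxxxxxxxxxx"   # the EBS volume created in us-west-2b
plugin_id       = "aws-ebs1"
access_mode     = "single-node-writer"
attachment_mode = "file-system"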

  4. Register the mysql job:

job "mysql-server" {
  datacenters = ["sandbox"]
  type        = "service"
  group "mysql-server" {
    constraint {
        attribute = "${node.class}"
      operator  = "="
      value     = "experiment"
    }
    # filtering task group to the AZ our EBS volume is located in
    constraint {
        attribute = "${attr.platform.aws.placement.availability-zone}"
      operator  = "="
      value     = "us-west-2b"
    }
    count = 1
    volume "mysql" {
      type      = "csi"
      read_only = false
      source    = "us-west-2b-a"
    }
    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }
    task "mysql-server" {
      driver = "docker"
      volume_mount {
        volume      = "mysql"
        destination = "/srv"
        read_only   = false
      }
      env = {
        "MYSQL_ROOT_PASSWORD" = "password"
      }
      config {
        image = "hashicorp/mysql-portworx-demo:latest"
        args = ["--datadir", "/srv/mysql"]
        port_map {
          db = 3306
        }
      }
      resources {
        cpu    = 500
        memory = 1024
        network {
          port "db" {
            static = 3306
          }
        }
      }
      service {
        name = "mysql-server"
        port = "db"
        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
  5. Mark the current client that the job with the CSI volume is running on as ineligible, then drain the node with a 3m deadline, including system jobs. The node should have logs something like this when complete:
May 25, '20 19:28:06 -0700      CSI     failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:27:51 -0700      Storage Unmount volume
May 25, '20 19:27:46 -0700      CSI     failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:27:46 -0700      Drain   Node drain complete
May 25, '20 19:24:31 -0700      CSI     failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:24:29 -0700      Drain   Node drain strategy set
May 25, '20 19:23:58 -0700      Cluster Node marked as ineligible for scheduling
May 25, '20 19:22:45 -0700      Storage Mount volume
  6. When the job is rescheduled to a new node, you should be able to observe in the csi-ebs-controller logs the scenario above, where ControllerPublishVolume and ControllerUnpublishVolume are called and step on each other. It might take a few tries, since the results can differ each time; sometimes it will resolve on its own if the unpublish is called early enough or if the volume is successfully detached while the node is draining.

Nomad Task Event Log

Setup Failure    failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: controller publish: attach volume: rpc error: controller attach volume: rpc error: code = AlreadyExists desc = Resource already exists
Received         Task received by client

All 6 comments

Hi @kainoaseto! I'm looking into this!

Possibly related: https://github.com/hashicorp/nomad/issues/8232

After doing a bit more testing, it looks like another failure mode we can see here is that the volumewatcher runs asynchronously, so it can try to NodeUnpublish _after_ the node plugin has been removed from the node, which results in an error like:

2020-06-23T13:08:37.919Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=ebs-vol0 error="1 error occurred:
        * could not detach from node: rpc error: node detach volume: plugin aws-ebs0 for type csi-node not found

"

@kainoaseto I'm investigating how to fix this, but in the meantime I do want to give you what I think will work as a workaround for this class of problem.

  • Run multiple controller plugins, spread across the cluster, so that a draining node always has a controller to talk to (see the sketch after this list).
  • Run the node plugins as system jobs, as you're doing
  • Use the -ignore-system flag while node draining to ensure that the node plugin stays on the host even after it's done draining so as to avoid the race with the volumewatcher.
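To illustrate the first bullet, here's a minimal sketch of the controller job from the reproduction above adjusted to run more than one controller on separate clients. The count of 2 and the distinct_hosts constraint are assumptions for illustration, not a tested configuration from this issue:

job "plugin-aws-ebs-controller" {
  datacenters = ["sandbox"]

  group "controller" {
    # run a second controller so a draining node still leaves one reachable
    count = 2

    # keep the controller allocations on different clients
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs1"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully
      kill_timeout = "2m"
    }
  }
}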

Hi @tgross, thank you for looking into this! Ah, that's interesting; I believe I've run into that one at some point, with mixed results reproducing it consistently. I'm interested in how that one will be resolved; I'm guessing some kind of synchronous call, or waiting for all the volumes to be unpublished before completing the drain, would be needed.

Thank you for the workaround! I will test that as a possible solution in the meantime. One problem I might still run into is that the Nomad clients used for this cluster are spot instances (provisioned through spot.io/Spotinst). We get about a 5-minute drain before the instance is terminated, and I believe that forces system jobs to drain off as well, which is another way I've been able to reproduce this issue. I'll see whether it's possible to modify that drain to ignore system jobs for spot instances, but I'm not sure if it is.

I've tried running multiple controllers, but if one of them gets shut down (for example, the node it was on is drained and it starts up on a new node) I've run into this bug: https://github.com/hashicorp/nomad/issues/8034. It looks like there's a workaround of restarting the controllers if they get into a bad state, though.

There's a status update on this issue here: https://github.com/hashicorp/nomad/issues/8232#issuecomment-670679576

For the sake of our planning, I'm going to close this issue and we'll continue to track progress in https://github.com/hashicorp/nomad/issues/8232.
