Nomad: [question] How to force CSI Node and Plugin jobs start before any other workload

Created on 30 Sep 2020 · 5 comments · Source: hashicorp/nomad

How can we force the CSI node and controller plugin jobs to start before any other workload?

After an infrastructure restart, workloads are being scheduled onto nodes that do not yet have the CSI node task running, since it takes some time to download and start.

  • Is there any way to force the other jobs to wait for the CSI node and controller plugins?
  • Can we force the plugin container image to be cached rather than downloaded every time?
  • Can we create a task at the server level that starts the node and controller plugins on the OS and then registers the CSI plugin on the socket address?
  • Can we use the Docker engine to do this instead of relying on Nomad to start these prerequisites?

We also get a lot of errors related to exhausted write claims. If the node and controller plugins are running before all other tasks, everything is fine, but if a plugin fails or takes too long to start, everything gets pretty unstable.

We are using the azuredisk CSI driver, mounting Azure managed disks to Nomad clients.

Labels: theme/storage, type/question

All 5 comments

After an infrastructure restart, workloads are being scheduled onto nodes that do not yet have the CSI node task running, since it takes some time to download and start.

If the CSI node task isn't running, then it won't have fingerprinted with the servers and should not be showing up as a valid scheduling target. Are the plugins being shut down cleanly so that they deregister themselves with the servers?

Can we force the plugin container image to be cached rather than downloaded every time?

In your plugin configuration on the clients, you can rely on Docker's local image cache: the Docker driver only pulls an image that isn't already present (unless `force_pull` is set), so pre-pulling the plugin image onto each client avoids the download delay after a restart.
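As a sketch of that idea (assuming the Docker driver; `force_pull` defaults to `false` in Nomad's docker driver, so an image already on the client is reused from the local cache):

```
task "controller" {
  driver = "docker"

  config {
    image = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"

    # Explicitly keep the default: do not re-pull an image that is
    # already in the local Docker cache. Pre-pulling the image on each
    # client (e.g. a `docker pull` in a boot script) then makes plugin
    # startup independent of registry latency.
    force_pull = false
  }
}
```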

Can we create a task at the server level that starts the node and controller plugins on the OS and then registers the CSI plugin on the socket address?

See my first answer above. The CSI plugin doesn't get registered until after the plugin container is running.

Can we use the Docker engine to do this instead of relying on Nomad to start these prerequisites?

Unfortunately, that's not going to work. The directory containing the CSI control socket needs to be within the Nomad allocation directory, which Nomad creates just before the container is built.

Nice insight, thanks.
Just some related questions.
Here are our jobs; I put them together with input from Gitter conversations and web content. Please review to see if it's all OK.

Controller job

```
job "plugin-azure-disk-controller" {
  datacenters = ["dc1"]
  type = "service"

  vault {
    policies = ["nomad-jobs"]
  }

  group "controller" {
    count = 1

    # disable deployments
    update {
      max_parallel = 0
    }
    task "controller" {
      driver = "docker"

      template {
        change_mode = "noop"
        destination = "local/azure.json"
        data = <<EOH
{
"cloud":"AzurePublicCloud",
"tenantId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_TENANT_ID}}{{end}}",
"subscriptionId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_SUBSCRIPTION_ID}}{{end}}",
"aadClientId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_ID}}{{end}}",
"aadClientSecret": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_SECRET}}{{end}}",
"resourceGroup": "resource-group-name",
"location": "westeurope"
}
EOH
      }

      env {
        AZURE_CREDENTIAL_FILE = "/etc/kubernetes/azure.json"
      }

      config {
        image = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"

        volumes = [
          "local/azure.json:/etc/kubernetes/azure.json"
        ]

        args = [
          "--nodeid=${attr.unique.hostname}-vm",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "az-disk0"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully
      kill_timeout = "2m"
    }
  }
}
```
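For reference, a volume consumed through this plugin is tied to it via `plugin_id`, which must match the `csi_plugin.id` in the job above. A hedged sketch of such a registration (the volume name, external resource ID, and capability settings below are placeholders, not taken from this thread):

```
# volume.hcl -- registered with `nomad volume register volume.hcl`
id          = "example-disk"   # placeholder volume ID
name        = "example-disk"
type        = "csi"
external_id = "<azure-managed-disk-resource-id>"  # placeholder
plugin_id   = "az-disk0"       # matches csi_plugin.id in the plugin jobs

access_mode     = "single-node-writer"
attachment_mode = "file-system"
```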

Node job
```
job "plugin-azure-disk-nodes" {
  datacenters = ["dc1"]

  vault {
    policies = ["nomad-jobs"]
  }

  # you can run node plugins as service jobs as well, but this ensures
  # that all nodes in the DC have a copy.
  type = "system"

  group "nodes" {
    task "node" {
      driver = "docker"

      template {
        change_mode = "noop"
        destination = "local/azure.json"
        data = <<EOH
{
"cloud":"AzurePublicCloud",
"tenantId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_TENANT_ID}}{{end}}",
"subscriptionId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_SUBSCRIPTION_ID}}{{end}}",
"aadClientId": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_ID}}{{end}}",
"aadClientSecret": "{{with secret "kv/azure/credentials"}}{{.Data.ARM_CLIENT_SECRET}}{{end}}",
"resourceGroup": "resource-group-name",
"location": "westeurope"
}
EOH
      }

      env {
        AZURE_CREDENTIAL_FILE = "/etc/kubernetes/azure.json"
      }

      config {
        image = "mcr.microsoft.com/k8s/csi/azuredisk-csi:v0.9.0"

        volumes = [
          "local/azure.json:/etc/kubernetes/azure.json"
        ]

        args = [
          "--nodeid=${attr.unique.hostname}-vm",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "az-disk0"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully
      kill_timeout = "2m"
    }
  }
}
```
We will try to implement a controlled shutdown with a drain, without forcing the system jobs out, and after that stop these jobs. Will that, in your opinion, solve the plugin deregistration issue?
We are using Nomad for development clusters with dev workloads that are shut down at the end of the day. The next morning, with CSI, everything is down. We are trying to get this stable. Any thoughts?

Sometimes the plugin stats get strange, with odd numbers such as more nodes expected than nodes present.

Regarding azuredisk, there is still no documentation. It is very frustrating to only see documentation for k8s and have to port everything to Nomad. Please add more examples.

Please review to see if it's all OK.

I don't have an Azure environment handy to verify them, but those jobs look reasonable. Depending on the size of your deployment, I'd consider running multiple controllers spread across hosts so that you're less likely to have one of them on a node that you're in the middle of draining.

We will try to implement a controlled shutdown with a drain, without forcing the system jobs out, and after that stop these jobs. Will that, in your opinion, solve the plugin deregistration issue?

It should. See this note in the csi_plugin docs:

Note: During node drains, jobs that claim volumes must be moved before the node or monolith plugin for those volumes. You should run node or monolith plugins as system jobs and use the -ignore-system flag on nomad node drain to ensure that the plugins are running while the node is being drained.
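Concretely, the drain described in that note looks something like this (the node ID is a placeholder):

```
# Drain service/batch workloads off the client, but leave system jobs
# (including the CSI node plugin) running until the drain completes:
nomad node drain -enable -ignore-system -yes <node-id>
```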

For your dev environment:

We are using Nomad for development clusters with dev workloads that are shut down at the end of the day. The next morning, with CSI, everything is down. We are trying to get this stable. Any thoughts?

When the servers come up (or if they're left running), they won't have received a heartbeat from the client agents in however many hours and will consider them lost, and that will result in them trying to reschedule all the workloads, including the CSI plugins. If you want to shut down the whole cluster, you'll definitely need to shut down all the volume-consuming workloads first so that the volume claims get released. You'll also want to avoid the problem described in https://github.com/hashicorp/nomad/issues/8609, because you're shutting down rather than draining off to another client.
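Put together, a controlled nightly shutdown under the constraints above might be ordered like this (a sketch; the job names are the ones from this thread, and `my-app` stands in for each volume-consuming job):

```
# 1. Stop the volume-consuming workloads first so volume claims are released.
nomad job stop my-app

# 2. Only then stop the plugin jobs, so they can detach disks and
#    deregister themselves cleanly with the servers.
nomad job stop plugin-azure-disk-nodes
nomad job stop plugin-azure-disk-controller

# 3. Finally, shut down the client and server agents.
```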

That being said, taking off my Nomad developer hat for a moment: if you're running on cloud infrastructure anyway, it might be worth your while to automate recreating the developer environment from scratch every day. I've done that sort of thing to prevent developers from leaving all kinds of undocumented things lying around development environments that later bite you when you get to production.

Sometimes the plugin stats get strange, with odd numbers such as more nodes expected than nodes present.

We have a couple of open bugs on that (https://github.com/hashicorp/nomad/issues/8948, https://github.com/hashicorp/nomad/issues/8628, https://github.com/hashicorp/nomad/issues/8034). There's also some eventual consistency involved, because it takes a while for plugins to come up: https://github.com/hashicorp/nomad/issues/7296. I've got some time coming up in the next couple of weeks to work on some of those items.

Regarding Azuredisk , you still do not have any documentation. It is very frustrating only seeing documentation for k8s and trying to port everything to nomad. Please add more examples..

Please keep in mind that Nomad's CSI integration is still marked as beta, so you'll need to do a bit of lifting here. There are ~100 or so different CSI drivers, so we won't be able to provide documentation for them all. A few other folks have given it a go; search through the issues for storage and Azure.

Based on our research, we implemented some guardrails in our pipeline to stop the jobs and check whether the disks are detached normally. Having a controlled shutdown and startup did the trick.

My colleagues and I will continue to ask questions and contribute code or documentation so that it gets easier for the next person.
Funny thing: the requests you pointed out (storage and Azure) are mine ;)

Thanks for all your help, you are doing a great job.

Funny thing: the requests you pointed out (storage and Azure) are mine ;)

Ha ha! I totally missed that 😄
