Per a topic on Discuss - I've run into a few situations in our setup where it would be absolutely beneficial to use variable interpolation in the volume and host_volume stanzas. As an example, we currently set up various cloud instances with their own mounted volumes that belong to a particular instance. On these instances we like to run system jobs, so that every instance has one - but the problem is that there is currently no easy way to automatically mount the volume that "belongs" to a particular node.
In essence, being able to do something like...
volume "persistent-data" {
type = "csi"
source = "pd-${node.attr.id}"
read_only = false
}
Would be most excellent. At that point each system job picks up its own volume automatically, which makes my life easier (and saves on job definitions, or the single job with 40+ task groups we currently use to get it to work properly).
Outside the scope of this feature request (but maybe related) would be a change to host volumes - currently we have a few servers that use host volumes, but the same volume is shared between multiple jobs because we can't mount a subdirectory out of a host volume - perhaps a change to allow something like:
volume "subdirectory-mounted-host-volume" {
type = "host"
source = "my-host-volume/${env "nomad_job_name"}"
read_only = false
}
So that each job gets itself a named volume to be used with volume_mount that points to a subdirectory of the actual host volume. Currently if we want to achieve this we need to add new host volumes to the client configuration and restart the nomad process on the node, which isn't something we're very keen on. (Alternatively having new host volumes picked up with a reload, if it isn't already, would be nice too).
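For reference, this is roughly what the client-side workaround looks like for us today (paths and names are purely illustrative) - every new subdirectory means another host_volume block in the agent config and a restart:

client {
  enabled = true

  # one host_volume block per job that needs its own subdirectory
  host_volume "my-host-volume-job-a" {
    path      = "/srv/my-host-volume/job-a"
    read_only = false
  }

  host_volume "my-host-volume-job-b" {
    path      = "/srv/my-host-volume/job-b"
    read_only = false
  }
}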
The proposal would probably entail some tweaks to the order in which things are resolved and set up, but would (again) make my life easier.
Can't speak for anyone else though so maybe I'm just the only one with this particular use case :D
Hi @benvanstaveren! Thanks for opening this issue and giving a solid description of a use case. Will get this onto the road map.
Outside the scope of this feature request (but maybe related) would be a change to host volumes - currently we have a few servers that use host volumes, but the same volume is shared between multiple joibs because we can't mount a subdirectory out of a host volume - perhaps a change to allow something like:
Yup! Looks like we've had a few requests for this feature: https://github.com/hashicorp/nomad/issues/7110 https://github.com/hashicorp/nomad/issues/6536
cross referencing https://github.com/hashicorp/nomad/issues/8262
a huge +1, as this is blocking our progress with CSI. In particular, we'd need the Nomad job environment variables e.g.
...
group "group" {
volume "group_csi_volume" {
type = "csi"
read_only = false
source = "csi-test-${NOMAD_ALLOC_INDEX}"
}
...
Waiting for this badly 👍
It would be super cool if we could simply do this with our nomad jobs:
parameterized {
  meta_required = ["volume"]
}

volume "efs" {
  type   = "csi"
  source = "$NOMAD_META_volume"
}
Wanted to follow up on this just to make sure folks don't think we're ignoring this important issue. This feature request seems like a single feature on the surface, but there are actually two parts to it:
1) Interpolating the volume block with information that's only available _after_ placement on the host. This isn't going to be easily done (if at all), because we use the volume.source at the time of scheduling in order to determine where the volume can go. So there's a chicken-and-the-egg situation where the scheduler would not have enough information to place if we allowed the volume.source to include things like allocation index or client metadata. There may be improvements we can do here, but this is non-trivial. (I'm inclined to consider that as a larger Nomad issue than this feature request, because it impacts the entire scheduler.)
2) Interpolating the volume block's HCL with information that can be used for scheduling. We already can do this with the HCL2 feature in Nomad 1.0!
Here's an example of a job that consumes CSI volumes. The commented-out bits are how the job originally worked with a static CSI volume, and the dynamic blocks are the replacements that let us interpolate the source with data that's available at job submission time. (We're still working on getting better documentation for all the HCL2 stuff, but we should have it in good shape by the time we ship 1.0 GA.)
variables {
  volume_ids = ["volume0"]
}

job "example" {
  datacenters = ["dc1"]

  group "cache" {

    dynamic "volume" {
      for_each = var.volume_ids
      labels   = [volume.value]

      content {
        type      = "csi"
        source    = "test-${volume.value}"
        read_only = true
      }
    }

    # volume "volume0" {
    #   type      = "csi"
    #   source    = "test-volume0"
    #   read_only = true
    #
    #   mount_options {
    #     fs_type = "ext4"
    #   }
    # }

    count = 2

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      dynamic "volume_mount" {
        for_each = var.volume_ids

        content {
          volume      = "${volume_mount.value}"
          destination = "${NOMAD_ALLOC_DIR}/${volume_mount.value}"
        }
      }

      # volume_mount {
      #   volume      = "volume0"
      #   destination = "${NOMAD_ALLOC_DIR}/test"
      # }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
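As a side note, the volume_ids variable doesn't have to be hardcoded in the jobspec: HCL2 variables can also be supplied at job submission time via -var or a var file, so each deployment can point at a different set of volumes without editing the job. A minimal sketch (filename and values are just illustrative):

# volumes.vars.hcl
volume_ids = ["volume0", "volume1"]

and then something like nomad job run -var-file=volumes.vars.hcl example.nomad.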
cc @notnoop to point out a working case of dynamic for our docs.
cc @galeep @yishan-lin so that internal/customer discussions of this feature request are using up-to-date information.
Example of using HCL2 to accomplish this is in https://github.com/hashicorp/nomad/pull/9449
Thanks a lot for the update @tgross.
This issue is still blocking adoption of CSI in our stack. A few comments:
Interpolating the volume block with information that's only available after placement on the host. This isn't going to be easily done (if at all), because we use the volume.source at the time of scheduling in order to determine where the volume can go. So there's a chicken-and-the-egg situation where the scheduler would not have enough information to place if we allowed the volume.source to include things like allocation index or client metadata. There may be improvements we can do here, but this is non-trivial. (I'm inclined to consider that as a larger Nomad issue than this feature request, because it impacts the entire scheduler.)
I certainly see that 'allocation index' is insufficient to use for this purpose. You actually just want the container to be able to interpolate 'CSI volume index' - because really, the indexing should be of the persistent objects, not at the allocation level.
So perhaps we could specify that the 'count' of the group can be used to reference volumes, through something like a 'volume group':
volume_group "redis-volumes" {
type = "csi"
source = "test-${count.index}"
read_only = true
mount_options {
fs_type = "ext4"
}
}
and then we'd do something like
task "redis" {
driver = "docker"
...
volume_mount {
volume_group = "redis-volumes"
destination = "${NOMAD_ALLOC_DIR}/test"
}
...
env {
MY_ID = "${NOMAD_VOLUME_GROUP_redis_volumes_ID}" // internally calculated, persistent per volume
#MY_ID = "${NOMAD_ALLOC_INDEX}" // not neeeded anymore.
}
How does that sound?
This issue is still blocking adoption of CSI in our stack.
Sorry to hear that. Could you dig into that a bit more? I want to make sure this isn't a matter of having written an incomplete HCL2 example instead of something we can't work around for you.
Effectively there's a gap in the job specification:
1) A volume block can define a single volume that's used by a single allocation. Ex. AWS EBS or Azure Block Store.
2) A volume block can define a single volume that's shared between multiple allocations. Ex. an NFS volume.
3) A volume block _cannot_ currently but _should_ allow defining multiple independent volumes, each of which is mapped to one of the job's allocations.

The HCL2 example in #9449 effectively tries to turn case (3) into case (1).
So perhaps if we could specify that the 'count' of the group can used to reference volumes, through something like a 'volume group'
I think you're on the right track here in terms of the job spec. Arguably, volumes aren't really group-level resources once you have CSI in the mix -- they're cluster-level resources that are being consumed by the job. And then the volume block is a reference to that cluster-level resource. But we could do everything we're talking about with volume_group today in volume and still run into the issue of not having any kind of count information available for mapping to allocations until after we've created a plan.
I think this all comes down to wanting to create the count of volumes in the scheduler where the plan is being created. Adding the count into the planning stage opens up a bunch of questions around updates: what happens when we reschedule an allocation, or have canaries, or have rolling updates? Probably not intractable but there's a lot of small details we'll need to get right.
Also we haven't yet implemented Volume Creation https://github.com/hashicorp/nomad/issues/8212, which I'm just starting design work on. If we want to be able to support some kind of count/index attribute, we'll need to handle counts at volume creation time too (and I haven't figured out what stage that happens in yet).
Our process around design work like this is usually to write up an RFC which gets reviewed within the team and then engineering-wide. I'm going to push for making this something we can share with the community for this issue and #8212 so that we can get some feedback.
Thanks for that @tgross.
I'm a bit short of time, so I'll just ramble a bit.
First, let's address the use case:
1) Need for a fleet of containers with persistent storage per-container.
e.g. a 3-node Redis cluster of AWS ECS containers (1 master, 2 replicas), each with a persistent EBS volume. If one of the containers goes down, we'll need to start up a new container with the same volume mounted (so it must be in the same AZ as the volume of the downed instance - CSI topology constraint).
Let's assume for now that we have the CSI volume create functionality.
Just to abstractly use the spec I laid out before - with more detail. We need to indicate that we want 3 (volume-container) pairs that intersect a set of constraints.
First, at the group level, we set up the (volume- half of the joint constraints:
group "redis-fleet" {
count = 3
volume_group "redis-volumes" {
type = "csi-ebs"
read_only = true
size = 10gb
}
Then, also at the group level, we set up the -container) half of the constraints
task "redis" {
driver = "docker"
...
volume_mount {
volume_group = "redis-volumes"
destination = "${NOMAD_ALLOC_DIR}/test"
}
...
env {
MY_ID = "${NOMAD_VOLUME_GROUP_redis_volumes_ID}" // internally calculated, persistent with the volume
}
The intersection of the two scheduling constraints is done by the 'volume_mount' stanza in the task. I.e. it's saying that when we schedule a container from this task, we also need to schedule (or satisfy) the placement constraints of the volume_group simultaneously.
If there was another task in this group besides "redis" we could also attach it to the same 'volume group' and have both containers (scheduled on the same host because they are in the same 'group') access the shared volume.
++ My general understanding of the syntax elements inside the 'group' stanza is that 'you get one of these per count' for all of the stanzas subordinate to 'group'. Which is why I feel that the way 'volume' is used now is confusing. I would call 'volume' as it is used now 'common_volume', meaning that, despite what 'count' says, you are only going to get one of these. I almost feel that the default behavior should be 'volume_group' as above.
Other use cases:
2) Scaling up the count.
This is relatively straightforward. There needs to be some care about scale-down-then-scale-up, i.e. do you want to reuse the previously initialized volumes, or do they get cleaned up?
3) Canary
There should be a way to specify whether you want the canary container to use the existing persistent volume (i.e. get scheduled alongside the existing deployment), or be scheduled onto new persistent volumes (or temporary volumes?).