The documentation for NOMAD_ALLOC_INDEX at https://www.nomadproject.io/docs/runtime/environment/ states:
Allocation index; useful to distinguish instances of task groups. From 0 to (count - 1). The index is unique within a given version of a job, but canaries or failed tasks in a deployment may reuse the index.
However, it makes no statement about what happens when a job is updated from version N to N+1, or migrated. Without such guarantees, it's not clear whether this variable can be used as a stable index when creating stateful, HA jobs. For example, given three EBS volumes mysql-0, mysql-1, and mysql-2 that can be mounted via CSI plugins, it's not clear whether an upgrade will always stop alloc-0 at version N and replace it with alloc-0 at version N+1 (which would allow EBS volume mysql-0 to be remounted on the new alloc), or whether it could stop alloc-2 at version N and replace it with alloc-0 at version N+1 (which would be unable to schedule, since mysql-0 is still mounted by the still-running version-N alloc-0).
Two questions:
1) Is this env var intended to be used in this way? If not, is there a suggested way to identify stateful constraints like this? One idea is to use dedicated machines that have pre-existing knowledge of their shard/replica index via userdata, but the obvious downside is that Nomad no longer has any flexibility in scheduling these jobs.
2) Can the documentation be updated to include the guarantees (or lack thereof) about how this env var behaves with respect to updates and migrations?
Unfortunately, there are no guarantees that indexes will be unique: https://github.com/hashicorp/nomad/issues/6829#issuecomment-564035797, though it would be useful and convenient.
Hi @djmarcin!
Is this env var intended to be used in this way? If not, is there a suggested way to identify stateful constraints like this?
No it's not. As I mentioned in the comment that @pznamensky linked to:
If you need a unique number for interpolating a value in your application, you can get this by combining the job version and the alloc index (which together should be unique).
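That suggestion could be sketched as a jobspec fragment like the one below. This is illustrative only: NOMAD_JOB_VERSION and NOMAD_ALLOC_INDEX are the runtime variables discussed above, but the file name and variable name are assumptions.

```hcl
# Sketch: render a unique per-allocation identifier by combining the
# job version with the alloc index. Both values are exposed in the
# task environment by Nomad; consul-template's `env` function reads
# them at render time.
template {
  data        = <<EOF
UNIQUE_ID={{ env "NOMAD_JOB_VERSION" }}-{{ env "NOMAD_ALLOC_INDEX" }}
EOF
  destination = "local/unique.env"
  env         = true # export UNIQUE_ID into the task's environment
}
```

With this in place the task can read UNIQUE_ID from its environment, e.g. to name a data directory or replica.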
For your specific use case with multiple EBS volumes where you need to uniquely identify each volume, you might want to consider not having all three tasks in the same task group. That way you'd have a 1:1 pairing between the alloc and the volume that goes with it.
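One hedged sketch of that layout, with one group per volume; the volume names mysql-0 through mysql-2 are assumed to be pre-registered CSI volumes, and the driver/image details are omitted:

```hcl
job "mysql" {
  # One group per volume gives a fixed 1:1 alloc-to-volume pairing.
  group "mysql-0" {
    count = 1

    volume "data" {
      type      = "csi"
      source    = "mysql-0" # pre-registered CSI volume ID (assumed)
      read_only = false
    }

    task "mysqld" {
      driver = "docker"

      volume_mount {
        volume      = "data"
        destination = "/var/lib/mysql"
      }
      # image and config omitted
    }
  }

  # groups "mysql-1" and "mysql-2" repeat the pattern, each with
  # its own source volume
}
```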
However it does not make any statement about what happens when a job is updated from version N to N+1, or migrated.
...
Can the documentation be updated to include the guarantees (or lack thereof) about how this env var behaves with respect to updates and migrations?
What this comes down to is whether the updates are handled in order, but there aren't any guarantees on this ordering. (The reason for that is that while the final allocation update placements are linearized, Nomad handles each update as a separate eval and processes those evals concurrently.)
This definitely seems like a good place for documentation improvements.
Also, it looks like I never directly answered @pznamensky's follow-up question in that thread (😊 sorry!), which is "what's the point of NOMAD_ALLOC_INDEX then?" Note that it's unique when combined with the job version, but the job version by itself wouldn't be useful enough to identify the alloc index. So rather than having a monotonically increasing alloc index across all job versions, the operator can decide whether they care about allocs within a job version (NOMAD_ALLOC_INDEX) or allocs across all job versions (NOMAD_JOB_VERSION + NOMAD_ALLOC_INDEX), without Nomad having to keep track of two different monotonically increasing indexes.
Thanks for the detailed reply, @tgross, that all makes sense.
For your specific use case with multiple EBS volumes where you need to uniquely identify each volume, you might want to consider not having all three tasks in the same task group. That way you'd have a 1:1 pairing between the alloc and the volume that goes with it.
Are you saying that you would consider 3 separate groups within a single job, or three separate jobs? As far as I currently understand, both of those have serious problems.
The problem with 3 groups within a single job is that updates execute in parallel, which would take down all replicas of the database simultaneously.
On the other hand, three separate jobs would mean orchestrating the deployment of these jobs outside Nomad, and losing the ability to enforce constraints like distinct_hosts.
I appreciate that stateful HA services are a niche issue with a lot of service-specific edge cases, though. Dedicated machines with static metadata seem like a viable workaround for now. I'll leave the issue open for docs follow-up.
The problem with 3 groups within a single job is that updates execute in parallel, which would take down all replicas of the database simultaneously.
That's what I'd been thinking too, but that's a good point. If they're all in one group, the update stanza gives you most of the features you want here, just not the pinning of a task to a specific volume.
On the other hand, three separate jobs would mean orchestrating the deployment of these jobs outside Nomad, and losing the ability to enforce constraints like distinct_hosts.
Right. That's not great either.
I appreciate that stateful HA services are a niche issue with a lot of service-specific edge cases, though.
True, but it's one we definitely want to support! I think the interaction between multiple volumes and updates is something that we've under-designed. I've written up a new issue for the team to explore this further: https://github.com/hashicorp/nomad/issues/8058