As of 0.7.0, deployments (`update` stanzas with `auto_revert`, `canary`, etc.) are only implemented for service-type jobs. We need to implement auto reverts and other reconciliation features for system jobs. There should be a way to stop a rolling upgrade of a bad system job across the fleet.
/me clicks on issue linked by co-worker, reads issue, sees @schmichael in the activity feed, strokes beard while nodding with approval
👋
I'm so tired of non-transparent updates for system jobs. We really need this feature.
@preetapan @schmichael @dadgar this is something I really want to see, and I'm happy to have a crack at it unless it's already being worked on internally. If it's not, any thoughts, ideas or tips would be greatly appreciated.
@jrasell We want to make several improvements to the system scheduler, including implementing deployments as well as bringing in other improvements that are in the reconciler. This is a fairly large-scoped project, and implementing it will involve a set of non-trivial changes. We are currently targeting this for a future release, likely after Nomad 0.9.0.
@preetapan any update on a timeline for this? I specifically am looking for canary support for system jobs.
@preetapan we've just launched into the world of Nomad and found this issue when deploying our first system job to our cluster. Any update on when we can expect health checking for system jobs?
@preetapan I see Nomad 0.10 was recently released. Any updates on this feature?
@jrasell, will you take this into your own hands now? ;)
We'd also love to see this functionality, so +1 from our end! 👍
Any updates on the status of this functionality?
Or maybe it's already done, but only for the enterprise version?
Nope, it will be OSS!
This _is_ roadmapped. As everyone can probably guess, there's _a lot_ going on in everyone's lives, so a timeline has been very tricky. We're very excited to see the initial PR #8841 from @dubadub and hope to have someone dig into it with them. There are some very tricky aspects to deployments for system jobs that we need to get right to maximize usability and minimize complexity.
For example, if we spun up canaries concurrently with the stable version's allocation on the same node, there would likely be resource conflicts (static ports, host volumes) that block placement or prevent proper functioning. Therefore it seems like system deployments should diverge from service deployments in that canaries should act as _replacements_ instead of _additional capacity._
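To make the static-port conflict concrete, here is a minimal Go sketch. It is not Nomad's scheduler code: `Alloc`, `Node`, `canaryAsAdditionalCapacity` and `canaryAsReplacement` are hypothetical names, and the model only tracks static ports on a single node. It shows why a canary placed *next to* the stable allocation fails, while a canary that *replaces* it can bind the same port.

```go
package main

import "fmt"

// Alloc is a hypothetical, simplified allocation: an ID plus the static
// ports it must bind on the node.
type Alloc struct {
	ID          string
	StaticPorts []int
}

// Node is a hypothetical node model that only tracks which static ports
// are held by which allocation.
type Node struct {
	Name  string
	Ports map[int]string // port -> ID of the alloc holding it
}

// conflicts returns the static ports of a that are already taken on n.
func (n *Node) conflicts(a Alloc) []int {
	var taken []int
	for _, p := range a.StaticPorts {
		if _, ok := n.Ports[p]; ok {
			taken = append(taken, p)
		}
	}
	return taken
}

// place binds all of a's ports on n, or returns an error if any are taken.
func (n *Node) place(a Alloc) error {
	if taken := n.conflicts(a); len(taken) > 0 {
		return fmt.Errorf("cannot place %s on %s: static ports %v in use", a.ID, n.Name, taken)
	}
	for _, p := range a.StaticPorts {
		n.Ports[p] = a.ID
	}
	return nil
}

// stop releases every port held by the allocation with the given ID.
func (n *Node) stop(id string) {
	for p, owner := range n.Ports {
		if owner == id {
			delete(n.Ports, p) // deleting during range is safe in Go
		}
	}
}

// canaryAsAdditionalCapacity mimics service-job canaries: the stable alloc
// keeps running and the canary is placed alongside it.
func canaryAsAdditionalCapacity(n *Node, canary Alloc) error {
	return n.place(canary)
}

// canaryAsReplacement stops the stable alloc first so the canary can bind
// the same static ports.
func canaryAsReplacement(n *Node, stableID string, canary Alloc) error {
	n.stop(stableID)
	return n.place(canary)
}

func main() {
	stable := Alloc{ID: "web-v1", StaticPorts: []int{8080}}
	canary := Alloc{ID: "web-v2-canary", StaticPorts: []int{8080}}

	n := &Node{Name: "node-1", Ports: map[int]string{}}
	if err := n.place(stable); err != nil {
		panic(err)
	}

	// Prints: cannot place web-v2-canary on node-1: static ports [8080] in use
	fmt.Println(canaryAsAdditionalCapacity(n, canary))
	// Prints: <nil>
	fmt.Println(canaryAsReplacement(n, stable.ID, canary))
}
```

The replacement approach has its own cost: once the stable allocation is stopped, a failing canary leaves that node without a healthy copy of the job, so a real design would presumably need something like `auto_revert` to restore the previous version.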
To further complicate matters: as @dubadub discovered in #8841 the code in question could use some refactoring. The layout of Nomad's scheduler has basically never changed, so as you can imagine there are some opportunities for cleanup.
So please keep the use cases coming! The more detailed you can be about the desired behaviors, the better! I know it seems like we're silent sometimes, but we definitely parse and discuss and rehash every word of GitHub comments to ensure we're meeting the desired use case.