Nixpkgs: Request for fail-safe rollback feature

Created on 27 Jul 2019 · 13 Comments · Source: NixOS/nixpkgs

I would be happy if there were some official/upstreamed functionality for timed automatic rollback, e.g. if an ssh connection fails due to misconfiguration. This would mesh well with NixOS's friendliness towards "safe changes", i.e. you can "undo" things.

I didn't find a separate issue for this yet. It's pretty much covered by the discussion in https://github.com/NixOS/nixpkgs/issues/34902 but may be a simpler starting point.

Highlights:
https://github.com/NixOS/nixpkgs/issues/34902#issuecomment-366294622
https://github.com/NixOS/nixpkgs/issues/34902#issuecomment-438337876

This would probably be covered by existing rollback scripts; I just want something officially endorsed. My impression is that other people also want this, it just hasn't been done.

enhancement nixos

All 13 comments

Systemd recently got support for automatic rollbacks when certain conditions aren't met. Perhaps we can build on top of that.

I'd be fine with the initial implementation just requiring the user to run a command to deactivate the reset.

Boot-related things seem more relevant to https://github.com/NixOS/nixpkgs/issues/26332 ? (Maybe this needs a per-bootloader aggregation issue.) Interesting in any case, though.

Or am I misunderstanding the intentions?

Or perhaps I was unclear? I meant for nixos-rebuild switch, not for reboots.

The first step is to define criteria of when a switch is successful and when it is not. Probably not a difficult task, but needs some consideration.

I'm imagining it with the user deciding whether it was successful by turning off the reset timer instance.

The user could declare a healthcheck process that gets executed on boot. It could just be sleep 300 && curl -sfL http://localhost:8080 for example. If the healthcheck fails, the system would roll back.

That type of approach is used in Kubernetes to determine if a container is healthy or needs to be replaced.
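
As a rough illustration of the healthcheck idea above, here is a minimal Python sketch. It is not an existing NixOS feature: the function names (run_healthcheck, check_or_rollback), the parameters, and the default timings are all hypothetical. It waits for a grace period, runs a user-supplied command, and calls a rollback callback if the command fails or times out.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_healthcheck(command: Sequence[str], timeout_s: float) -> bool:
    """Return True if the command exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(command, timeout=timeout_s, check=False)
    except (subprocess.TimeoutExpired, OSError):
        # A hung check or a missing binary counts as an unhealthy system.
        return False
    return result.returncode == 0


def check_or_rollback(
    command: Sequence[str],
    rollback: Callable[[], None],
    grace_period_s: float = 300.0,  # assumption: mirrors the "sleep 300" above
    timeout_s: float = 30.0,        # assumption: arbitrary upper bound per check
) -> bool:
    """Wait, run the healthcheck once, and roll back if it fails.

    Returns True if the system was judged healthy, False if rollback() ran.
    """
    time.sleep(grace_period_s)
    if run_healthcheck(command, timeout_s):
        return True
    rollback()
    return False
```

For the example from the comment, command could be ["curl", "-sfL", "http://localhost:8080"], and rollback could be a function that invokes nixos-rebuild switch --rollback. Keeping the rollback as a callback means the check logic can be exercised without touching the real system.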

I was thinking you were intending something that could actually tell right away if the switch was successful or not. My mistake.

https://www.youtube.com/watch?v=J4DgATIjx9E talk by @basvandijk on how LumiGuide does this.

@lheckemann Yeah, I wrote that a while ago. One of the main benefits is that the decision of whether a switch was "good" is made by humans. A human has to tell the system that the upgrade was successful, otherwise it will automatically roll back after 20 minutes. This has saved us from being locked out of our systems quite a few times.

Some serious downsides to our system have prevented us from releasing it properly:

  • Specific to our systems, requires some effort to migrate code to a more generic implementation
  • It depends on nixos switch version numbers, which sometimes go funny ways when you're going back and forth between versions
  • Rollback is done with nixos-rebuild switch --rollback. It would be more stable to switch using nixos-rebuild test and just reboot after 20 minutes, since switching configurations puts the device in a state "between" configurations. By that I mean services that aren't restarted, and a very specific Hetzner issue where switching would just kill the network under some circumstances. We had had that problem for quite a while, but it went away for us. I think @nh2 might still suffer from this problem.
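
The human-confirmation scheme described in this comment amounts to a dead-man's switch, which can be sketched roughly as follows. This is not LumiGuide's code: the class name RollbackTimer, its methods, and the injectable clock are hypothetical and chosen only for illustration. The 20-minute default window is taken from the comment.

```python
import time
from typing import Callable


class RollbackTimer:
    """Dead-man's switch: roll back unless an operator confirms in time."""

    def __init__(
        self,
        rollback: Callable[[], None],
        window_s: float = 20 * 60,  # 20 minutes, as in the comment above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rollback = rollback
        self._clock = clock
        self._deadline = clock() + window_s
        self._state = "pending"  # "pending", "confirmed" or "rolled_back"

    def confirm(self) -> bool:
        """Operator marks the switch as good. Returns False if it is too late."""
        if self._state == "pending" and self._clock() < self._deadline:
            self._state = "confirmed"
        return self._state == "confirmed"

    def poll(self) -> str:
        """Call periodically; runs the rollback once when the deadline passes."""
        if self._state == "pending" and self._clock() >= self._deadline:
            self._state = "rolled_back"
            self._rollback()
        return self._state
```

The rollback callback is deliberately left open. It could call nixos-rebuild switch --rollback, or, following the last point in the list, the new configuration could be activated with nixos-rebuild test and the callback could simply reboot the machine.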

https://github.com/Infinisil/nixus looks like it has some sort of related functionality.

That looks pretty clean actually, better than the systems services and timers stuff I had written 😅
