We currently have an auto-upgrade mechanism which runs nixos-rebuild switch --upgrade on a systemd timer.
I think it'd be a massive selling point for NixOS if we could do safe auto-upgrades. Namely, give users an option to add in safety checks (e.g., is the httpd returning the correct response to this test request, etc.), which NixOS tries to run after an upgrade, and if any of them fail, it rolls back.
IMO, the lack of this functionality is why auto-upgrading is not a default for Linux boxes in the real world, and NixOS's ability to do clean rollbacks is what can make it possible.
I must admit I'm not particularly hooked - doesn't your monitoring environment already look at this?
@peterhoeg So you think it is better to include some modules that help the monitoring distinguish failures related to upgrades and also start an automatic rollback when an upgrade causes a problem?
I guess in the ideal world some of the checks could be done before finalising the upgrade…
First of all, I'm not at all opposed to the idea of "safe upgrades" (once we agree on what that actually means).
If you care enough about something to want to define checks to deal with upgrades, you probably already care enough to put other things in place to ensure it stays up - the upgrade point is really only one of many.
Maybe you'd prefer to have a hydra instance that does the extra tests in a VM, and you only auto-upgrade after those checks pass. Rollbacks seem more suitable for emergency situations that weren't handled automatically. _EDIT: oh, NixOS is being used in Akamai :-)_
@peterhoeg As you say, there are various changes to a system which could result in something becoming unhealthy, and usually "monitoring tools" are the major component in a general control system that is responsible for repairing things (which usually involves a human in the control loop manually responding).
But that doesn't seem like a convincing argument as to why NixOS itself shouldn't have its own control system to keep machines healthy. Its "health check" could be a matter of asking some external service for health information, or we could configure that information per-service, or ..., but given that NixOS can initiate upgrades, it'd also be nice for it to check that they succeeded.
Giving people actuators but then expecting them to build out the rest of the safety control loop seems like a missed opportunity for a safer system.
I'd love to see auto-upgrades. It's one of the reasons why I'm still running CoreOS Container Linux on most of my production machines.
In regards to the health checks, I'd differentiate between system services and user defined applications. I don't think it is common to manually monitor most system services - it's out of scope for most companies. If we're talking about health checks for auto upgrades, I'd expect the OS to check if it can still boot and if it can still bring up my apps.
@pierrebeaucamp: as written in the OP, there are auto-upgrades already. system.autoUpgrade.enable = true;
(Perhaps I misunderstood you.)
I'm aware of that option, but I don't know how safe it would be to run on some production servers (let alone how well it would play with nixops). I just wanted to express that a safe and built-in upgrade mechanism would be welcomed. I think OP summed it up pretty nicely:
IMO, the lack of this functionality is why auto-upgrading is not a default for Linux boxes in the real world, and NixOS's ability to do clean rollbacks is what can make it possible.
Maybe I should have been more clear in my comment, sorry about that.
It would be nice if we could easily define a script or command to be run at the end of the auto-upgrade, and roll back if it exits with anything but 0.
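A minimal sketch of that idea, assuming a wrapper script around the existing commands rather than any built-in option. The function name safe_upgrade and the REBUILD variable are hypothetical; nixos-rebuild switch --upgrade and nixos-rebuild switch --rollback are the commands that already exist.

```shell
#!/bin/sh
# Hypothetical wrapper, not an existing NixOS feature.
# REBUILD can be overridden (e.g. for dry runs); it defaults to nixos-rebuild.
REBUILD="${REBUILD:-nixos-rebuild}"

# safe_upgrade CHECK_CMD
# Upgrades, runs CHECK_CMD through sh -c, and rolls back if it exits non-zero.
# Returns 0 if the upgrade is kept, 1 if the upgrade itself failed,
# 2 if the check failed and a rollback was attempted.
safe_upgrade() {
  check_cmd="$1"

  "$REBUILD" switch --upgrade || return 1

  if sh -c "$check_cmd"; then
    echo "check passed; keeping the new generation"
    return 0
  fi

  echo "check failed; rolling back" >&2
  "$REBUILD" switch --rollback
  return 2
}
```

It could be called as safe_upgrade 'curl -fsS http://localhost/ >/dev/null', where the URL is only an example. The sketch does not handle the case where the rollback itself fails.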
This "Automatic roll back" post hasn't been mentioned yet: https://groups.google.com/forum/#!topic/nix-devel/Th-544aQ8Jk
Maybe a first step could be having an option when doing a nixos-rebuild that enables rollback unless confirmed by the user afterwards. This is especially useful when ssh access is lost after a bad rebuild. Also, this rollback should persist across reboots (i.e. roll back after a reboot if confirmation is not given).
To be clearer, this could be a scenario like this:
nixos-rebuild switch --auto-rollback-in 5m
reboot # optional
nixos-rollback --abort
Such a feature exists in Cisco IOS software:
configure terminal revert time idle 5 ! rollback if no command is entered during 5 minutes => e.g. ssh access is lost
<configure commands>
exit
configure confirm ! confirm modifications (abort rollback)
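The --auto-rollback-in and nixos-rollback commands above are proposals, not existing tools. As a rough approximation with what exists today, a transient systemd timer can schedule nixos-rebuild switch --rollback and be cancelled on confirmation. This is only a sketch: the unit name pending-rollback and the function names are made up, and the overridable variables exist purely so the commands can be stubbed.

```shell
#!/bin/sh
# Sketch only: approximates the proposed --auto-rollback-in with a transient
# systemd timer. The unit name below is a made-up example.
ROLLBACK_UNIT="${ROLLBACK_UNIT:-pending-rollback}"
SYSTEMD_RUN="${SYSTEMD_RUN:-systemd-run}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
REBUILD="${REBUILD:-nixos-rebuild}"

# arm_rollback DELAY
# Schedules a rollback DELAY (e.g. 5m) from now as a transient timer unit.
arm_rollback() {
  "$SYSTEMD_RUN" --unit="$ROLLBACK_UNIT" --on-active="$1" \
    "$REBUILD" switch --rollback
}

# confirm_upgrade
# Cancels the pending rollback by stopping the transient timer.
confirm_upgrade() {
  "$SYSTEMCTL" stop "$ROLLBACK_UNIT.timer"
}
```

One plausible order is nixos-rebuild build first (the slow part), then arm_rollback 5m, then nixos-rebuild switch, and finally confirm_upgrade once the machine is known to be reachable. Transient timers do not survive a reboot, so this does not cover the reboot-persistent part of the proposal.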
It wouldn't be too hard or controversial to add an --auto-rollback option to nixos-rebuild that rolls back if any of the systemd units fail to start, and then add that as an option to the autoUpgrade module.
Then if you want anything more complicated, add the checks to the specific unit's postStart or create a new unit with your checks.
Note that the rollback might fail as well, so it isn't entirely fool-proof.
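Until such an option exists, the behaviour can be roughly sketched as a wrapper that inspects failed units after the switch. This is an illustration under assumptions, not how nixos-rebuild works internally: the function names are hypothetical, and the check is naive in that a unit which was already failed before the switch also triggers a rollback.

```shell
#!/bin/sh
# Hypothetical stand-in for the proposed --auto-rollback option.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
REBUILD="${REBUILD:-nixos-rebuild}"

# Print the names of units currently in the failed state, one per line.
failed_units() {
  "$SYSTEMCTL" list-units --failed --no-legend --plain | awk '{ print $1 }'
}

# switch_or_rollback [extra nixos-rebuild arguments]
# Switches, then rolls back if any unit is in the failed state.
# Returns 1 after a rollback, otherwise the exit status of the switch.
switch_or_rollback() {
  status=0
  "$REBUILD" switch "$@" || status=$?

  failed="$(failed_units)"
  if [ -n "$failed" ]; then
    echo "failed units after switch:" >&2
    echo "$failed" >&2
    "$REBUILD" switch --rollback
    return 1
  fi
  return "$status"
}
```

A stricter version would record the failed units before the switch and roll back only on new failures.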
@zimbatm That seems like a great way to implement this!
A true safe upgrade would fully virtualize everything. For example, if ZFS is upgraded, I would like a VM to be started that checks whether a program reading every file on the file system still runs correctly. Unless the outcome of this issue is something like that, I don't see the point.
Similarly for changes like the change of naming schemes of networking devices.
If whatever solution you come up with doesn't account for these issues, I would consider it a useless feature.
One can probably add NixOS tests to system.extraDependencies.
Then it is "only" a matter of writing nixos tests representative of a specific workflow.
Would this also work for https://github.com/NixOS/nixpkgs/issues/52644?
In that case, only when firmware is loaded (not sure whether that's only at boot), it can be determined whether a new configuration works.
Also, by the time a device has lost connectivity, you actually need to physically go to the device in certain environments. Remote management is not available nor practical in all environments.
I'd like to mention another scenario here. On Debian I can modify my apache configuration files and then run a quick apachectl configtest to ensure I haven't made any mistakes before running systemctl reload apache2.service. This process is not as simple on NixOS (and straight up painful on NixOps) in that I have to make my configuration change (for example: services.httpd.extraConfig = "bad config which will break apache";), run nixos-rebuild build, get the name of the new apache configuration file, then run apachectl configtest new-config-filename and hope the changes to my system aren't dramatic enough to skew the result of apachectl.
So can the definition of a "safe" upgrade include simple file configuration change checks? Obviously it would be very valuable to prevent a system from switching to a new generation automatically if we can determine the new generation is not valid.
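The manual build-then-check workflow described above can be sketched as a gate in front of the switch. This is an assumption-laden illustration: gated_switch and its VALIDATE_CMD argument are hypothetical, and the validation command itself (for example an apachectl configtest run against the new configuration file) is left to the caller, since the sketch does not know where that file lives in the build output.

```shell
#!/bin/sh
# Hypothetical gate: build, validate, and only then switch.
REBUILD="${REBUILD:-nixos-rebuild}"

# gated_switch VALIDATE_CMD
# Returns 1 if the build fails, 2 if validation fails (nothing is switched),
# otherwise the exit status of the switch.
gated_switch() {
  validate_cmd="$1"

  # Build only; nothing is activated at this point.
  "$REBUILD" build || return 1

  if sh -c "$validate_cmd"; then
    "$REBUILD" switch
  else
    echo "validation failed; not switching" >&2
    return 2
  fi
}
```

Because nixos-rebuild build leaves a result symlink in the current directory, VALIDATE_CMD can inspect the new system there before anything is activated.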
One way is to move as many checks as possible to build time. I made a preliminary PR for the httpd one ^^
Thank you for your contributions.
This has been automatically marked as stale because it has had no activity for 180 days.
If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.
Definitely still interested in this, but it requires someone's architecting time.