Describe the bug
On switching, NixOS stops the nix-daemon, then parts in the "nix" snippet of the activation script fail, then it starts the nix-daemon again.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
nix-daemon updates are handled in a more graceful fashion.
Metadata
- system: `"x86_64-linux"`
- host os: `Linux 5.3.7, NixOS, 20.03.git.64eab81 (Markhor)`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.3.1`
Output:
stopping the following units: NetworkManager-wait-online.service, NetworkManager.service, accounts-daemon.service, alsa-store.service, audit.service, avahi-daemon.service, avahi-daemon.socket, bluetooth.service, colord.service, cups-browsed.service, cups.service, cups.socket, docker-prune.timer, home-manager-flokli.service, kmod-static-nodes.service, network-link-vboxnet0.service, network-local-commands.service, nix-daemon.service, nix-daemon.socket, nix-gc.timer, nscd.service, powertop.service, rngd.service, rtkit-daemon.service, systemd-binfmt.service, systemd-machined.service, systemd-modules-load.service, systemd-networkd-wait-online.service, systemd-networkd.service, systemd-resolved.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-udevd.service, tlp.service, udisks2.service, upower.service, vboxnet0.service, wpa_supplicant.service
NOT restarting the following changed units: display-manager.service, [email protected], libvirt-guests.service, libvirtd.service, systemd-backlight@backlight:intel_backlight.service, systemd-backlight@leds:dell::kbd_backlight.service, systemd-fsck@dev-disk-by\x2duuid-027E\x2d4751.service, systemd-journal-flush.service, systemd-logind.service, systemd-random-seed.service, systemd-remount-fs.service, systemd-tmpfiles-setup.service, systemd-udev-settle.service, systemd-update-utmp.service, systemd-user-sessions.service, [email protected], [email protected]
activating the configuration...
setting up /etc...
error: cannot connect to daemon at '/nix/var/nix/daemon-socket/socket': Connection refused
Activation script snippet 'nix' failed (1)
restarting systemd...
reloading user units for flokli...
setting up tmpfiles
reloading the following units: dbus.service, dev-hugepages.mount, dev-mqueue.mount, sys-fs-fuse-connections.mount, sys-kernel-debug.mount, tmp.mount
restarting the following units: polkit.service, sshd.service, systemd-journald.service
starting the following units: NetworkManager-wait-online.service, NetworkManager.service, accounts-daemon.service, alsa-store.service, audit.service, avahi-daemon.socket, bluetooth.service, colord.service, cups-browsed.service, cups.socket, docker-prune.timer, home-manager-flokli.service, kmod-static-nodes.service, network-link-vboxnet0.service, network-local-commands.service, nix-daemon.socket, nix-gc.timer, nscd.service, powertop.service, rngd.service, rtkit-daemon.service, systemd-binfmt.service, systemd-machined.service, systemd-modules-load.service, systemd-networkd-wait-online.service, systemd-networkd.service, systemd-resolved.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, tlp.service, udisks2.service, upower.service, vboxnet0.service, wpa_supplicant.service
the following new units were started: docker.service, docker.socket, var-lib-docker-btrfs.mount
warning: error(s) occurred while switching to the new configuration
If I reboot after this error, will the system be ok then? Is it just that the switching of the running system somehow doesn't work properly but if you reboot afterwards, it works fine? Or is there something broken in the system if I see this error and rebooting doesn't help?
I found that this error usually goes away after re-running the rebuild command.
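As a stopgap, the "re-run it" workaround above could be wrapped in a small helper. This is only a sketch: `run_with_retry` is a hypothetical name, not part of NixOS, and it does not address the underlying ordering problem.

```shell
# Hypothetical helper (not part of NixOS): run a command and, if it fails,
# run it exactly one more time. The exit status is that of the last attempt.
run_with_retry() {
    if "$@"; then
        return 0
    fi
    echo "first attempt failed, retrying once..." >&2
    "$@"
}
```

It would be invoked as `run_with_retry nixos-rebuild switch`, assuming the second run succeeds as described above.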
This now also affects 19.09 (stable release). I think https://github.com/NixOS/nixpkgs/pull/76785 is the cause.
- system: `"x86_64-linux"`
- host os: `Linux 5.3.13, NixOS, 19.09.1748.ad1e1af5ad3 (Loris)`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.3.1`
"nixos-19.09.1748.ad1e1af5ad3, nixos-unstable-20.03pre202088.e89b21504f3"
"home-manager-19.09"
/nix/var/nix/profiles/per-user/root/channels/nixos
I think it happens during large upgrades during which nix is upgraded. It would make sense to not restart nix mid-script but rather do it at the end, if that's what's causing it.
The `${nix}/bin/nix ping-store --no-net` call within the activation script should probably be changed to `${nix}/bin/nix ping-store --no-net --store local`. That tells nix to just open `/nix` directly, rather than reaching out to a nix-daemon to get things done.
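A minimal sketch of that suggestion as a standalone shell function. The function name `ping_store_local` and the `NIX_BIN` variable are assumptions for illustration; in the real activation script the binary path comes from the `${nix}` interpolation.

```shell
# Hypothetical wrapper: ping the store with local access instead of going
# through the daemon socket. NIX_BIN is an assumed override; default is `nix`.
ping_store_local() {
    # --store local opens /nix directly, so a stopped nix-daemon does not matter.
    "${NIX_BIN:-nix}" ping-store --no-net --store local
}
```

Note that this only helps the activation snippet itself; it needs to run as a user allowed to open the store directly (root, during activation).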
@cleverca22 this will still fail parts of activation if you restart services somehow interacting with the nix store. An alternative would be to restart the nix daemon if it has changed before doing that for all other units.
Do we have a fix somewhere? It is happening in prod and breaking our deploys on 19.09.
I am happy to write a PR, I am just not sure I understand what to touch.
@DianaOlympos I assume `nixos/modules/system/activation/switch-to-configuration.pl` needs to be updated to restart `nix-daemon.service` (if necessary), then restart the rest of the services.
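The ordering idea can be sketched in shell, though the real script is Perl and this is not its implementation. `order_units` is a hypothetical helper that only reorders a list of unit names so that `nix-daemon.service` comes first.

```shell
# Hypothetical helper: print the given units one per line, with
# nix-daemon.service first (if present) and the rest in their original order.
order_units() {
    first=""
    rest=""
    for unit in "$@"; do
        if [ "$unit" = "nix-daemon.service" ]; then
            first="$unit"
        else
            rest="${rest}${unit}
"
        fi
    done
    if [ -n "$first" ]; then
        printf '%s\n' "$first"
    fi
    printf '%s' "$rest"
}
```

A caller could then restart units in the printed order, for example by piping the output into a loop around `systemctl restart`.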
Oh my, I will really write some Perl. Not sure I am the best person for this one :smile: Especially the switch script, which I had problems reading before. OK, I will try to have a look if no one else can.
It does affect 19.09, so I think it is even worse.
This will stop the previously running `nix-daemon.service`, but the activation phase needs it. So I am not 100% sure what to do here. I can't restart the daemon because I am not activated yet, right? Or should we filter `nix-daemon` out of the stop list and then restart it at the end? But then we may have a `nix` version that is not the one used by the `nix-daemon`.
Or I missed something.
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/help-wanted-updating-nix-as-part-of-nixos-rebuild-switch/5785/1
Facing same issue on 19.09 :-(
Still no fix/crunch?
https://discourse.nixos.org/t/help-wanted-updating-nix-as-part-of-nixos-rebuild-switch/5785/2?u=dianaolympos
This is the best we have from @mkg20001, but I have neither the time nor the brain power to do it right now.
If someone wants to do it and push it into 20.03, that would be nice. It will not solve the problem we face when going from 19.09 as released (AMI) to 19.09 current stable, but at least it would provide a path forward.
I tried disabling restarting/reloading for both services, but it did not help (I use NixOps to deploy 19.09 to AWS EC2 running the NixOS 19.09 AMI):
```nix
/*
systemd.services.nix = {
  reloadIfChanged = false;
  restartIfChanged = false;
  stopIfChanged = false;
};
systemd.services.nix-daemon = {
  reloadIfChanged = false;
  restartIfChanged = false;
  stopIfChanged = false;
};
*/
```
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/spurious-errors-while-rebuilding/4782/2
Fixed in #87182.
IIUC, this fix is only on master, not on the 20.03 branch? Just want to confirm since I'm still seeing this issue on systems running 20.03.
I've just submitted https://github.com/NixOS/nixpkgs/pull/89191 which hopefully backports the relevant fixes from #87182, so as to fix this on 20.03 without breaking backwards-compatibility on the API of nixos-install.
Feel free to test and confirm this works as intended! Reopening as missing backport for the time being :)