There are many tests in NixOS that test some functionality, but only a few of them test activating or deactivating that functionality. Because of that, NixOS tends to be a "don't forget to reboot the machine after nixos-rebuild switch" distribution.
Specifically, the avahi service works, but enabling it doesn't: https://github.com/NixOS/nixpkgs/issues/19034 . A fix has been proposed (https://github.com/NixOS/nixpkgs/pull/20871), but I think there should be an automated way to test enabling/disabling/upgrading a service.
I tried some time ago to write such a test (for https://github.com/NixOS/nixpkgs/pull/15815), but was hit by the immutable /nix/store in the test machine, so I have no idea how to do that properly.
cc @kampfschlaefer for suggestions
(Of course, not everything can be enabled/disabled on switch. Perhaps this can be documented as a test too.)
Hey @danbst, what about https://nixos.org/nixos/manual/index.html#sec-writing-nixos-tests where it says:
virtualisation.writableStore
By default, the Nix store in the VM is not writable. If you enable this option, a writable union file system is mounted on top of the Nix store to make it appear writable. This is necessary for tests that run Nix operations that modify the store.
Isn't that what you are looking for in terms of a writable store?
Do you have your trials for this kind of test available somewhere?
Thanks for the pointer. Looks like I skipped a large part of the documentation. Also, looking at the firewall test, I now understand that my intent can be implemented. But first, let's consider two cases:
1. Ship /etc/nixos/configuration.nix into the VM and apply changes with $machine->succeed("NIXOS_CONFIG=/etc/nixos/configuration.nix nixos-rebuild switch"). Here virtualisation.writableStore should come in handy.
2. Build several configurations up front and run switch-to-configuration instead of nixos-rebuild switch. This is very much the same, because we care mostly about the activation step.
So, the first case is closer to the user experience, but hard to implement (how would you know what the configuration.nix content is in the virtual environment? And how would you express the change?)
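A minimal sketch of the first case, assuming the new configuration is shipped into the VM via environment.etc (the file contents and the nat option are purely illustrative):

```nix
let
  # Hypothetical target configuration, baked into the image so that
  # nixos-rebuild inside the VM can pick it up via NIXOS_CONFIG.
  newConfig = builtins.toFile "configuration.nix" ''
    { ... }: { networking.nat.enable = true; }
  '';
in import <nixpkgs/nixos/tests/make-test.nix> {
  nodes.machine = {
    # Mount a writable union file system over /nix/store, as the
    # manual section quoted above describes.
    virtualisation.writableStore = true;
    environment.etc."nixos/configuration.nix".source = newConfig;
  };
  testScript = ''
    $machine->waitForUnit("default.target");
    # The hard part: this rebuild only succeeds if the VM can evaluate
    # and build the new system from the store paths it has available.
    $machine->succeed("NIXOS_CONFIG=/etc/nixos/configuration.nix nixos-rebuild switch");
  '';
}
```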
As for the second case, it is doable. Here is an example that shows the problem with networking.nat:
let
  execSwitchTo = cfg: "${cfg.config.system.build.toplevel}/bin/switch-to-configuration test";
in import <nixpkgs/nixos/tests/make-test.nix> rec {
  nodes.machine = {
    networking.firewall.enable = false;
  };
  nodes.enableNat = {
    networking.firewall.enable = false;
    networking.nat.enable = true;
  };
  testScript = { nodes, ... }: ''
    $machine->start;
    $machine->waitForUnit("default.target");

    # This shows that NAT is started automatically after boot
    $enableNat->start;
    $enableNat->waitForUnit("default.target");
    $enableNat->succeed("systemctl status nat");

    # This shows that NAT isn't started after nixos-rebuild switch; it
    # has to be started manually
    $machine->succeed("${execSwitchTo nodes.enableNat}");
    $machine->waitForUnit("default.target");
    $machine->fail("systemctl status nat");
    $machine->succeed("systemctl start nat");
    $machine->succeed("systemctl status nat");
  '';
}
But it looks hacky.
Regarding your second solution: I don't think it's as hacky as you think. There is no difference between switching to a modified version of your current profile and switching to a completely different profile. Both are treated as a new profile, and nix/nixos only cares about the changes as a convenience, to avoid restarting the whole system.
Of course your current implementation of 2) does the IP-address change, but that is because of the magic the test system does with the defined nodes. If there is a way to define the same node with your changes, maybe within the 'let' block, without the IP-address change, then it's perfectly valid. Maybe you can mimic the stuff the test system does to the nodes (basically its configuration for eth0). If we found a way to define several states for one node, we could even add something like $machine->switchTo() to the test API to make this more official.
Maybe you are onto something here; maybe all tests should be of this kind:
I really like the idea of always starting a minimal NixOS ISO, then switching to the intended config. This should really be baked into the testing procedure. This would be tremendous!
Related: when I was trying to fix Xen, a nixos-rebuild switch would restart all Xen-related services, but messed something up.
About two minutes after that, a watchdog within the hypervisor would trip, and the machine would hard-reset.
It made it rather difficult to debug stuff until I had figured that out and switched to nixos-rebuild boot && shutdown -r now.
But automatically testing that with NixOS's QEMU framework would be even more difficult.
Proposal:
Some more insights on this issue.
There is nesting.clone, which actually solves the problem with IPs and network devices. It just needs to be nicely integrated into the test-runner API.
Currently nesting.clone accepts a list. attrsOf nixosSubmodule would be better, because you can specify a name (which can be used as the configuration name later). Do you agree, @nbp?
One more datapoint for such tests: https://stackoverflow.com/questions/44220338/taskserver-fails-to-start-on-nixos
The test should:
(cc @aszlig btw ---^)
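For reference, the nesting.clone approach mentioned above can be sketched as follows. The base node and the clone only differ in one option (networking.nat here, reusing the earlier example); the clone's switch script ends up under /run/current-system/fine-tune/, numbered in list order:

```nix
import <nixpkgs/nixos/tests/make-test.nix> {
  nodes.machine = {
    networking.firewall.enable = false;
    # Alternative configurations of the same node; no extra VM node is
    # created, so IPs and network devices stay unchanged.
    nesting.clone = [
      { networking.nat.enable = true; }
    ];
  };
  testScript = ''
    $machine->waitForUnit("default.target");
    # Activate the first (and only) clone
    $machine->succeed("/run/current-system/fine-tune/child-1/bin/switch-to-configuration test");
  '';
}
```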
Testing the switch to a new configuration is already done in the Taskserver test, but the stackoverflow issue could have been triggered more easily if we had a test that upgrades from an older version to a newer one, not just stable/unstable.
I personally have a similar use-case for something like this in my deployments, where I test whether database migrations work successfully. Unfortunately, this needs a lot of imports from derivations and I usually have eval times of several minutes on these deployments :-/
So I guess testing between releases should probably end up in a dedicated jobset, which has two inputs like nixpkgs_old and nixpkgs_new, so we have a low eval time. Going further, this could probably be automated as well, if we just save the disk state for each test in nixpkgs_old and run the same test with that disk state against nixpkgs_new.
Unfortunately, I think this won't work for more complicated tests, so I'd propose something like a snapshot command in the Perl test driver, which would save the current machine state (including a memory dump).
Having such a snapshot command would also help for speeding up boot times in tests, because all that's needed is to restore the state instead of going through the full boot. The idea of starting with a basic machine would also have that advantage.
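A rough sketch of what such a two-input jobset entry point could look like (the input names follow the proposal above; "some-test.nix" is a placeholder for whichever test is being run, and a real setup would also thread the saved disk state from the old run into the new one):

```nix
{ nixpkgs_old, nixpkgs_new }:
let
  # Evaluate the same test expression against each nixpkgs tree, so
  # only the two inputs need to be fetched and evaluated.
  testFor = nixpkgs: import "${nixpkgs}/nixos/tests/some-test.nix" {};
in {
  test-on-old = testFor nixpkgs_old;
  test-on-new = testFor nixpkgs_new;
}
```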
Also, I've been thinking about making it easier to switch between system configurations, but didn't yet come up with a very good solution, but these are the ones I had in mind:
Use generation numbers, switching via something like $machine->switch(10);.
Use a separate attribute, which in turn contains an attribute set mapping from generation name to the corresponding configuration.
Pros:
Cons:
Use the key attribute to define conditions for specific generations, for example:
{
  machine = { lib, ... }: {
    imports = lib.singleton {
      key = "disable-foo";
      services.foo.enable = lib.mkForce false;
    };
    services.foo.enable = true;
  };
}
So the initial generation has service foo enabled, but if you do $machine->switch('disable-foo'); the foo service is disabled.
Use config.generations, which references the configuration values to merge. One con: you can't filter lists by priority (there is no lib.filterList <prio> <fun>), so you need to mkOverride whole lists:
{
  machine = { generation, lib, ... }: {
    generations = [ "enabled-foo" ];
    services.foo.enable = generation.enabled-foo;
  };
}
Had some more ideas, but I'm currently writing-impaired (broken shoulder and typing with one hand) so I won't list all of them as I didn't come up with a good solution that's both simple and declarative yet.
Another case that should have been caught by rebuild-switch tests: https://github.com/NixOS/nixpkgs/issues/56134
I tried the suggested methods to write a test that exercises switch-to-configuration, to aid fixing #60180, but with no success.
The problem is that both webserver and webserver2 add a DNS entry for a.example.com, so
when requesting a Let's Encrypt cert, letsencrypt sometimes tries to contact webserver2, which
is only there to build the switchToNewServer command and isn't actually running,
which makes the test fail.
I want a way to build a config which I can switch to, without creating a bogus node for that config, as that
is actually undesired behaviour in this case and screws with the tests.
let
  commonConfig = ./common/letsencrypt/common.nix;
in import ./make-test.nix {
  name = "acme";
  nodes = rec {
    letsencrypt = ./common/letsencrypt;
    webserver = { config, pkgs, ... }: {
      imports = [ commonConfig ];
      networking.firewall.allowedTCPPorts = [ 80 443 ];
      networking.extraHosts = ''
        ${config.networking.primaryIPAddress} a.example.com
      '';
      services.nginx.enable = true;
      services.nginx.virtualHosts."a.example.com" = {
        enableACME = true;
        forceSSL = true;
        locations."/".root = pkgs.runCommand "docroot" {} ''
          mkdir -p "$out"
          echo hello world > "$out/index.html"
        '';
      };
    };
    webserver2 = { config, pkgs, ... }: {
      imports = [ webserver ];
      networking.extraHosts = ''
        ${config.networking.primaryIPAddress} b.example.com
      '';
      services.nginx.virtualHosts."b.example.com" = {
        enableACME = true;
        forceSSL = true;
        locations."/".root = pkgs.runCommand "docroot" {} ''
          mkdir -p "$out"
          echo hello world > "$out/index.html"
        '';
      };
    };
    client = commonConfig;
  };
  testScript = { nodes, ... }:
    let
      newServerSystem = nodes.webserver2.config.system.build.toplevel;
      switchToNewServer = "${newServerSystem}/bin/switch-to-configuration test";
    in ''
      $client->waitForUnit("default.target");
      $letsencrypt->waitForUnit("default.target");
      $letsencrypt->waitForUnit("boulder.service");
      $webserver->waitForUnit("nginx.service");

      # This step already fails, as "a.example.com" points to two servers
      $webserver->waitForUnit("acme-a.example.com.service");
      $client->succeed('curl https://a.example.com/ | grep -qF "hello world"');

      $webserver->succeed("${switchToNewServer}");
      $webserver->waitForUnit("acme-b.example.com.service");
      $client->succeed('curl https://b.example.com/ | grep -qF "hello world"');
    '';
}
The suggested methods don't work in this particular case, because both nodes register their domain names, and then I have two nodes with the same domain name running, which obviously breaks requesting certificates. So the tests already fail before doing switch-to-configuration.
Is there any way to have a "new config" and "old config" without creating two nodes?
@danbst thanks for pointing me to nesting.clone; that very much seems to be what we're looking for here.
However, it doesn't seem to work in this case. build-vms.nix injects the evaluated nodes argument
into the NixOS modules it evaluates (https://github.com/NixOS/nixpkgs/blob/master/nixos/lib/build-vms.nix#L27),
whilst nesting.clone doesn't preserve this argument (it calls eval-config.nix manually, as build-vms.nix does too, but passes different arguments):
https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/system/activation/top-level.nix#L13-L23
So when you try to use it in a NixOS test, you'll get the following error:
while evaluating the module argument `nodes' in "/home/arian/Projects/nixpkgs/nixos/tests/common/letsencrypt/common.nix":
while evaluating the attribute '_module.args.nodes' at undefined position:
attribute 'nodes' missing, at /home/arian/Projects/nixpkgs/lib/modules.nix:163:28
Example test here:
let
  commonConfig = ./common/letsencrypt/common.nix;
in import ./make-test.nix {
  name = "acme";
  nodes = {
    letsencrypt = ./common/letsencrypt;
    webserver = { config, pkgs, ... }: {
      imports = [ commonConfig ];
      networking.firewall.allowedTCPPorts = [ 80 443 ];
      networking.extraHosts = ''
        ${config.networking.primaryIPAddress} a.example.com
      '';
      services.nginx.enable = true;
      services.nginx.virtualHosts."a.example.com" = {
        enableACME = true;
        forceSSL = true;
        locations."/".root = pkgs.runCommand "docroot" {} ''
          mkdir -p "$out"
          echo hello world > "$out/index.html"
        '';
      };
      nesting.clone = [
        ({ config, pkgs, ... }: {
          networking.extraHosts = ''
            ${config.networking.primaryIPAddress} b.example.com
          '';
          services.nginx.virtualHosts."b.example.com" = {
            enableACME = true;
            forceSSL = true;
            locations."/".root = pkgs.runCommand "docroot" {} ''
              mkdir -p "$out"
              echo hello world > "$out/index.html"
            '';
          };
        })
      ];
    };
    client = commonConfig;
  };
  testScript = ''
    $client->waitForUnit("default.target");
    $letsencrypt->waitForUnit("default.target");
    $letsencrypt->waitForUnit("boulder.service");
    $webserver->waitForUnit("nginx.service");
    $webserver->waitForUnit("acme-a.example.com.service");
    $client->succeed('curl https://a.example.com/ | grep -qF "hello world"');

    # Incrementally enabling virtual hosts should just work
    $webserver->succeed("/run/current-system/fine-tune/child-1/bin/switch-to-configuration test");
    $webserver->waitForUnit("acme-b.example.com.service");
    $client->succeed('curl https://b.example.com/ | grep -qF "hello world"');
  '';
}
I just created a PR that fixes these issues.
Thank you for your contributions.
This has been automatically marked as stale because it has had no activity for 180 days.
If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.