Current master passes the installer test on Hydra, but for me the build fails on two different physical machines with different configurations. What's more, they fail with random errors, different each time.
Examples:
machine# error: anonymous function at /tmp/root/nix/store/8ikvwg07ly32jhwx9fnhknz3xrqgvmdf-nixos-17.03.git.183eeb3/nixos/pkgs/development/libraries/libsndfile/default.nix:1:1 called with unexpected argument ‘Carbon’, at /tmp/root/nix/store/8ikvwg07ly32jhwx9fnhknz3xrqgvmdf-nixos-17.03.git.183eeb3/nixos/lib/customisation.nix:56:12
machine# (use ‘--show-trace’ to show detailed location information)
machine# missing /nix/store/kmbrk8ggg9lxyrq06s7by9iigw0m4zr1-systemd-232/example/systemd/user/basic.target
machine# builder for ‘/nix/store/2l4j5awx03ll0kn5mjv9ijxi9dgdsyng-user-units.drv’ failed with exit code 1
machine# cannot build derivation ‘/nix/store/alivhi6kns4ks9rddcns2887pd9h7m93-etc.drv’: 1 dependencies couldn't be built
machine# building path(s) ‘/nix/store/ldgzhmmpvkgrq7wv5w5yywcwi433xhjb-users-groups.json’
machine# cannot build derivation ‘/nix/store/89cpcsf0gszg3k074a5qabib3z5xkrd5-nixos-system-nixos-17.03.git.183eeb3.drv’: 1 dependencies couldn't be built
machine# error: build of ‘/nix/store/89cpcsf0gszg3k074a5qabib3z5xkrd5-nixos-system-nixos-17.03.git.183eeb3.drv’ failed
machine# /nix/store/f4x7al7cj75qpq2fralpx2zpcw66bvjx-nix-1.11.6/bin/nix-store: error while loading shared libraries: liblzma.so.5: cannot open shared object file: No such file or directory
machine: exit status 127
machine# copy-from-other-stores.pl: error: path ‘/nix/store/yrsknywymm8w3cm94k2x1gc050mbk6j9-systemd-232-dev’ is not valid
machine# copy-from-other-stores.pl: error: path ‘/nix/store/7x3rvfr5267knix95ny0nvp1sm0w9qqa-libusb-1.0.20-dev’ is not valid
machine# copy-from-other-stores.pl: error: path ‘/nix/store/3w715phcnk0prpfqbzawn1g1s6rwxkvk-libxml2-2.9.4-dev’ is not valid
machine# copy-from-other-stores.pl: error: path ‘/nix/store/fai84p66ck5md3kbf8hqwkx0rggssha3-libxslt-1.1.29-dev’ is not valid
machine# copy-from-other-stores.pl: error: path ‘/nix/store/arlnb1fiywskba1xrdf8n4rmnqadkv1x-zlib-1.2.11-dev’ is not valid
machine# building path(s) ‘/nix/store/w3y63q62a4y99m9j51zx05acspw1wxkq-pkg-config-0.29.tar.gz’
machine# copy-from-other-stores.pl: error: path ‘/nix/store/fbcb8723p24y871cdfla0y9jli4spaxx-hook’ is not valid
machine# copy-from-other-stores.pl: error: path ‘/nix/store/vp9x3smmg0wxgsqj3zlhb1nspggjfxdc-libtool-2.4.6-lib’ is not valid
machine# copy-from-other-stores.pl: error: path ‘/nix/store/mq1310msj618b4dws4s23z6mmx80y521-cyrus-sasl-2.5.10’ is not valid
machine# downloading ‘https://pkgconfig.freedesktop.org/releases/pkg-config-0.29.tar.gz’...
machine# error: unable to download ‘https://pkgconfig.freedesktop.org/releases/pkg-config-0.29.tar.gz’: Couldn't resolve host name (6)
machine# builder for ‘/nix/store/bnyvm9fdz4178i26j37yd081nrn12mnp-pkg-config-0.29.tar.gz.drv’ failed with exit code 1
Related: https://github.com/NixOS/nixpkgs/pull/18689 -- both @domenkozar and I experienced this a long time ago.
nix-build -A tests.installer.simple.x86_64-linux nixos/release.nix
System (NixOS: nixos-version, Ubuntu/Fedora: lsb_release -a, ...): a9584c9510771f96594b4461e9ea546a75bf59d4
Nix version (nix-env --version): 1.11.6
Nixpkgs version (nix-instantiate --eval '<nixpkgs>' -A lib.nixpkgsVersion): 183eeb3c0fdac8de3146aedaa6028b474f96db6f
Works for me, with one caveat: quite often some of the reboots done by the test take quite a bit of time, e.g.:
machine# [ 0.000000] Zone ranges:
machine# [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
machine# [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000002ffddfff]
machine# [ 0.000000] Normal empty
machine# [ 0.000000] Movable zone start for each n
machine# [ 0.000000] tsc: Detected 2593.994 MHz processor
machine# [ 76.867607] Calibrating delay loop (skipped) preset value.. 5187.98 BogoMIPS (lpj=2593994)
machine# [ 76.868478] pid_max: default: 32768 minimum: 301
machine# [ 76.868987] ACPI: Core revision 20160831
machine# [ 76.870146] ACPI: 1 ACPI AML tables successfully acquired and loaded
machine# [ 76.870830] Security Framework initialized
That never seems to happen on the first boot, only on subsequent boots of the same VM. IIRC I can see those in the Hydra logs as well.
Re: the random failures: I have seen that happen when nested virtualization is used.
@dezgeg,
~ cat /sys/module/kvm_intel/parameters/nested
N
Am I safe with this (I haven't dealt with nested virt before)?
Anyway, I agree that this feels like something kernel or hardware-related...
EDIT: I have also suspected filesystems, but one of my machines has ZFS and the other btrfs -- different enough IMO...
By nested virtualization I mean running nix-build -A tests.installer.simple.x86_64-linux nixos/release.nix in a virtual machine (in my case, something managed by VMware vCenter).
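For anyone wanting to check both sides of this, here is a minimal Python sketch. It assumes an x86 Linux host, where a guest normally sees the hypervisor CPU flag in /proc/cpuinfo (a hypervisor can hide it, so a negative result is only a hint). The function names is_guest and nested_enabled are made up for illustration.

```python
from pathlib import Path


def is_guest(cpuinfo_text: str) -> bool:
    """True if any 'flags' line of /proc/cpuinfo lists the 'hypervisor' flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "hypervisor" in value.split():
            return True
    return False


def nested_enabled(param_text: str) -> bool:
    """Interpret /sys/module/kvm_*/parameters/nested (kvm_intel prints Y/N, kvm_amd 1/0)."""
    return param_text.strip() in ("Y", "1")


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("running inside a VM:", is_guest(cpuinfo.read_text()))
    for module in ("kvm_intel", "kvm_amd"):
        param = Path(f"/sys/module/{module}/parameters/nested")
        if param.exists():
            print(module, "nested:", nested_enabled(param.read_text()))
```

The nested parameter only says whether this host lets its own guests run KVM; whether the test VMs end up nested depends on whether the machine running nix-build is itself a guest.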
No, nothing like this (I feared that we nest VMs somewhere inside tests).
EDIT: nearly forgot: all other tests run fine.
Interesting. Maybe it could be a filesystem problem after all, if the filesystems that cause problems have inode numbers that don't fit in 32 bits, unlike ext4 (which is what I have), where they often do.
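A rough way to test that guess on a given host is sketched below in Python: walk a directory tree and report entries whose inode number needs more than 32 bits. This is only an illustration of the hypothesis, not the debugging hack from the branch; wide_inodes and fits_in_32_bits are hypothetical names.

```python
import os

MAX_U32 = 2**32 - 1


def fits_in_32_bits(ino: int) -> bool:
    """True if the inode number is representable as an unsigned 32-bit integer."""
    return 0 <= ino <= MAX_U32


def wide_inodes(root: str):
    """Yield (path, inode) for entries under root whose inode number exceeds 32 bits."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                ino = os.lstat(path).st_ino
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if not fits_in_32_bits(ino):
                yield path, ino
```

Running list(wide_inodes("/nix/store")) on the ZFS and btrfs machines and on an ext4 one would show whether the affected hosts really differ in this respect.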
Does this branch show funky/random behaviour on the directory listing hack I added? https://github.com/dezgeg/nixpkgs/tree/install-debug-hack
...
machine# can't open ./doit.cpp: 24
machine# can't open ./nixpkgs: 24
machine: exit status 0
error: command `sleep 1; cd $(readlink -f $(nix-instantiate --find-file nixpkgs)) && /nix/store/a2p5f0vpzfgd94rqjadkibkkrax71lbr-hack . >&2' unexpectedly succeeded at /nix/store/xw6r66wjxjplk1vidzi1rcv8r1432h3z-nixos-test-driver/lib/perl5/site_perl/Machine.pm line 352, <__ANONIO__> line 380.
24 = too many open files, which seems legit. Before a flurry of those errors it opened files successfully which also seems okay.
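For reference, the number can be decoded without looking it up by hand. The Python sketch below assumes it runs on Linux, where errno 24 is EMFILE; describe_errno is a made-up helper name. It also prints the per-process open-file limit, which is what EMFILE runs into.

```python
import errno
import os
import resource


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno such as the 24 printed by the hack."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


print(describe_errno(24))

# Soft and hard limits on open file descriptors for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open file limit (soft, hard):", soft, hard)
```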
Hm, I think it usually fails after the first nixos-install run. Maybe if I run that test a little bit later...
Yes, that is expected. What I was wondering is whether that output is consistent (i.e. whether running nix-build again gives different results) and whether some other weirdness is apparent (e.g. files seeming to belong to the wrong directory or something like that).
I'm running another test right now (running the hack after nixos-install), but I think nothing caught my eye, and it seemed to be consistent too. I'll recheck.
Nope, it fails on the same files and overall looks okay (I haven't checked the entire tree but...).
I have a crazy idea: parse the error message from nixos-install and read all the mentioned files to standard input. Will try this when I get back from work.
EDIT: and as a bonus, another error message:
machine# building the system configuration...
machine# error: infinite recursion encountered, at /tmp/root/nix/store/jc428ffk2irqf6nayhcqcv0rkllpzpd9-nixos-17.03.git.e19c54c/nixos/pkgs/development/libraries/libpng/default.nix:3:8
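The idea of parsing the nixos-install error output and reading the files it mentions could look roughly like the Python sketch below. It assumes the errors carry locations in the path:line:column form seen in the logs in this thread; mentioned_files, read_all and the root parameter are hypothetical, and nothing here is part of the test driver.

```python
import re

# A path followed by ':line:col', as in the Nix evaluation errors quoted in this thread.
LOCATION = re.compile(r"(/[^\s:‘’]+):(\d+):(\d+)")


def mentioned_files(log_text: str):
    """Return the unique file paths that appear with a position, in order of first appearance."""
    seen = []
    for path, _line, _col in LOCATION.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen


def read_all(paths, root=""):
    """Read each file fully; map path -> byte length, or the OSError that was raised."""
    result = {}
    for path in paths:
        try:
            with open(root + path, "rb") as handle:
                result[path] = len(handle.read())
        except OSError as exc:
            result[path] = exc
    return result
```

Reading the same files several times and comparing the results would show whether the 9p-mounted store returns different content between reads.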
I pushed to that branch a kernel patch that I think might fix a problem in the 9p client. At least it didn't break anything for me...
You, sir, are awesome :D This has actually fixed my problem! I'm still testing on the second machine but I can't reproduce this bug on the first one anymore.
Can't reproduce on either machine now. Thank you very much!
Okay, great. I will submit a proper patch to LKML and keep this open until it hits the 4.9 stable.
If it gets through review on LKML, I think we want to backport it to our kernels.
I have now had problems like this on master, in the installer test in particular. It often ended with "called with unexpected argument", and the exact error was always different.
Haven't had much response from LKML, guess time to ping again soon... https://patchwork.kernel.org/patch/9585901/ if we want to pick it to nixpkgs in the meantime.
Still no upstream response, guess time for a second ping... Anyway, applied to master for now (along with yet another pile of 9p fixes) in https://github.com/NixOS/nixpkgs/commit/ed41d50e9fe3d942cfde37e84de781c096309e5b.
Hmm, one test on 17.03 started to fail reliably with
path ‘...-nixos-system-machine-17.03pre-git’ is not valid
The Hydra logs show that this started precisely after kernel updates, including 4.9.29 -> 4.9.30, but this seems to depend on the host kernel anyway – I can't reproduce the error locally (4.9.29 ATM).
I also can't reproduce that failure with the 4.11.3 kernel (on a different NixOS machine).
Still not accepted by upstream?
They are in since v4.15: https://github.com/torvalds/linux/commit/d18bee424b129aa4755268feeeb1ee16cbde6afa and backported to older stables.
Hm, but we still have the patch in nixpkgs, and applying it does not fail.
Only for linux_mptcp_93 and linux_samus_4_12.