Nixpkgs: NixOS Unstable and 16.09 Release Channel Not Updating

Created on 14 Dec 2016 · 54 comments · Source: NixOS/nixpkgs

For the past week or so, NixOS release-16.09 hasn't been updated by hydra due to issues with kernel builds.

Kernels have been timing out.

At one point, instability in the 4.8 build was also causing issues.

This is getting a bit dire now, because users on 16.09 who use ACME / Let's Encrypt are unable to renew their certificates (https://github.com/NixOS/nixpkgs/issues/21144), which makes this a time bomb: someone's certificate could run out at any moment.

I'm not sure how to debug or investigate this, and would really like some help.

/cc @rbvermaa, @edolstra, @domenkozar

Labels: blocker, security

All 54 comments

I restarted the last one and it was immediately successful. (I presume the upload to cache was OK and it hung afterwards.)

@vcunat Interesting ... can you try restarting the others, too, and see how it goes?

... but I think the main channel-blockers are somewhere else. Actually, I'm not even aware of 4.8 kernel being channel-critical in 16.09.

At one point I went through and ran the "Reproduce Locally" script for each failing test, and every single one passed.

Yeah, I've never been able to reproduce Hydra failures where the log is correct and complete to the end, followed by a timeout. I've got no idea why exactly these hangs happen, but it's certainly nothing new; maybe it was worse now due to increased machine load or something...

Even for the VM tests where there is an error in the log, I believe every single one passed locally :(

Anyway -- whether the fix to this issue is changes to nixpkgs, or changes to Hydra / the servers / whatever, the important thing is that we get an update out.

At least the firewall tests on 16.09 seem to fail reproducibly with some vsftpd problem: http://hydra.nixos.org/build/44745942 http://hydra.nixos.org/build/44745934

I ran the reproduce script for those tests and they passed just fine.
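In case it helps anyone follow along: besides Hydra's reproduce script, the same VM test can be rebuilt from a local nixpkgs checkout. A rough sketch, assuming an x86_64-linux box with KVM available and that the attribute path matches the Hydra job name (tests.firewall.x86_64-linux):

    # Rebuild the NixOS firewall VM test locally from the release-16.09 branch.
    # Since the build failed on Hydra, nothing gets substituted from the binary
    # cache, so the test actually runs here.
    git clone https://github.com/NixOS/nixpkgs.git
    cd nixpkgs
    git checkout release-16.09
    nix-build nixos/release.nix -A tests.firewall.x86_64-linux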

Those tests also started failing on master. I suspect the underlying problem might be the same as http://lists.science.uu.nl/pipermail/nix-dev/2016-December/022311.html (also maybe)

@edolstra has pushed some changes to the release and master branches, and the 16.09-small channel has already updated. Looks like we should be in the clear, but I'll update again in several hours.

So #20500 was the solution?

Not sure; now weird stuff is failing on Hydra but not locally. Example: https://hydra.nixos.org/build/44846830. @edolstra, assuming these failures come from the Nix upgrade, can we revert Nix to the older version until more testing is done?

Update: @FRidh has disabled the pytest tests, and we're now seeing issues from samba.

(messaged nix-dev about this at http://lists.science.uu.nl/pipermail/nix-dev/2016-December/022366.html)

I messaged to #nixos:

as I understand it, the issue is that pytest has a test which chmods a directory to 0000
and expects not to be able to write in there. If it can write, the test fails. With user
namespacing, it can write. Hydra is running unstable Nix, which turns on user
namespacing. The reason this test started failing out of the blue is that the nix
daemon is, technically, an impure input, so it didn't start failing as soon as Nix was
changed, but the next time a big rebuild came around. To make things worse,
nixUnstable had a schema change, so Hydra can't just be reset to nix-stable.

Further detail I sent to #nixos:

This is what user namespacing is allowing, which I think breaks POSIX:
mkdir test; chmod 0000 test; touch test/foo
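For reference, the same behaviour can be seen outside of a Nix build with util-linux's unshare. This is only an illustration of the mechanism, not the actual sandbox setup:

    # As an ordinary user, mode 0000 on a directory we own blocks writes:
    mkdir test && chmod 0000 test
    touch test/foo             # -> Permission denied

    # Inside a new user namespace with our uid mapped to 0, CAP_DAC_OVERRIDE
    # applies to files owned by mapped uids, so the same write succeeds --
    # which is exactly what trips up the pytest test:
    unshare -r touch test/foo  # -> succeeds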

@edolstra @rbvermaa @domenkozar should Hydra be switched to nixStable so that release-16.09 can be built reliably?

To add some urgency to this: an embargoed Exim vulnerability is going public on December 25th (who picks Christmas??)... and I'm hopeful this is fixed before then.

(Perhaps it's on purpose; black hats will surely take a holiday ;-)

For those who want to reproduce the bugs happening on Hydra:
https://github.com/lethalman/nix-user-chroot

It suffers from the same problems since it uses user namespaces; I was able to reproduce the Go bug, for instance.
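If anyone wants to try it, usage is roughly the following (details from memory of the README, so treat the exact invocation as an assumption):

    # nix-user-chroot <dir> <command>: enter a user namespace and bind-mount
    # <dir> at /nix, so a single-user Nix inside sees the same
    # root-inside-a-namespace environment that trips up these tests.
    mkdir -p ~/.nix
    ./nix-user-chroot ~/.nix bash
    # then, inside that shell, rerun a failing build, e.g.:
    nix-build '<nixpkgs>' -A go_1_7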

I think that not running as UID 0 in the namespace should solve most of these (if not all).

nix-user-chroot does not run as root; it maps its own UID into the namespace. The Go build still fails, though.

@vcunat the issue with pytest works as a regular user, too (https://github.com/NixOS/nixpkgs/issues/21145#issuecomment-267833849)

This might be related to parted build failures on Hydra now: http://hydra.nixos.org/build/45093914

When I build parted on my own box it succeeds, because it skips all the device-mapper tests due to not being run as UID 0.

On Hydra it tries to run them, since it is "root" there, but they fail.
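The mechanism is easy to see on any machine with unprivileged user namespaces enabled; a tiny illustration (not the actual Hydra setup):

    id -u             # e.g. 1000 -- the suite skips its root-only tests
    unshare -r id -u  # 0 -- inside the namespace the suite thinks it is root,
                      # so the device-mapper tests run, and presumably fail
                      # because this "root" has no real privileges on the host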

(See also #21281)

Yes, parted looks like that, and coreutils, too, which blocks almost all Linux stuff.

parted had its tests disabled in 2fdd4973ec4e867ff76ef6608ac75e89aa0d4514. I would really rather not see lots of packages have their test suites disabled if we can avoid it.

Now nixpkgs-unstable and both *-small channels got updated, at least.

Commits were reverted on nix: https://github.com/NixOS/nix/commits/master

and the jobs on Hydra were re-run; the massive rebuild breaking 8,000 packages is no longer there: https://hydra.nixos.org/jobset/nixos/release-16.09

@aristidb can you revert your parted change?

@FRidh can you revert the disabling of pytest tests?

@grahamc I plan to make the latest pytest the default. Because of a circular dependency with hypothesis, the tests will need to remain disabled then.

@globin, @domenkozar, @cstrahan, @nathan7 (contributors to go in nixpkgs): hydra is erroring on:

--- FAIL: TestSCMCredentials (0.00s)
    creds_test.go:66: WriteMsgUnix failed with invalid argument, want EPERM

(https://hydra.nixos.org/build/44853356/nixlog/3) in 1.6 and 1.7 at least. Seemingly related: https://github.com/golang/go/issues/10703

can you look at it?

That go problem doesn't happen with my local nix (on the same nixpkgs commit).

Yes, this is introduced by user namespacing. We're still using user namespacing on hydra. Only the seccomp changes were reverted.

I see now. It should be run with UID ~~1~~ 1000 and GID ~~1~~ 100.

KDE5 tests fail on unstable and 16.09; I don't see why currently. On unstable it's actually the last issue preventing channel update.

KDE tests frequently fail for "no reason"... will investigate.

@grahamc while I can patch that out, things shouldn't be running as UID 0, and I'm not particularly interested in ignoring a valid test

@nathan7 the tests aren't running as uid 0 at this point.

KDE5 tests fail on unstable and 16.09; I don't see why currently. On unstable it's actually the last issue preventing channel update.

These tests frequently time-out, I'm afraid.

There were several failed attempts, so I thought it was some more permanent problem, but I see 16.09 succeeded completely now. nixos-unstable is now the only long-failing channel (~12 days).

16.09 is updating okay now. I think, though, that:

  1. ~~the patch for go_1_7 should be applied to 1_6~~
  2. ~~the patches for release-16.09 should be reviewed and applied to master~~

@vcunat not sure where to put this, but @LnL7 and @acowley are asking for staging to be merged to master.

We need to wait for Hydra; that's what staging is for.

My question was a bit more general, e.g. are there reasons not to merge in the last commit that was built by Hydra, if the results look OK?

At the moment the interval between syncs of master and staging is pretty long, IMHO.

In general there aren't, but the last commit fully built on Hydra is relatively old, meaning there have been a lot of changes on master since then, so the merge would create lots of new jobs by itself. My tool estimates ~10k just for x86_64-linux, and that doesn't count e.g. Haskell stuff. There are thousands of jobs queued for master even now :-/

It isn't that bad, really; I remember periods where the lag was counted in months :-)

I also had to backport the Go patch to go_bootstrap, which was having trouble after a mass rebuild. Now some tests aren't completing fast enough: https://hydra.nixos.org/build/45276404/nixlog/1

It's probably better to disable go tests altogether.

That could be. I wonder, though, if it is because the machine it is running on is highly loaded, due to the mass rebuild?

The Go tests are pretty CPU-intensive.

They passed ok, but the instability is a bit troubling. Not the biggest worry, perhaps, though...

All five channels are recent now :-)
