Nixpkgs: NixOS Unstable and 16.09 Release Channel Not Updating

Created on 14 Dec 2016 · 54 comments · Source: NixOS/nixpkgs

For the past week or so, NixOS release-16.09 hasn't been updated by hydra due to issues with kernel builds.

Kernels have been timing out.

At one point, instability in the 4.8 build was also causing issues.

This is getting a bit dire now, because users on 16.09 who use ACME / Let's Encrypt are unable to renew their certificates (https://github.com/NixOS/nixpkgs/issues/21144), which makes this a time bomb: someone's certificate could run out at any moment.

I'm not sure how to debug or investigate this, and would really like some help.

/cc @rbvermaa, @edolstra, @domenkozar

Labels: blocker, security

All 54 comments

I restarted the last one and it was immediately successful. (I presume the upload to cache was OK and it hung afterwards.)

@vcunat Interesting ... can you try restarting the others, too, and see how it goes?

... but I think the main channel-blockers are somewhere else. Actually, I'm not even aware of 4.8 kernel being channel-critical in 16.09.

At one point I went through and ran the "Reproduce Locally" script for each failing test, and every single one passed.

Yeah, I've never been able to reproduce Hydra failures where the log is correct and complete to the end, followed by a timeout. I've got no idea why exactly these hangs happen, but it's certainly nothing new; maybe it was worse now due to increased machine load or something...

Even for the VM tests where there is an error in the log, I believe every single one passed locally :(

Anyway -- whether the fix to this issue is changes to nixpkgs, or changes to Hydra / the servers / whatever, the important thing is that we get an update out.

At least the firewall tests on 16.09 seem to fail reproducibly with some vsftpd problem: http://hydra.nixos.org/build/44745942 http://hydra.nixos.org/build/44745934

I ran the reproduce script for those tests and they passed just fine.
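In case it helps anyone follow along: besides Hydra's reproduce script, the same VM test can be rebuilt from a local nixpkgs checkout. A rough sketch, assuming an x86_64-linux box with KVM available and that the attribute path matches the Hydra job name (tests.firewall.x86_64-linux):

    # Rebuild the NixOS firewall VM test locally from the release-16.09 branch.
    # Since the build failed on Hydra, nothing gets substituted from the binary
    # cache, so the test actually runs here.
    git clone https://github.com/NixOS/nixpkgs.git
    cd nixpkgs
    git checkout release-16.09
    nix-build nixos/release.nix -A tests.firewall.x86_64-linux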

Those tests also started failing on master. I suspect the underlying problem might be the same as http://lists.science.uu.nl/pipermail/nix-dev/2016-December/022311.html (also maybe)

@edolstra has pushed some changes to the release and master branches, and the 16.09-small channel has already updated. Looks like we should be in the clear, but I'll update again in several hours.

So #20500 was the solution?

Not sure; now weird stuff is failing on Hydra but not locally. Example: https://hydra.nixos.org/build/44846830. @edolstra, assuming these failures come from the Nix upgrade, can we revert Nix to the older version until more testing is done?

Update: @FRidh has disabled the pytest tests, and we're now seeing issues from samba.

(messaged nix-dev about this at http://lists.science.uu.nl/pipermail/nix-dev/2016-December/022366.html)

I messaged to #nixos:

as I understand it, the issue is that pytest has a test which chmods a directory to 0000
and expects not to be able to write in there. If it can write, the test fails. With user
namespacing, it can write. Hydra is running unstable Nix, which turns on user
namespacing. The reason this test started failing out of the blue is that the nix
daemon is, technically, an impure input, so it didn't start failing as soon as Nix was
changed, but the next time a big rebuild came around. To make things worse,
nixUnstable had a schema change, so Hydra can't just be reset to nix-stable.

Further detail I sent to #nixos:

This is what user namespacing is allowing, which I think breaks POSIX:
mkdir test; chmod 0000 test; touch test/foo
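For reference, the same behaviour can be seen outside of a Nix build with util-linux's unshare. This is only an illustration of the mechanism, not the actual sandbox setup:

    # As an ordinary user, mode 0000 on a directory we own blocks writes:
    mkdir test && chmod 0000 test
    touch test/foo             # -> Permission denied

    # Inside a new user namespace with our uid mapped to 0, CAP_DAC_OVERRIDE
    # applies to files owned by mapped uids, so the same write succeeds --
    # which is exactly what trips up the pytest test:
    unshare -r touch test/foo  # -> succeeds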

@edolstra @rbvermaa @domenkozar should Hydra be switched to nixStable so that release-16.09 can be built reliably?

To add some urgency to this: an embargoed Exim vulnerability is going public on December 25th (who picks Christmas??)... and I'm hopeful this is fixed before then.

(Perhaps it's on purpose; black hats will surely take a holiday ;-)

For those who want to reproduce the bugs happening on Hydra:
https://github.com/lethalman/nix-user-chroot

It suffers from the same problems since it uses user namespaces; I was able to reproduce the Go bug, for instance.
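If anyone wants to try it, usage is roughly the following (details from memory of the README, so treat the exact invocation as an assumption):

    # nix-user-chroot <dir> <command>: enter a user namespace and bind-mount
    # <dir> at /nix, so a single-user Nix inside sees the same
    # root-inside-a-namespace environment that trips up these tests.
    mkdir -p ~/.nix
    ./nix-user-chroot ~/.nix bash
    # then, inside that shell, rerun a failing build, e.g.:
    nix-build '<nixpkgs>' -A go_1_7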

I think that not running as UID 0 in the namespace should solve most of these (if not all).

nix-user-chroot does not run as root; it maps its own UID into the namespace. The Go build still fails, though.

@vcunat the issue with pytest works as a regular user, too (https://github.com/NixOS/nixpkgs/issues/21145#issuecomment-267833849)

This might be related to parted build failures on Hydra now: http://hydra.nixos.org/build/45093914

When I build parted on my own box it succeeds, because it skips all the device-mapper tests due to not being run as UID 0.

On Hydra it tries to run them, since it is "root" there, but they fail.
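The mechanism is easy to see on any machine with unprivileged user namespaces enabled; a tiny illustration (not the actual Hydra setup):

    id -u             # e.g. 1000 -- the suite skips its root-only tests
    unshare -r id -u  # 0 -- inside the namespace the suite thinks it is root,
                      # so the device-mapper tests run, and presumably fail
                      # because this "root" has no real privileges on the host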

(See also #21281)

Yes, parted looks like that, and coreutils, too, which blocks almost all Linux stuff.

parted had its tests disabled in 2fdd4973ec4e867ff76ef6608ac75e89aa0d4514. I would really rather not see lots of packages have their test suites disabled if we can avoid it.

Now nixpkgs-unstable and both *-small channels got updated, at least.

Commits were reverted on nix: https://github.com/NixOS/nix/commits/master

and the jobs on Hydra were re-run; the massive rebuild breaking 8,000 packages is no longer there: https://hydra.nixos.org/jobset/nixos/release-16.09

@aristidb can you revert your parted change?

@FRidh can you revert the disabling of pytest tests?

@grahamc I plan to make the latest pytest the default. Because of a circular dependency with hypothesis, the tests will need to remain disabled then.

@globin, @domenkozar, @cstrahan, @nathan7 (contributors to go in nixpkgs): hydra is erroring on:

--- FAIL: TestSCMCredentials (0.00s)
    creds_test.go:66: WriteMsgUnix failed with invalid argument, want EPERM

(https://hydra.nixos.org/build/44853356/nixlog/3) in 1.6 and 1.7 at least. Seemingly related: https://github.com/golang/go/issues/10703

can you look at it?

That go problem doesn't happen with my local nix (on the same nixpkgs commit).

Yes, this is introduced by user namespacing. We're still using user namespacing on hydra. Only the seccomp changes were reverted.

I see now. It should be run with UID ~~1~~ 1000 and GID ~~1~~ 100.

KDE5 tests fail on unstable and 16.09; I don't see why currently. On unstable it's actually the last issue preventing channel update.

KDE tests frequently fail for "no reason"... will investigate.

@grahamc while I can patch that out, things shouldn't be running as UID 0, and I'm not particularly interested in ignoring a valid test

@nathan7 the tests aren't running as uid 0 at this point.

KDE5 tests fail on unstable and 16.09; I don't see why currently. On unstable it's actually the last issue preventing channel update.

These tests frequently time-out, I'm afraid.

There were several failed attempts, so I thought it was some more permanent problem, but I see 16.09 succeeded completely now. nixos-unstable is now the only long-failing channel (~12 days).

16.09 is updating okay now. I think, though, that:

  1. ~~the patch for go_1_7 should be applied to 1_6~~
  2. ~~the patches for release-16.09 should be reviewed and applied to master~~

@vcunat not sure where to put this, but @LnL7 and @acowley are asking for staging to be merged to master.

We need to wait for Hydra; that's what staging is for.

My question was a bit more general, e.g. are there reasons not to merge in the last commit that was built by Hydra, if the results look OK?

At the moment the interval between syncs of master and staging is pretty long, IMHO.

In general there aren't, but the last commit fully built on Hydra is relatively old, meaning there have been a lot of changes on master since then, so the merge would create lots of new jobs by itself. My tool estimates ~10k just for x86_64-linux, and that doesn't count e.g. Haskell stuff. There are thousands of jobs queued for master even now :-/

It isn't that bad, really; I remember periods where the lag was counted in months :-)

I also had to backport the Go patch to go_bootstrap, which was having trouble after a mass rebuild. Now some tests aren't completing fast enough: https://hydra.nixos.org/build/45276404/nixlog/1

It's probably better to disable go tests altogether.

That could be. I wonder, though, if it is because the machine it is running on is highly loaded, due to the mass rebuild?

The Go tests are pretty CPU-intensive.

They passed ok, but the instability is a bit troubling. Not the biggest worry, perhaps, though...

All five channels are recent now :-)
