There's been discussion around this issue in various places, most notably https://github.com/NixOS/nixpkgs/issues/49384, but I thought it deserved its own issue.
To quote @vcunat and summarize the problem operationally:
Without manual test restarts we probably can't keep up reasonable channel update frequency. On unstable it's common to have other regressions, too.
The installer tests seem to be affected the most, but various other tests also time out.
Quoting @vcunat again for a likely cause:
Lately I relatively often see transient nixos test errors from packet-epyc-1; I suspect it may get overloaded occasionally.
I can only guess why that happens. One possibility: some derivations don't use `make -l$NIX_BUILD_CORES` or similar – I find that very annoying locally, as I have a cheap 16-threaded desktop and 16*16 is way too much (`--cores` + `--max-jobs`).
That would explain why we're suddenly seeing this so much with the new EPYC machine (which has a lot of cores).
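For reference, a rough sketch of the kind of per-package workaround being described (assumptions: the offending package uses GNU make and the standard nixpkgs build hooks; `somePackage` is just a placeholder for whichever derivation misbehaves):

```nix
# Sketch only: tell make to stop spawning new jobs once the load average
# reaches the core count Nix allots to this build, so several concurrent
# builds don't each run $NIX_BUILD_CORES jobs on top of one another.
somePackage.overrideAttrs (old: {
  preBuild = (old.preBuild or "") + ''
    makeFlagsArray+=(-l "$NIX_BUILD_CORES")
  '';
})
```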
There have been attempts to increase the timeout: https://github.com/NixOS/nixpkgs/pull/49441 (reverted) and #53828, without any immediate success. https://github.com/NixOS/nixpkgs/pull/53827 added time logging, which could be useful for debugging.
An ad-hoc fix would be to automate the manual restarts (e.g. if a tested job fails, restart it once), though that's obviously not ideal.
Someone has suggested using cgroups to limit the insides of build sandboxes, but I'm a bit at a loss about how to balance this so that the bigger machines don't get under-utilized. Example: if only one build on a machine can use high parallelism at a given moment, we do want it to use lots of cores; it all depends on the overall load.
Someone has suggested using cgroups to limit the insides of build sandboxes, but I'm a bit at a loss about how to balance this so that the bigger machines don't get under-utilized.
I always thought (and a search confirms) that the default for cgroups is to let the other groups use any spare resources in proportion to their weights; of course, the _number_ of parallel build jobs is a darker art than mere resource limits.
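Concretely, on a NixOS builder that proportional behaviour could look roughly like this (assumptions: builds run as children of `nix-daemon.service` under systemd-managed cgroups; the numbers are arbitrary):

```nix
{
  # Sketch: lower the CPU/IO weight of the whole build cgroup. Idle
  # capacity is still used in full; the weight only matters under
  # contention, which is the proportional sharing described above.
  systemd.services.nix-daemon.serviceConfig = {
    CPUWeight = 20; # systemd's default weight is 100
    IOWeight = 20;
  };
}
```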
The cgroups solution can be set up such that a cgroup may only use dedicated cores of the machine (this probably requires tight integration with Nix – didn't @Mic92 talk about that at NixCon?). Slurm, for example, does this very well for HPC jobs to avoid resource conflicts (CPU + memory) between jobs on the same machine. It would also constrain processes that ignore max-cores and try to use all available cores; some test routines compiled against OpenMP, for example, are somewhat ill-behaved in that way.
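Without per-build integration, a crude version of that core pinning could be sketched at the service level (assumptions: a systemd recent enough to expose the cpuset controller via `AllowedCPUs=`, a hypothetical 64-thread machine, and confining everything under nix-daemon being acceptable):

```nix
{
  # Sketch: confine all builds to cores 0-47, keeping 48-63 free for
  # the rest of the system – roughly what a Slurm cpuset does, but for
  # the whole build cgroup instead of per job.
  systemd.services.nix-daemon.serviceConfig.AllowedCPUs = "0-47";
}
```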
Another potential cause of these timeouts could be memory over-utilization. A Linux system can become unresponsive for quite some time until the OOM killer resolves the problem. Have OOM events been observed on the build machines?
What about using a setup such that `max-jobs * cores = N * numberOfCores`, where N is between 1 and 2? This would avoid massive over-utilization and potentially make the system more reliable. It might lead to under-utilization in some circumstances, but isn't that better than tests that fail randomly? In the end such a setup might even be faster overall than one using `max-jobs * cores = numberOfCores^2`.
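As a concrete illustration (all numbers hypothetical, for a 64-thread builder with N = 2), the NixOS options would be set roughly like this:

```nix
{
  # Sketch: 8 jobs x 16 cores each = 128 build processes in the worst
  # case, i.e. N = 2 in max-jobs * cores = N * numberOfCores, instead
  # of the potential 64 * 64 when everything is left at "use all".
  nix.maxJobs = 8;      # maps to max-jobs in nix.conf
  nix.buildCores = 16;  # maps to cores / $NIX_BUILD_CORES
}
```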
A Linux system can become unresponsive for quite some time until the OOM killer resolves the problem.
Actually, I've noticed that the usual Linux config behaves extremely badly for me – with /tmp as a tmpfs sized at 50% of RAM, moderately filled, no swap, and overall RAM filled by nix builder jobs, I get a complete lock-up for dozens of minutes, i.e. not a single X screen redraw or any kind of response (my patience usually runs out much earlier; I've never seen it recover). EDIT: the builders have lowered CPU and I/O priorities, of course.
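Not a fix for the underlying thrashing, but as a mitigation sketch on a NixOS builder (assuming the `services.earlyoom` module is available; the 5% threshold is just an example): a userspace OOM killer can react before the kernel one wedges the machine.

```nix
{
  # Sketch: kill the largest offender once free memory drops below 5%,
  # instead of waiting for the kernel OOM killer while the box is
  # already unresponsive.
  services.earlyoom.enable = true;
  services.earlyoom.freeMemThreshold = 5;
}
```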
I had a monologue yesterday about the current investigation into the timing-out tests:
05:37 <samueldr> on packet-epyc-1, on a small non-random sample, when tests pass connecting takes in the ballpark of 80~120 seconds
05:38 <samueldr> so it probably wasn't an issue of being close to the alarm time
05:39 <samueldr> it probably is something that once it happens, it's guaranteed not to connect... now what is that "something"?
05:40 <samueldr> I'm a bit confused, on successful builds only some logs tell on which machine it was built
05:43 <samueldr> ah, somehow depending on the link, it's either https://hydra.nixos.org/build/87199167/log or https://hydra.nixos.org/build/87199167/nixlog/23; a cached build points to nixlog/{step} while a non-cached build to /log, nixlog/{step} shows the machine name
05:43 <samueldr> so a non-cached build, going to the "build steps" tab will help knowing on which machine it was built
05:46 <samueldr> oh, one diff between dmesg output I found is that on a successful build there is "console [hvc1] enabled"; which is probably the required console to work `console=hvc1`
05:46 <samueldr> so first thing to check probably is what would cause some racing with hvc1
05:47 <samueldr> up until that line, and even after that line, the dmesg logs are pretty much identical, except for some lines swapping places in (probably) innocuous places and timings
05:49 <samueldr> not all failures are hvc1 failing to load, but those where the alarm triggered seem to be
05:49 <samueldr> (multi machines test can have one machine seemingly fail to start hvc1, while the other loaded)
That is, after adding timing logging and doubling the connection timeout, tests still time out while connecting; according to my investigation, it looks like the serial backdoor into our test VMs sometimes won't start. This isn't deterministic: in a test with multiple VMs or reboots, it may only happen on a later boot.
@srhb had a hunch, based on the commit timing (https://github.com/NixOS/nixpkgs/pull/49441#issuecomment-434210942), that it might be the dev backdoor: https://github.com/NixOS/nixpkgs/pull/47418.
Might it be worth reverting on either master or release and seeing if that fixes the issue? (ping @domenkozar, who wrote the dev backdoor)
The thing is, AFAIUI the issue would still be present; disabling the use of that serial console for tests would merely hide it. There's probably something in the kernel or in our boot process that races against the serial console being activated. Hopefully only that console is affected, but maybe there's a more general issue here? (Those are assumptions – no hard facts other than hvc1 not showing up.)
I'm fine reverting on master if we'd like to see if this fixes the issue.
Was that backdoor mainly for developer access or are there other benefits?
Could the backdoor be kept, but neither as the default console nor as the console used by the VM tests? Could it be kept behind an option and/or an attribute, so that e.g. `nix-build nixos/release.nix -A tests.login.withBackdoor.x86_64-linux` would point to the backdoored variant?
Anything wrong with these statements?
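Purely as a hypothetical sketch of what "behind an option" could look like (the option name is made up; the actual hvc1 bits would have to be re-applied from the reverted PR #47418):

```nix
# Hypothetical module shape only – not how the test instrumentation is
# currently structured.
{ config, lib, ... }: {
  options.testing.serialBackdoor = lib.mkEnableOption
    "an interactive root shell on the hvc1 serial console of test VMs";
  config = lib.mkIf config.testing.serialBackdoor {
    # re-apply the backdoor service + console=hvc1 setup from PR #47418 here
  };
}
```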
If no one actually restarted this job, then the revert does look promising. Full green build: https://hydra.nixos.org/build/87682009#tabs-constituents
I restarted three tests in this set, but that's a fraction of what's been required lately. _Of course, I could've missed someone else restarting as well._
Was that backdoor mainly for developer access or are there other benefits?
For developers to be able to debug VM tests. I'm fine with making it an option until someone finds time to debug what the race is.
@vcunat what were the failure modes for the three tests?
Seems like reverting the serial backdoor did it.
This was backported to 18.09.
@hedning does it sound "fixed" enough for the current issue?
We might want to create a tracking issue to re-add the serial backdoor as an optional feature.
I didn't save (nor remember) the details of the errors. I tried them locally (all, I think) and after success restarted them on Hydra.
does it sound "fixed" enough for the current issue?
Probably – it does seem like the 2-hour stuck failures are gone, at least. I'm guessing this is a «legit» timeout: https://hydra.nixos.org/build/87690500?
boot-after-install#
boot-after-install#
boot-after-install# Booting from Hard Disk...
boot-after-install# GRUB loading.
boot-after-install# Welcome to GRUB!
boot-after-install#
(600.00 seconds)
(600.00 seconds)
error: timed out waiting for the VM to connect
Either GRUB hung, or getting there took 600 seconds. So yeah, there could be something else going on, as another, different test had a similar failure. Maybe it's related to the serial backdoor failure, but it probably isn't.
Did the backdoor revert help?
Tests on Hydra have seemed quite reliable since then. I _suppose_ this revert was the main contribution to that (hard to be 100% sure on my side). In any case, I see no reason to keep this open anymore :tada:
I wish PR https://github.com/NixOS/nixpkgs/pull/54330/files weren't lost in space. Maybe the NixOS test documentation could mention that if you want to debug tests from the inside, you can apply the custom patch from that PR.
Sadly it doesn't work for me anymore.