There's been discussion around this issue in various places, most notably https://github.com/NixOS/nixpkgs/issues/49384, but I thought it deserved its own issue.
To quote @vcunat and summarize the problem operationally:
Without manual test restarts we probably can't keep up reasonable channel update frequency. On unstable it's common to have other regressions, too.
The installer tests seem to be affected the most, but various other tests also time out.
Quoting @vcunat again for a likely cause:
Lately I relatively often see transient nixos test errors from packet-epyc-1; I suspect it may get overloaded occasionally.
I can only guess why that happens. One possibility: some derivations don't use `make -l$NIX_BUILD_CORES` or similar – I find that very annoying locally, as I have a cheap 16-threaded desktop and 16*16 is way too much (`--cores` + `--max-jobs`).
That would explain why we're suddenly seeing this so much with the new EPYC machine (which has a lot of cores).
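For reference, a rough sketch of the kind of per-package workaround being described (assumptions: the offending package uses GNU make and the standard nixpkgs build hooks; `somePackage` is just a placeholder for whichever derivation misbehaves):

```nix
# Sketch only: tell make to stop spawning new jobs once the load average
# reaches the core count Nix allots to this build, so several concurrent
# builds don't each run $NIX_BUILD_CORES jobs on top of one another.
somePackage.overrideAttrs (old: {
  preBuild = (old.preBuild or "") + ''
    makeFlagsArray+=(-l "$NIX_BUILD_CORES")
  '';
})
```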
There have been attempts to increase the timeout: https://github.com/NixOS/nixpkgs/pull/49441 (reverted) and #53828, without any immediate success. https://github.com/NixOS/nixpkgs/pull/53827 added time logging, which could be useful for debugging.
An ad-hoc fix would be to automate the manual restarts (e.g. if a tested job fails, restart it once), though that's obviously not ideal.
Someone has suggested using cgroups to limit the insides of build sandboxes, but I'm a bit at a loss about how to balance this so that the bigger machines don't get under-utilized. Example: if only one build on a machine can use high parallelism at a given moment, we do want it to use lots of cores; it all depends on the overall load.
Someone has suggested using cgroups to limit the insides of build sandboxes, but I'm a bit at a loss about how to balance this so that the bigger machines don't get under-utilized.
I always thought (and a search confirms) that the default for cgroups is to let the other groups use any spare resources in proportion to their weights; of course, the _number_ of parallel build jobs is a darker art than mere resource limits.
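Concretely, on a NixOS builder that proportional behaviour could look roughly like this (assumptions: builds run as children of `nix-daemon.service` under systemd-managed cgroups; the numbers are arbitrary):

```nix
{
  # Sketch: lower the CPU/IO weight of the whole build cgroup. Idle
  # capacity is still used in full; the weight only matters under
  # contention, which is the proportional sharing described above.
  systemd.services.nix-daemon.serviceConfig = {
    CPUWeight = 20; # systemd's default weight is 100
    IOWeight = 20;
  };
}
```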
The cgroups solution can be set up such that a cgroup may only use dedicated cores of the machine (this probably requires tight integration with Nix – didn't @Mic92 talk about that at NixCon?). Slurm, for example, does this very well for HPC jobs to avoid resource conflicts (CPU + memory) between jobs on the same machine. It would also constrain processes that ignore max-cores and try to use all available cores; some test routines compiled against OpenMP, for example, are somewhat ill-behaved in that way.
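Without per-build integration, a crude version of that core pinning could be sketched at the service level (assumptions: a systemd recent enough to expose the cpuset controller via `AllowedCPUs=`, a hypothetical 64-thread machine, and confining everything under nix-daemon being acceptable):

```nix
{
  # Sketch: confine all builds to cores 0-47, keeping 48-63 free for
  # the rest of the system – roughly what a Slurm cpuset does, but for
  # the whole build cgroup instead of per job.
  systemd.services.nix-daemon.serviceConfig.AllowedCPUs = "0-47";
}
```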
Another potential cause of these timeouts could be memory over-utilization. A Linux system can become unresponsive for quite some time until the OOM killer resolves the problem. Have OOM events been observed on the build machines?
What about using a setup such that `max-jobs * cores = N * numberOfCores`, where N is between 1 and 2? This would avoid massive over-utilization and potentially make the system more reliable. It might lead to under-utilization in some circumstances, but isn't that better than tests that fail randomly? In the end such a setup might even be faster overall than one using `max-jobs * cores = numberOfCores^2`.
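As a concrete illustration (all numbers hypothetical, for a 64-thread builder with N = 2), the NixOS options would be set roughly like this:

```nix
{
  # Sketch: 8 jobs x 16 cores each = 128 build processes in the worst
  # case, i.e. N = 2 in max-jobs * cores = N * numberOfCores, instead
  # of the potential 64 * 64 when everything is left at "use all".
  nix.maxJobs = 8;      # maps to max-jobs in nix.conf
  nix.buildCores = 16;  # maps to cores / $NIX_BUILD_CORES
}
```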
A Linux system can become unresponsive for quite some time until the OOM killer resolves the problem.
Actually, I've noticed that the usual Linux config behaves extremely badly for me – with /tmp as a tmpfs sized at 50% of RAM, moderately filled, no swap, and overall RAM filled by nix builder jobs, I get a complete lock-up for dozens of minutes, i.e. not a single X screen redraw or any kind of response (my patience usually runs out much earlier; I've never seen it recover). EDIT: the builders have lowered CPU and I/O priorities, of course.
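Not a fix for the underlying thrashing, but as a mitigation sketch on a NixOS builder (assuming the `services.earlyoom` module is available; the 5% threshold is just an example): a userspace OOM killer can react before the kernel one wedges the machine.

```nix
{
  # Sketch: kill the largest offender once free memory drops below 5%,
  # instead of waiting for the kernel OOM killer while the box is
  # already unresponsive.
  services.earlyoom.enable = true;
  services.earlyoom.freeMemThreshold = 5;
}
```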
I had a monologue yesterday about the current investigation into the timing-out tests:
05:37 <samueldr> on packet-epyc-1, on a small non-random sample, when tests pass connecting takes in the ballpark of 80~120 seconds
05:38 <samueldr> so it probably wasn't an issue of being close to the alarm time
05:39 <samueldr> it probably is something that once it happens, it's guaranteed not to connect... now what is that "something"?
05:40 <samueldr> I'm a bit confused, on successful builds only some logs tell on which machine it was built
05:43 <samueldr> ah, somehow depending on the link, it's either https://hydra.nixos.org/build/87199167/log or https://hydra.nixos.org/build/87199167/nixlog/23; a cached build points to nixlog/{step} while a non-cached build to /log, nixlog/{step} shows the machine name
05:43 <samueldr> so a non-cached build, going to the "build steps" tab will help knowing on which machine it was built
05:46 <samueldr> oh, one diff between dmesg output I found is that on a successful build there is "console [hvc1] enabled"; which is probably the required console to work `console=hvc1`
05:46 <samueldr> so first thing to check probably is what would cause some racing with hvc1
05:47 <samueldr> up until that line, and even after that line, the dmesg logs are pretty much identical, except for some lines swapping places in (probably) innocuous places and timings
05:49 <samueldr> not all failures are hvc1 failing to load, but those where the alarm triggered seem to be
05:49 <samueldr> (multi machines test can have one machine seemingly fail to start hvc1, while the other loaded)
That is, after adding timing logging and doubling the connection timeout, tests still time out while connecting; according to my investigation, it looks like the serial backdoor into our test VMs sometimes won't start. This isn't deterministic: in a test with multiple VMs or reboots, it may only happen on a later boot.
@srhb had a hunch, based on the commit timing (https://github.com/NixOS/nixpkgs/pull/49441#issuecomment-434210942), that it might be the dev backdoor: https://github.com/NixOS/nixpkgs/pull/47418.
Might it be worth reverting on either master or release and seeing if that fixes the issue? (ping @domenkozar, who wrote the dev backdoor)
The thing is, AFAIUI the issue would still be present; disabling the use of that serial console for tests would merely hide it. There's probably something in the kernel or in our boot process that races against the serial console being activated. Hopefully only that console is affected, but maybe there's a more general issue here? (Those are assumptions – no hard facts other than hvc1 not showing up.)
I'm fine reverting on master if we'd like to see if this fixes the issue.
Was that backdoor mainly for developer access or are there other benefits?
Could the backdoor be kept, but neither as the default console nor as the console used by the VM tests? Could it be kept behind an option and/or an attribute, so that e.g. `nix-build nixos/release.nix -A tests.login.withBackdoor.x86_64-linux` would point to the backdoored variant?
Anything wrong with these statements?
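Purely as a hypothetical sketch of what "behind an option" could look like (the option name is made up; the actual hvc1 bits would have to be re-applied from the reverted PR #47418):

```nix
# Hypothetical module shape only – not how the test instrumentation is
# currently structured.
{ config, lib, ... }: {
  options.testing.serialBackdoor = lib.mkEnableOption
    "an interactive root shell on the hvc1 serial console of test VMs";
  config = lib.mkIf config.testing.serialBackdoor {
    # re-apply the backdoor service + console=hvc1 setup from PR #47418 here
  };
}
```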
If no one actually restarted this job, then the revert does look promising. Full green build: https://hydra.nixos.org/build/87682009#tabs-constituents
I restarted three tests in this set, but that's a fraction of what's been required lately. _Of course, I could've missed someone else restarting as well._
Was that backdoor mainly for developer access or are there other benefits?
For developers to be able to debug VM tests. I'm fine with making it an option until someone finds time to debug what the race is.
@vcunat what were the failure modes for the three tests?
Seems like reverting the serial backdoor did it.
This was backported to 18.09.
@hedning does it sound "fixed" enough for the current issue?
We might want to create a tracking issue to re-add the serial backdoor as an optional feature.
I didn't save (nor remember) the details of the errors. I tried them locally (all, I think) and after success restarted them on Hydra.
does it sound "fixed" enough for the current issue?
Probably – it does seem like the 2-hour stuck failures are gone, at least. I'm guessing this is a «legit» timeout: https://hydra.nixos.org/build/87690500?
boot-after-install#
boot-after-install#
boot-after-install# Booting from Hard Disk...
boot-after-install# GRUB loading.
boot-after-install# Welcome to GRUB!
boot-after-install#
(600.00 seconds)
(600.00 seconds)
error: timed out waiting for the VM to connect
Either GRUB hung, or getting there took 600 seconds. So yeah, there could be something else going on, as another, different test had a similar failure. Maybe it's related to the serial backdoor failure, but it probably isn't.
Did the backdoor revert help?
Tests on Hydra have seemed quite reliable since then. I _suppose_ this revert was the main contribution to that (hard to be 100% sure on my side). In any case, I see no reason to keep this open anymore :tada:
I wish PR https://github.com/NixOS/nixpkgs/pull/54330/files weren't lost in space. Maybe the NixOS test documentation could mention that if you want to debug tests from the inside, you can apply the custom patch from that PR.
Sadly it doesn't work for me anymore.