hydra-eval-jobs returned signal 134:
which most likely means the out-of-memory killer (I think). The success rate had already been rather poor for several days, but now we seem to be completely stuck (many attempts without a single success).
Among other issues, it blocks security updates from getting to the nixos-unstable channel.
In the metrics I can't see any recent anomaly in memory consumption, so I expect it's either something in the NixOS parts (the OS, tests, etc.)... or we just very slowly crept over the limit.
I have pinged @rbvermaa and @edolstra.
For the record, it's been weeks since it started happening with regularity.
Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS
I've been haphazardly requeuing the eval a bunch lately because of that. It eventually finishes evaluating, but (anecdotally) it seems to be getting harder for it to pass.
And for the record, OOM will not be 134, but signal 9, AFAICT. 134 is Boehm GC failing.
And... well, sorry for adding again: this, or a linked issue, has been stopping us from adding aarch64-linux to supported systems. Figuring out a solution to this is likely going to unblock us from doing it.
Oh, right, I miscalculated... 134 is probably abort() (e.g. via assert).
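For anyone following the arithmetic: a minimal sketch of how a shell-style exit status above 128 maps back to a signal number (status minus 128). It assumes the 134 reported for hydra-eval-jobs follows that 128+N convention, which this thread does not confirm; `describe_exit_status` is a hypothetical helper, not part of Hydra or Nix.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (assumption: 128 + N means signal N)."""
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name for this signal number on the current platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by {name} (signal {signum})"
    return f"exited with code {status}"


if __name__ == "__main__":
    # 134 - 128 = 6, which is SIGABRT on Linux (raised by abort()).
    print(describe_exit_status(134))
    # An OOM kill would be SIGKILL, signal 9, i.e. status 137 under this convention.
    print(describe_exit_status(137))
```

Under this reading, 134 points at abort() rather than at the kernel OOM killer, which fits the "Too many heap sections" message quoted above.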
We're going to get a new machine for Hydra.
We should be able to better evaluate then.
I hope that will really last us for years.
I'm not sure what the timeline is, but in the meantime... remove aarch64 from this jobset? (The trunk jobset will still provide us with binaries and regression visibility for the larger parts.)
@grahamc any idea when that is approximately going to happen?
We've ordered the server. Hopefully we'll get it today.
I hope that will really last us for years.
Years, or the addition of armv6l, riscv, and power9 :wink:
The server took a while to get provisioned, but we now have access to it.
We asked for it to be provisioned very nearby to the current hydra server, to:
1) make transferring the state faster
2) reduce latency from hydra to the postgres database
We suspect this special request made it take longer than usual.
... but hydra.nixos.org hasn't been migrated so far, right? (To be clear, I don't mean that as critique or anything.)
Correct. We only just took control of the chassis, and we'll begin provisioning today. Probably after CEST work hours :)
We started transferring GC roots and derivations over to the new server several hours ago.
The new server is up and running.