Nixpkgs: Hydra: nixos/release-20.03 and unstable fails to evaluate

Created on 12 Feb 2020 · 45 Comments · Source: NixOS/nixpkgs

Describe the bug
The nixos/release-20.03 jobset fails to evaluate:

hydra-eval-jobs returned signal 9:
(no output)

I've tried several times to trigger an evaluation, yet every time it fails.

cc @disasm @worldofpeace @grahamc @vcunat

bug regression blocker channel blocker

All 45 comments

The last few weeks felt like we're slowly making the big nixos eval too expensive (again). Maybe not just nixos, as I've also seen an increase in out-of-memory failures in jobs like tarball, but perhaps it was just a feeling, as I see no significant increase in these graphs: https://hydra.nixos.org/job/nixpkgs/trunk/metrics#tabs-charts

cc @disassembler

Noticed this as well, can't open ZHF until there's an eval on the jobset.

Eelco's thinking is that the growth of NixOS tests is causing memory pressure problems. Each VM in the tests adds a few hundred MB of RAM consumption for hydra's evaluator.

It feels bad to be "within 5 tests" of being unable to move forward. :(

To clarify, @edolstra's short-term suggestion is to remove some of the tests. For example, those key map tests were commented out for a very long time. It would be sad to drop them again, but it may be the best short-term solution. Long-term, there is a branch for a more precise GC, and possibly some optimisation work that could be done in how NixOS is evaluated, but I don't know if either of these longer-term things is possible _today_.

That said, I'm 100% not the right person for this problem, and possibly @LnL7, @samueldr, @fpletz, or @Ma27 have advice on how to tune hydra's evaluator.

@grahamc does it just run out of memory, or why does it fail to evaluate?

I've been tracking the stdenv requisite size for quite a while now, for different reasons. It could be totally unrelated, but that size had a rather large jump recently.

linux-requisites

Update: this was between fa7445532900f2555435076c1e7dce0684daa01a and d453c2f5d8e211a6f45f59f67da8a22625627b86. Most likely the libidn2 change, at first glance.

@LnL7 could we bisect around that?

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-20-03-feature-freeze/5655/32

Does anybody know why this only occurs for 20.03 and not trunk-combined? Evaluation for those should be equivalent (except for stableBranch, which is/should be purely metadata).

I've seen killed trunk-combined tasks earlier today while trying to trigger an eval.

@flokli 447edaa32fcee706be24db4389f4759fad68a785 looks like python (and not a minimal build) was introduced in the stdenv. A minimal python would bring it down from ~270 to ~240.

If that's the case, we might just want to remove that reference. I don't really see a reason why python should become part of glibc's runtime closure.

I'm not sure how the closure sizes are relevant to this thread, but I can't see a significant increase of (runtime) closure size for stdenv output path on x86_64-linux (and python is not there).

stdenv size didn't change much:

[13:37:37] jon@jon-workstation ~/projects/nixpkgs (master)
$ nix path-info -Sh ./result
/nix/store/5gc1hyqbxwfwcw7l1bs7gy6rw9zbnc09-stdenv-linux     231.6M
[13:39:35] jon@jon-workstation ~/projects/nixpkgs (release-19.09)
$ nix path-info -Sh ./result
/nix/store/qghrkvk86f9llfkcr1bxsypqbw1a4qmw-stdenv-linux     224.4M

and python is not in the runtime closure:

[13:40:47] jon@jon-workstation ~/projects/nixpkgs (master)
$ nix-store -q --tree ./result | grep python
[13:40:58] jon@jon-workstation ~/projects/nixpkgs (master)

@vcunat It could be something totally different, but given that nixos instances will evaluate pkgs multiple times it's something that increases evaluation for _each_ test.

I did notice that the hydra jobsets for "trunk" now take over 100 seconds to evaluate, whereas they used to be significantly faster when I first started viewing hydra >6 months ago.

The evaluator dies with hydra-eval-jobs returned signal 9 but also random builds fail with 9. Would the evaluator kill remote jobs when it runs out of memory? Or could those be builds that happen to run on the evaluator?
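
For context, "returned signal 9" means the process was terminated by SIGKILL. On Linux that is the signal the kernel's out-of-memory killer sends, although anything with sufficient privileges can send it too, and the victim gets no chance to print anything. A minimal Python sketch of how a parent process observes this; it is not Hydra's code, and the sleeping child is only a stand-in for hydra-eval-jobs:

```python
import signal
import subprocess
import sys

# Stand-in child process; in the thread the victim was hydra-eval-jobs.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Simulate what an OOM kill looks like from outside: deliver SIGKILL (signal 9).
child.send_signal(signal.SIGKILL)
returncode = child.wait()

# On POSIX, subprocess reports death-by-signal as the negated signal number.
killed_by = -returncode if returncode < 0 else None
print(f"child terminated by signal {killed_by}")
```

The signal number alone does not say who sent it, so an OOM kill of the evaluator and a killed build on another machine would look the same in the logs.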

No, I believe there are no such connections.

Ouch, having glibc depend on python is really unfortunate.

It was an upstream decision to use python in the build process (build-time only dependency). I don't think we can do much about that. EDIT: using some minimal python could be nice, though.

@vcunat you could probably switch that occurrence to python3Minimal, introduced in https://github.com/NixOS/nixpkgs/pull/66762, which should have a smaller build and runtime closure, as long as you don't rely on things like libreadline or ssl support.

OK, I submitted #80112, but I still can't see how it's relevant to this thread.

Based on the gc stats from nix, the memory needed to evaluate e.g. hello increased from 26 MB to 29 MB with the glibc update (this has now doubled compared to 18.03, btw). This indeed isn't a big deal by itself, since it's a flat cost per architecture. However, that's not the case for nixos instances, since each test imports its own instance of nixpkgs.

I can't evaluate everything on my machine with the current settings, but evaluating just the tests seems to use between 600 MB and 1.5 GB more than it does with that commit reverted. With the way evaluation currently works, that's a problem if this bumps up the memory usage enough to require a larger heap.
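
To make the scaling argument concrete, here is a back-of-envelope sketch in Python. The 26 MB and 29 MB figures come from the comment above. The import counts are hypothetical, and the assumption that every NixOS instance pays the full per-import increase is a simplification, not something measured in this thread.

```python
MB = 1024 * 1024

# Figures quoted above: gc stats for evaluating `hello` before/after the glibc update.
before_bytes = 26 * MB
after_bytes = 29 * MB
increase_per_import = after_bytes - before_bytes  # 3 MB per nixpkgs import

def extra_memory(nixpkgs_imports: int) -> int:
    """Extra bytes allocated if each import pays the full increase (simplifying assumption)."""
    return nixpkgs_imports * increase_per_import

# Hypothetical import counts, chosen only to illustrate the scaling.
for imports in (1, 200, 500):
    print(f"{imports:>4} imports -> +{extra_memory(imports) / MB:.0f} MB")
```

Under that assumption, a single import per architecture costs only 3 MB, while a few hundred separate imports would already be in the same range as the 600 MB to 1.5 GB difference reported above.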

I don't know how much memory the hydra evaluator has available, but with GC_INITIAL_HEAP_SIZE=20G both 20.03 and older releases evaluated without issues. The larger heap size does result in higher average memory usage, however, which might be a problem for concurrent evaluations.

If I look correctly, using python3Minimal recovers only a small fraction of this increase.

Yeah, I'm not sure there's a good solution for this other than trying to reduce the memory "enough" without more fundamental changes.

I took a quick look at the evaluation for tests. This probably isn't the right place to make the change, and I think it would break tests that use overlays as well as multiple architectures, but something similar might work to reduce the overhead for tests quite significantly.

diff --git a/nixos/lib/build-vms.nix b/nixos/lib/build-vms.nix
index 1bad63b9194..8da2504bea9 100644
--- a/nixos/lib/build-vms.nix
+++ b/nixos/lib/build-vms.nix
@@ -36,6 +36,7 @@ rec {
       baseModules =  (import ../modules/module-list.nix) ++
         [ ../modules/virtualisation/qemu-vm.nix
           ../modules/testing/test-instrumentation.nix # !!! should only get added for automated test runs
+          { key = "nixpkgs-pkgs"; nixpkgs.pkgs = pkgs; }
           { key = "no-manual"; documentation.nixos.enable = false; }
           { key = "qemu"; system.build.qemu = qemu; }
           { key = "nodes"; _module.args.nodes = nodes; }

> However that's not the case for nixos instances, since each test imports its own instance of nixpkgs.

My reading of _that_ part is that pkgs is passed through and not re-imported.

The idea for VM tests seems intriguing. Overlays appear considered at a quick glance.

I tried your patch with evaluation of just a pair of tests at once, and it decreased gc.totalBytes by ~22%
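
A comparison like this can be scripted. Nix can emit evaluation statistics as JSON when NIX_SHOW_STATS=1 is set, and the gc object in that output includes the totalBytes counter mentioned here. The Python sketch below computes the relative change between two such dumps; the numbers are made up for illustration and are not the measurements from this thread.

```python
import json

def percent_change(before: dict, after: dict, key: str = "totalBytes") -> float:
    """Relative change of a gc counter between two stats dumps, in percent."""
    old = before["gc"][key]
    new = after["gc"][key]
    return (new - old) / old * 100.0

# Illustrative values only; real dumps would be read from files with json.load().
baseline = json.loads('{"gc": {"totalBytes": 10000000000}}')
patched = json.loads('{"gc": {"totalBytes": 7800000000}}')

print(f"gc.totalBytes changed by {percent_change(baseline, patched):+.1f}%")
```

Note that totalBytes counts everything allocated over the whole evaluation, so a drop there does not translate one-to-one into a lower peak heap size.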

Yeah, I linked the wrong thing.

> Overlays appear considered at a quick glance.

That looks promising. Threading through the pkgs for the correct system to buildVM, instead of just pkgs (which is always x86_64-linux), might be an option then. I won't have time to look into this further for a few days, however.

I don't know these parts of code well, but I looked around and I still can't see any problem with that patch. I tried on Hydra, but it's still getting killed: https://hydra.nixos.org/jobset/nixos/nixos-test-expensive-eval (2/2 eval attempts killed)

When I restricted it to just x86_64-linux, it succeeded on the second attempt. I'm hopeful we can use this approach for now. Note that 20.03 was also created just for x86_64-linux and couldn't get an evaluation even after cutting some tests in ceb90b08e... at least until a while ago (not sure what's changed).

Therefore I still expect that the patch helped significantly; I'd still check the diff in test failures before using it for real.

It looks like Eelco has whipped up a miracle and got evaluations passing, and in less time too.

Bought a better server? :-) In any case, it will be nice to know how he managed it, as it's a never-ending problem. EDIT: I suspect it was some kind of cheating, as we no longer have the aggregate tested job, neither in trunk-combined nor in release-20.03.

For long-term solutions of RAM consumption I have high hopes for https://github.com/NixOS/hydra/issues/715

Evaluation 1570647 of jobset nixos:nixos-test-expensive-eval

This evaluation was performed on 2020-02-15 00:59:23. Fetching the dependencies took 3s and evaluation took 1109s

oof, 20mins for an eval. That's rough

That seems quite a normal number IIRC. (for our big jobsets like trunk-combined)

OK, let me ask explicitly about that miracle: how are channels going to work when we have no tested job anymore? Perhaps I just don't understand the intentions.

The tested job is back (it was never gone but it did have an evaluation error). We'll need to backport 2de3caf01109891cfc2645b0ad07ac36aedadd1e and 895042956f279ae8ebc9fd026664cea8198f71ec to the 20.03 branch.

Great :heart: I pushed 20.03 backports.

I believe the issue is fixed and shouldn't re-appear anytime soon. Possible TODOs:

  • [ ] backport to 19.09. It probably will keep evaluating without that, but we could have it cheaper (for the several remaining months). It surely doesn't apply cleanly, but it should be a mechanical change.
  • [ ] still consider the approach from LnL; perhaps we can get even better performance thanks to that.

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-20-03-beta/5935/1

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/firefox-not-up-to-date/5941/2

No good, even the small channels are blocked now: https://github.com/NixOS/hydra/issues/715#issuecomment-587693274

Resolved and today all channels even got updated.

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-20-03-beta/5935/7
