Nixpkgs: Hydra can't evaluate nixos:trunk-combined

Created on 16 Jul 2019 · 16 Comments · Source: NixOS/nixpkgs

The evaluation fails with:

hydra-eval-jobs returned signal 134

which most likely means the out-of-memory killer (I think). The success ratio was already rather bad several days ago, but now we seem to be completely stuck (many attempts without a single success).

Among other issues, it blocks security updates from getting to the nixos-unstable channel.

blocker security

All 16 comments

In metrics I can't see any recent anomaly around memory consumption, so I expect it's either something around NixOS stuff (like OS, tests, etc.)... or we just very slowly got over the limit.

I have pinged @rbvermaa and @edolstra.

For the record, it's been weeks since it started happening with regularity.

Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS

I've been haphazardly requeuing the eval a bunch lately because of that. It eventually finishes evaluating, but (anecdotally) it seems to be getting harder for it to pass.

And for the record, OOM will not be 134, but signal 9, AFAICT. 134 is the Boehm GC failing.

And... well, sorry for adding again, this, or a linked issue, has been stopping us from adding aarch64-linux to supported systems. Figuring a solution to this is likely going to unblock us from doing it.

Oh, right, I miscalculated... 134 is probably abort() (e.g. via assert).
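The arithmetic behind that correction can be sketched as follows. This is a minimal illustration, assuming the reported 134 is a shell-style exit status, where a value above 128 means "terminated by signal (status - 128)". The helper name `describe_exit_status` is hypothetical and not part of Hydra or Nix.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper).

    Assumption: values above 128 encode "killed by signal (status - 128)",
    as POSIX shells report them.
    """
    if status > 128:
        signum = status - 128
        try:
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return f"exited normally with code {status}"


# 134 - 128 = 6, which is SIGABRT (abort(), e.g. from a failed assert
# or from the garbage collector giving up).
print(describe_exit_status(134))

# An OOM kill delivers SIGKILL (signal 9), which would show up as 137.
print(describe_exit_status(137))
```

Under that assumption, 134 points at abort() rather than the kernel OOM killer, which matches the "Too many heap sections" message quoted earlier in the thread.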

We're going to get a new machine for Hydra.

  1. The new one will run all of the things the current one does, minus postgres
  2. The old one will continue running postgres, so it'll be able to use up all the system's RAM for cache
  3. The old one has 32G RAM
  4. The new one has 64G of RAM

We should be able to better evaluate then.

I hope that will really last us for years.

I'm not sure what the time plan is, but in the meantime... remove aarch64 from this jobset? (The trunk jobset will still provide us with binaries and regression visibility for the larger parts.)

@grahamc any idea when that is approximately going to happen?

We've ordered the server. Hopefully we'll get it today.

> I hope that will really last us for years.

Years, or the addition of armv6l, riscv, and power9 :wink:

The server took a while to get provisioned, but we now have access to it.

We asked for it to be provisioned very nearby to the current hydra server, to:

1) make transferring the state faster
2) reduce latency from hydra to the postgres database

We suspect this special request made it take longer than usual.

... but hydra.nixos.org hasn't been migrated so far, right? (To be clear, I don't mean that as critique or anything.)

Correct. We only just took control of the chassis, and we'll begin provisioning today. Probably after CEST work hours :)

We started transferring GC roots and derivations over to the new server several hours ago.

The new server is up and running.
